The Agent Stack Is Being Rewritten

Orchestration, skills, and security — the year agent research grew up.

May 2025 – May 2026 | 37 papers surveyed

A year ago, if you wanted to build an AI agent, you picked a framework: LangGraph, CrewAI, AutoGen, Google ADK, OpenAI Agents SDK. These frameworks — collectively exceeding 290,000 GitHub stars — defined the architectural orthodoxy: an external orchestrator sits above the LLM, injecting instructions and routing decisions every turn. It felt obvious. It felt necessary.

Then the research caught up. Across the 37 papers surveyed from May 2025 to May 2026, the agent research community questioned the core assumptions behind the orchestration paradigm. The result is a field in productive upheaval. Weight-compilation approaches now challenge the external orchestrator model. The skill management pipeline has been analyzed end to end and its bottlenecks identified. And multi-agent systems have moved from toy demos to real coordination problems with shared resources and temporal dynamics.

Here’s what happened — and why the agent stack as you know it may not survive the next year intact.

The Skill Management Revolution

The year’s most significant agent research cluster concerned how agents acquire, manage, and improve skills at inference time. This isn’t about making agents that can write code or browse the web — that’s old news. It’s about making agents that can improve themselves systematically, without weight updates.

“SkillOpt: Executive Strategy for Self-Evolving Agent Skills” (Yang et al., May 2026) made the case that skill acquisition should be treated with the same rigor as weight-space optimization. Rather than ad-hoc self-revision or one-shot generation, SkillOpt treats the skill as trainable external state and applies a systematic optimization strategy over the text-space representation. The paper introduces three mechanisms: skill pruning (removing skills that hurt performance), skill merging (combining related skills), and skill scheduling (deciding when to use which).

This is a conceptual shift: skills are not knowledge that the agent happens to record — they are parameters in an external optimization loop. Just as you wouldn’t train a neural network without a learning rate schedule and gradient clipping, SkillOpt argues you shouldn’t manage agent skills without analogous mechanisms. The paper demonstrates that systematic skill optimization consistently outperforms ad-hoc skill accumulation, and the advantage grows with task diversity.
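
The three mechanisms named above (pruning, merging, scheduling) can be pictured as plain operations over a text-space skill library. The sketch below is only an illustration under stated assumptions, not SkillOpt's actual algorithm: the `Skill` record, the `gain` field (a measured score delta with and without the skill), the tag-based merge rule, and the top-`k` scheduler are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    text: str        # natural-language skill body kept outside the model
    tags: frozenset  # task types the skill is meant to help with
    gain: float      # assumed: measured score delta with vs. without the skill


def prune(skills, min_gain=0.0):
    """Drop skills whose measured gain is not above the threshold."""
    return [s for s in skills if s.gain > min_gain]


def merge(skills):
    """Combine skills that share exactly the same tags into one skill."""
    by_tags = {}
    for s in skills:
        if s.tags in by_tags:
            prev = by_tags[s.tags]
            by_tags[s.tags] = Skill(
                name=f"{prev.name}+{s.name}",
                text=f"{prev.text}\n{s.text}",
                tags=s.tags,
                # Placeholder: a real loop would re-measure the merged skill.
                gain=max(prev.gain, s.gain),
            )
        else:
            by_tags[s.tags] = s
    return list(by_tags.values())


def schedule(skills, task_tags, k=2):
    """Pick up to k skills relevant to the task, highest gain first."""
    relevant = [s for s in skills if s.tags & task_tags]
    return sorted(relevant, key=lambda s: s.gain, reverse=True)[:k]
```

Calling `schedule(merge(prune(library)), frozenset({"sql"}))` on a small library would return only the surviving, merged skills tagged for SQL tasks. The point of the sketch is the shape of the loop: every step reads or writes the external skill store, and none touches model weights.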

“From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills” (Huang et al., May 2026) provided the first comprehensive study spanning the full skill pipeline, from extraction to consumption. Its key finding is that model-generated skills are becoming viable at scale. The authors paid particular attention to domain-level skills (as opposed to general-purpose heuristics) and found that model-generated skills in specialized domains such as medicine, law, and engineering are reaching the quality threshold for practical deployment. The field now has systematic evidence rather than anecdotal success stories.

> Why this matters: SkillOpt provides the optimization framework for skills; From Raw Experience provides the empirical characterization. Together, they establish that skill management is a solvable optimization problem — not an irreducible challenge of agent design.

Compilation vs. Orchestration: The Year’s Most Provocative Challenge

“Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost” (Dennis et al., May 2026) is arguably the most provocative paper in the entire agent research corpus this year. The authors started with a simple observation: agent orchestration frameworks (LangGraph, CrewAI, etc.) all follow the same pattern, with an external orchestrator above the LLM injecting instructions and routing decisions every turn. This pattern has become so dominant that it’s rarely questioned.

Their finding: for procedural tasks, this architecture is dominated by simply providing the procedure in the system prompt of a frontier model, at two orders of magnitude less cost. The paper goes further, showing that the procedures can be compiled into the model’s weights, effectively internalizing the orchestration logic. The reported result is near-frontier quality on procedural tasks at roughly 1% of the compute cost of running an external orchestrator alongside a frontier model.

> Why this matters: If correct, this finding challenges the premise of the multi-billion-dollar agent framework ecosystem. It suggests that for a large class of tasks, specifically procedural tasks with well-defined workflows, external orchestration is pure overhead.

> Key insight: For strictly procedural tasks, weight compilation dominates external orchestration. But for dynamic tasks requiring tool selection and environmental interaction, external orchestration may remain necessary. The frontier is characterizing when each wins.
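
To make the architectural contrast concrete, here is a toy sketch of the two patterns for a fixed procedure. It assumes a stand-in `CountingModel` that only tallies calls and prompt words. The class, the four-step `PROCEDURE`, and both runner functions are hypothetical and invented for this illustration. It does not reproduce the paper's measurements or its weight-compilation step; it only shows why an orchestrator that calls the model once per step, re-sending a growing history, does more work than a single call that carries the whole procedure.

```python
from dataclasses import dataclass


@dataclass
class CountingModel:
    """Stand-in for an LLM client: records usage, returns a fixed reply."""
    calls: int = 0
    prompt_words: int = 0

    def complete(self, system: str, user: str) -> str:
        self.calls += 1
        self.prompt_words += len(system.split()) + len(user.split())
        return "ok"


# Hypothetical procedural workflow used only for this example.
PROCEDURE = [
    "Look up the order.",
    "Check the refund window.",
    "Issue the refund.",
    "Send a confirmation.",
]


def run_orchestrated(model: CountingModel, task: str) -> str:
    """External orchestrator: one call per step, history re-sent each turn."""
    history = task
    for step in PROCEDURE:
        reply = model.complete(system=f"Do only this step: {step}", user=history)
        history = f"{history} {reply}"
    return history


def run_inlined(model: CountingModel, task: str) -> str:
    """Procedure-in-prompt: the whole procedure is given once, in one call."""
    system = "Follow these steps in order: " + " ".join(PROCEDURE)
    return model.complete(system=system, user=task)


orchestrated, inlined = CountingModel(), CountingModel()
run_orchestrated(orchestrated, "Refund order 1234")
run_inlined(inlined, "Refund order 1234")
print(orchestrated.calls, inlined.calls)  # 4 calls versus 1 call
```

In this toy setup the orchestrated run also sends more total prompt text, because each turn repeats the accumulated history. Real cost ratios depend on the model, the workflow, and tool calls, none of which are modeled here.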

Tool Use and Multimodal Agents

“ETCHR: Editing To Clarify and Harness Reasoning” (Zhang et al., May 2026) — covered across multiple topics in this digest — is fundamentally an agent paper. It uses a dedicated image editing tool as an external module that the reasoning model calls when needed. The decoupled design (editing model + understanding model) is a tool-use architecture where the tool (image editing) is specialized and the orchestrator (understanding model) decides when to invoke it. This is a concrete example of the external orchestration approach that the compilation paper challenges — illustrating the productive tension between these paradigms.

“SPACENUM: Revisiting Spatial Numerical Understanding in VLMs” (Zhang et al., May 2026) examines a critical capability gap for embodied agents: can vision-language models genuinely ground numerical outputs (action magnitudes, spatial coordinates) in spatial perception, or are they generating statistically plausible numbers? The paper’s SpaceNum framework reveals the latter — a finding with direct implications for any agent operating in physical environments.

Multi-Agent Systems Get Real

“CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces” (Chandra, May 2026) addressed a specific but widely generalizable multi-agent coordination problem: multiple agents sharing a differential-privacy budget over a temporally evolving knowledge graph. The three-layer architecture (neural-ODE temporal decay, time-aware Shapley pricing, coordinated DP budget management) provides a unified treatment of temporal, economic, and privacy constraints.
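
One way to picture the shared-budget part of this problem is a ledger that every agent must draw from before it queries the data. The sketch below is a minimal illustration and not the CHRONOS architecture: it relies on basic sequential composition of differential privacy (the epsilons of successive releases on the same data add up), and it swaps the paper's neural-ODE temporal decay for a plain exponential half-life. `SharedPrivacyBudget`, `decayed_weight`, and the 30-day half-life are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SharedPrivacyBudget:
    """Ledger for a privacy budget shared by several agents.

    Assumes basic sequential composition: total privacy loss is bounded
    by the sum of the epsilons spent by all agents.
    """
    total_epsilon: float
    spent: dict = field(default_factory=dict)  # agent name -> epsilon used

    def remaining(self) -> float:
        return self.total_epsilon - sum(self.spent.values())

    def request(self, agent: str, epsilon: float) -> bool:
        """Grant the request only if it fits in what is left of the budget."""
        if epsilon <= 0 or epsilon > self.remaining() + 1e-12:
            return False
        self.spent[agent] = self.spent.get(agent, 0.0) + epsilon
        return True


def decayed_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential half-life decay, a simple stand-in for learned decay."""
    return 0.5 ** (age_days / half_life_days)


budget = SharedPrivacyBudget(total_epsilon=1.0)
print(budget.request("pricing_agent", 0.4))  # granted
print(budget.request("search_agent", 0.5))   # granted
print(budget.request("pricing_agent", 0.2))  # refused: only about 0.1 remains
```

A coordinator could use something like `decayed_weight` to discount older facts in the graph when deciding which queries deserve a share of the remaining budget, though how CHRONOS couples decay, pricing, and budget allocation is specified in the paper, not here.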

> Why this matters beyond data marketplaces: The coordination mechanisms (shared resource allocation, temporal awareness, incentive-compatible pricing) apply to any multi-agent system with shared resources. As multi-agent deployments grow (autonomous fleets, distributed sensor networks, collaborative AI systems), these patterns become essential infrastructure rather than academic curiosities.

Embodied Agents

“Leveraging Foundation Models for Causal Generative Modeling” (Komanduri & Wu, May 2026) — FM-CGM — formalizes end-to-end visual causal reasoning using pretrained foundation models. For embodied agents, the ability to reason causally about visual scenes is a prerequisite for reliable real-world action.

“Robotic Strawberry Harvesting with Robust Vision and Deep Reinforcement Learning based Sim-to-Real Control” (Bashir et al., May 2026) demonstrated a complete closed-loop agentic system: vision segmentation → RL-based planning → ROS-based execution. It is a full-stack agent deployment, from perception through decision-making to physical action, that works in real-world agricultural conditions.

Looking Forward

The agent research of 2025–2026 reveals a field productively questioning its own foundations. The external orchestrator model is being challenged by weight compilation, and the debate between them will define agent architecture for the next year. Skill lifecycle management has now been studied end to end, with its bottlenecks identified. Multi-agent coordination has moved from toy problems to real resource-sharing constraints.

The open question is where the balance lies between compilation and orchestration. For strictly procedural tasks, weight compilation seems to dominate. But for tasks requiring dynamic tool selection, environmental interaction, and real-time adaptation, external orchestration may remain necessary. The research frontier for the next year will be characterizing exactly when each approach wins — and whether hybrid architectures can capture the best of both.

What’s clear is that the agent stack is being rewritten. The frameworks we reach for today may not be the ones we reach for tomorrow.

This article is part of the Frontier AI Research Digest backfill series, surveying 37 papers from May 2025 – May 2026. Views expressed are the author’s research synthesis, not affiliate endorsements.
