34 papers surveyed | A year of progress in reasoning and inference-time compute scaling (May 2025 – May 2026)
—
For most of the last decade, the AI inference pipeline looked the same: you train a model, deploy it, and every query costs the same amount of compute. A simple factual lookup gets the same computational budget as a multi-step math proof.
That’s changing — fast.
Over the past year, a wave of research has reframed inference as a resource allocation problem rather than a fixed pipeline. Three themes define this shift:
1. Inference-time compute as a first-class resource — dynamically allocating compute per request, looping layers, and trading latency for depth.
2. Visual and multimodal reasoning — moving beyond text-only chain-of-thought into “thinking with images.”
3. Scaling theory meets practice — new information-theoretic frameworks suggesting fundamental limits on reasoning capacity.
Let’s dive into the research driving this transformation.
—
Looping Your Way to Better Reasoning
The year’s most immediately practical finding comes from “Training-Free Looped Transformers” (Chen et al., May 2026). The idea is deceptively simple: take a frozen, already-trained model, identify a mid-stack block of layers, and loop it at inference time. The model effectively gets “more thinking time” on harder problems without any additional training.
The non-trivial finding is that naive block reapplication, simply running the same layers again, degrades performance. With the right loop point and a lightweight wrapper, however, the authors report meaningful gains. Because no retraining is involved, this is test-time compute scaling that can in principle be applied to any existing checkpoint.
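To make the looping idea concrete, here is a minimal sketch in plain Python. It is not the paper's method: the toy layers, the function names, and the damped blend controlled by `alpha` are all assumptions chosen for illustration, with the blend standing in for whatever "lightweight wrapper" the authors actually use. Setting `alpha=1.0` reproduces naive block reapplication.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(scale: float) -> Layer:
    """Toy residual layer standing in for a transformer block (illustration only)."""
    def layer(h: Vector) -> Vector:
        return [x + scale * math.tanh(x) for x in h]
    return layer


def run_layers(layers: Sequence[Layer], h: Vector) -> Vector:
    """Apply the given layers in order and return the final hidden state."""
    for layer in layers:
        h = layer(h)
    return h


def looped_forward(layers: Sequence[Layer], h: Vector, loop_start: int,
                   loop_end: int, n_loops: int = 1, alpha: float = 0.5) -> Vector:
    """Run the stack, passing through layers[loop_start:loop_end] n_loops times.

    The first pass through the block is the ordinary forward pass. Each extra
    pass is blended into the hidden state with weight alpha. This damping is a
    hypothetical wrapper, not taken from the paper.
    """
    if n_loops < 1:
        raise ValueError("n_loops must be at least 1")
    if not 0 <= loop_start < loop_end <= len(layers):
        raise ValueError("invalid loop range")
    block = layers[loop_start:loop_end]
    h = run_layers(layers[:loop_start], h)   # layers before the loop
    h = run_layers(block, h)                 # ordinary pass through the block
    for _ in range(n_loops - 1):             # extra "thinking" passes
        again = run_layers(block, h)
        h = [(1.0 - alpha) * a + alpha * b for a, b in zip(h, again)]
    return run_layers(layers[loop_end:], h)  # layers after the loop


layers = [make_toy_layer(0.1 * (i + 1)) for i in range(6)]
x = [0.5, -1.0, 2.0]
baseline = looped_forward(layers, x, 2, 4, n_loops=1)           # plain forward pass
naive = looped_forward(layers, x, 2, 4, n_loops=3, alpha=1.0)   # naive reapplication
damped = looped_forward(layers, x, 2, 4, n_loops=3, alpha=0.3)  # damped looping
```

In a real model the loop range and the number of passes would have to be chosen empirically, and the toy numbers here say nothing about which variant performs better.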
Why does this matter for reasoning specifically? Because it addresses a fundamental asymmetry: not all inputs require the same depth of reasoning. A factual lookup like “what’s the capital of France?” doesn’t need chain-of-thought, while a complex reasoning problem like “synthesize findings from ten research papers” benefits from additional processing. Training-Free Looped Transformers give you a way to allocate more compute to the latter without retraining the model.
Complementing this architectural approach, “ModeSwitch-LLM” (Sunesh et al., May 2026) tackles the system-level side: dynamically routing each query to the most appropriate inference mode (FP16, quantization, speculative decoding, or a hybrid). The premise is the same asymmetry between easy and hard queries, and the paper’s finding is that cheap workload-level features suffice to make the routing decision.
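A router of this kind can be sketched in a few lines. Everything below is an assumption made for illustration: the feature names, the thresholds, and the hand-written rules are hypothetical, and the four mode labels are treated as opaque strings. The paper's actual features and routing policy are not described in this digest.

```python
from dataclasses import dataclass

# Mode labels mirror the four modes named in the text; their implementations are out of scope.
MODES = ("fp16", "quantized", "speculative", "hybrid")


@dataclass
class QueryFeatures:
    """Cheap workload-level features (hypothetical choices for this sketch)."""
    prompt_tokens: int
    expected_output_tokens: int
    needs_reasoning: bool


def choose_mode(features: QueryFeatures) -> str:
    """Pick an inference mode from cheap features using illustrative, hand-set rules."""
    if features.needs_reasoning:
        # Assumption: keep full precision for multi-step reasoning, and switch
        # to the hybrid mode when a long output makes decoding speed matter too.
        return "hybrid" if features.expected_output_tokens > 512 else "fp16"
    if features.expected_output_tokens > 256:
        # Assumption: long but easy generations gain most from faster decoding.
        return "speculative"
    # Short, easy queries take the cheapest path.
    return "quantized"


lookup = QueryFeatures(prompt_tokens=12, expected_output_tokens=8, needs_reasoning=False)
proof = QueryFeatures(prompt_tokens=300, expected_output_tokens=900, needs_reasoning=True)
print(choose_mode(lookup), choose_mode(proof))
```

A learned classifier over the same features would slot into `choose_mode` without changing the calling code.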
Together, these papers signal that the field is converging on dynamic compute allocation as a core design principle. The question is no longer “what’s the best single inference mode?” but “how do we orchestrate multiple modes based on input difficulty?”
—
Visual Reasoning: When Words Aren’t Enough
Text-only chain-of-thought hits fundamental bottlenecks for spatially grounded questions. Try describing a cube rotation in words versus showing it: the limitation is obvious once you see it. This year’s research began treating this as a first-class problem rather than an edge case.
“ETCHR: Editing To Clarify and Harness Reasoning” (Zhang et al., May 2026) addresses this with a modular paradigm: use a dedicated image editing model for visual intermediate steps and a separate understanding model for reasoning. The paper’s conclusion is that specialization beats unification. Cramming visual manipulation and logical reasoning into a single model produces noisy, unreliable intermediate images, whereas ETCHR lets each specialized model do what it does best.
“SPACENUM” (Zhang et al., May 2026) asks a more troubling question: when VLMs output spatial coordinates, are those numbers genuinely grounded in perception? The answer is mostly no. Current models produce values that fall in the right range and follow plausible patterns but do not reflect actual spatial reasoning. For robotics and embodied AI, this is a critical finding: if you’re using a VLM to guide physical action, its numerical outputs may be confident-sounding hallucinations.
“PGT: Procedurally Generated Tasks” (Assouel et al., May 2026) takes a data-centric approach to visual grounding: overlaying geometric primitives on images creates dense supervision that forces models to attend to fine-grained visual details. The dual-purpose design — training and diagnostics — is elegantly simple: the same procedurally generated tasks that improve grounding also reveal exactly what kind of perception failures a model has.
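As a rough illustration of the procedural idea, the sketch below overlays a rectangle outline on a tiny grid "image" and emits a question with a known answer. The same answer can then be used either as a training target or to score a model's prediction. The task type, the function names, and the IoU scoring are assumptions for this example and are not taken from the paper.

```python
import random
from typing import Dict, List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive pixel indices


def make_task(image: Image, rng: random.Random, mark: int = 9) -> Dict:
    """Overlay a rectangle outline on a copy of the image and return a labeled task."""
    height, width = len(image), len(image[0])  # assumes at least a 3x3 image
    top = rng.randrange(0, height - 2)
    left = rng.randrange(0, width - 2)
    bottom = rng.randrange(top + 1, height)
    right = rng.randrange(left + 1, width)
    out = [row[:] for row in image]  # leave the source image untouched
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            if r in (top, bottom) or c in (left, right):
                out[r][c] = mark  # draw only the outline
    return {
        "image": out,
        "question": "Give (top, left, bottom, right) of the rectangle.",
        "answer": (top, left, bottom, right),
    }


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two inclusive pixel boxes, used as a diagnostic score."""
    def area(box: Box) -> int:
        return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

    inter_h = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    inter_w = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = inter_h * inter_w
    return inter / (area(a) + area(b) - inter)


blank = [[0] * 8 for _ in range(8)]
task = make_task(blank, random.Random(0))
print(task["question"], task["answer"])
```

Because the generator knows the ground truth exactly, a low `box_iou` on a given task points at a specific localization failure rather than a vague drop in benchmark accuracy.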
—
The Scaling Theory That Changes Everything
“LLMs as Noisy Channels” (Ouyang et al., May 2026) reframes training as information transmission over a noisy channel. The resulting Shannon Scaling Law explains non-monotonic phenomena — catastrophic overtraining, quantization degradation — that the old power-law framework couldn’t.
For reasoning, this has profound implications. If a model’s channel capacity (determined by its parameters) cannot accommodate the information needed to solve a reasoning problem, no amount of inference-time looping or chain-of-thought engineering will bridge that gap. This raises the central open question for the inference-scaling field: can test-time compute expand a model’s effective capacity beyond its information-theoretic bounds, or are those bounds fundamental?
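For readers unfamiliar with the term, the textbook notion of channel capacity can be shown with the binary symmetric channel, where each transmitted bit is flipped with probability p and the capacity is 1 - H(p) bits per use. The sketch below is only that classical formula. It is not the paper's Shannon Scaling Law, whose exact form is not given in this digest, and the helper names are assumptions.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(flip_prob: float) -> float:
    """Capacity in bits per use of a binary symmetric channel: C = 1 - H(p)."""
    return 1.0 - binary_entropy(flip_prob)


def max_reliable_bits(channel_uses: int, flip_prob: float) -> float:
    """Upper bound on the bits that can be sent reliably over the given number of uses."""
    return channel_uses * bsc_capacity(flip_prob)


for p in (0.0, 0.11, 0.5):
    print(f"flip_prob={p}: capacity={bsc_capacity(p):.3f} bits per use")
```

The point of the analogy is that capacity is a ceiling set by the channel itself: no decoder, however clever, can reliably extract more than `max_reliable_bits` from it.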
“Complete-muE” (Peng et al., May 2026) solves the practical bottleneck of transferring hyperparameters from dense to MoE models — infrastructure that enables next-generation reasoning architectures where expert capacity is allocated per domain.
—
Reasoning in Agentic Systems
Several papers explored reasoning embedded in agentic contexts, and their findings form a coherent narrative:
“SkillOpt” (Yang et al., May 2026) treats skill acquisition as optimization in text-space — the agent’s reasoning about which skill to apply and how to improve it is meta-reasoning that benefits from disciplined optimization rather than ad-hoc revision.
“Ratchet” (Zhang et al., May 2026) reveals that the real bottleneck in agentic reasoning isn’t writing skills — it’s managing them over time. The model’s ability to reason about its own skill library — when to retrieve a skill, when to retire an outdated one, when to combine skills — determines whether self-improvement actually works. Without this meta-cognitive capability, skill pools accumulate noise that degrades performance.
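The retrieve-and-retire part of that lifecycle can be sketched as a small bookkeeping class. This is a hypothetical illustration, not Ratchet's design: the class names, the keyword-based retrieval, and the `min_uses` and `min_success_rate` thresholds are all assumptions, and skill combination is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Skill:
    name: str
    text: str
    uses: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.uses if self.uses else 0.0


class SkillPool:
    """Tracks how skills perform so that weak ones can be retired (illustrative policy)."""

    def __init__(self, min_uses: int = 5, min_success_rate: float = 0.4) -> None:
        self.min_uses = min_uses
        self.min_success_rate = min_success_rate
        self._skills: Dict[str, Skill] = {}

    def add(self, name: str, text: str) -> None:
        self._skills[name] = Skill(name, text)

    def record(self, name: str, success: bool) -> None:
        """Log one use of a skill and whether it helped."""
        skill = self._skills[name]
        skill.uses += 1
        skill.successes += int(success)

    def retrieve(self, keyword: str) -> Optional[Skill]:
        """Return the best-performing skill whose text mentions the keyword, if any."""
        matches = [s for s in self._skills.values() if keyword.lower() in s.text.lower()]
        return max(matches, key=lambda s: s.success_rate, default=None)

    def retire(self) -> List[str]:
        """Drop skills that have had enough trials and still underperform."""
        stale = [name for name, s in self._skills.items()
                 if s.uses >= self.min_uses and s.success_rate < self.min_success_rate]
        for name in stale:
            del self._skills[name]
        return stale


pool = SkillPool(min_uses=3)
pool.add("parse_csv_v1", "Parse CSV by splitting on commas")
pool.add("parse_csv_v2", "Parse CSV with the csv module")
for ok in (False, False, True):
    pool.record("parse_csv_v1", ok)
for ok in (True, True, True):
    pool.record("parse_csv_v2", ok)
retired = pool.retire()
best = pool.retrieve("csv")
```

Without the `retire` step, the weaker skill would stay in the pool and keep competing at retrieval time, which is the noise-accumulation failure the paper describes.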
“Compiling Agentic Workflows into LLM Weights” (Dennis et al., May 2026) is the most provocative paper in this space. For procedural tasks, the external orchestration layer (LangGraph, CrewAI, etc.) may be unnecessary overhead. Compiling the workflow directly into the model’s weights achieves near-frontier quality at dramatically lower cost. This suggests that for many reasoning tasks, the reasoning structure itself belongs inside the model, not in an external controller.
—
Physical Reasoning and World Models
“LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation” (Jiang et al., May 2026) extracts motion priors from unlabeled video without external simulators — a form of implicit physics reasoning that learns what motions are plausible from observation alone. For reasoning systems that need physical intuition — robotics, autonomous vehicles, simulation — this kind of self-supervised prior could provide grounding that purely textual reasoning lacks.
“Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot” (Chang Ortega et al., May 2026) found that human-like visual representations require a hybrid of generative and discriminative training objectives. This has implications for reasoning: the training objective itself shapes the kind of reasoning the model can perform. Models trained only discriminatively may excel at classification but lack the generative understanding needed for counterfactual reasoning.
—
Looking Forward
The 2025–2026 inference-scaling narrative is clear: we’re transitioning from “train one big model and serve it uniformly” to dynamically allocating inference compute based on task difficulty, query type, and desired reasoning depth.
The biggest open question, and perhaps one of the most consequential in AI research today, is whether the Shannon Scaling Law can predict the limits of test-time compute scaling. If reasoning capacity depends on information-theoretic parameters alone, then no amount of inference-time cleverness overcomes those limits, and the path forward runs through pre-training innovation. But if test-time compute can effectively expand a model’s capacity, whether through looping, search, or recursive self-improvement, then inference optimization becomes as important as training.
This is the question that will define the next year of reasoning research.
—
Part of the Frontier AI Research Digest backfill series (May 2025 – May 2026). 34 papers surveyed.
