51 papers analyzed | May 2025 – May 2026
—
If you think the biggest AI breakthroughs come from bigger models trained on more data, you’re looking in the wrong place.
The most important AI research of 2025–2026 didn’t happen during pre-training. It happened after — in the post-training phase where models learn to actually be useful, to follow instructions, to reason reliably, and to act as agents. And the tool driving this revolution? Reinforcement learning, applied in ways that go far beyond the chat-based RLHF the industry broadly adopted in 2023.
Fifty-one papers surveyed across this frontier paint a clear picture: post-training is the new pre-training. It’s where bias is introduced, where reasoning style is determined, where agents learn skills, and where the difference between a mediocre model and a transformative one is made.
Let’s dive into what the research actually says.
—
The Bombshell: Bias Comes from Post-Training, Not Pre-Training
The year’s most consequential finding comes from Bladon & Bent (May 2026): geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt.
This work — covered across multiple topics in the Frontier AI Research Digest — is perhaps the year’s single most important post-training result. By testing base model vs. chat model pairs from seven different labs, the researchers isolated exactly where bias enters the pipeline. The answer is unambiguous: pre-trained base models are relatively neutral. It’s the RLHF and instruction-tuning stages — the very steps designed to make models “helpful and harmless” — that systematically inject geopolitical bias.
The mechanism matters here. The authors found that the bias isn’t uniform; it’s amplified by the language of the prompt itself. Different phrasings produce different bias profiles from the same model, suggesting that the post-training process encodes not just preferences but specific cultural and political framings embedded in the human annotations and reward signals used during alignment.
This is a profound reframing of how we think about AI safety. For years, the AI safety community has worried about data curation during pre-training — filtering toxic content, balancing representation, scrubbing personally identifiable information. But this research suggests the real problem is downstream. It’s in the reward signals, the preference datasets, and the human annotations that shape post-training. The cure — better reward design, more representative preference data, more careful construction of alignment targets — lies in how we train, not what we train on.
> Key implication: If you care about building aligned AI, audit your post-training pipeline — not just your pre-training corpus.
—
RL for Skill Acquisition: Training Agents, Not Prompting Them
The second major theme of the year was the growing convergence of reinforcement learning and agentic skill learning. Three papers define the trajectory and collectively point toward a future where agents are trained end-to-end for specific capabilities rather than prompted into competence.
Spreadsheet-RL: Domain-Specific Training Beats General Prompting
Chi et al. (May 2026) tackled a deceptively hard problem: getting LLMs to handle real spreadsheet tasks. Spreadsheets in the wild involve multi-step operations, conditional logic, error recovery, and complex formula construction — precisely the kind of task where prompted general-purpose LLMs fall apart.
Their solution was straightforward in concept but powerful in execution: train spreadsheet agents using reinforcement learning with task-specific reward signals. Rather than relying on clever prompting over general LLMs — which prior work showed struggles with real-world spreadsheet complexity — Spreadsheet-RL trained agents end-to-end on actual spreadsheet tasks, optimizing for task completion, formula correctness, and operational efficiency.
The results were decisive. RL-trained agents significantly outperformed prompted baselines across the board. For multi-step spreadsheet operations requiring planning, the RL agents showed dramatically better performance. For error recovery — one of the hardest challenges in spreadsheet automation — the RL-trained agents could self-correct in ways prompted models simply could not.
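To make the idea of a task-specific reward concrete, here is a minimal sketch that combines the three optimization targets named above (task completion, formula correctness, operational efficiency) into one scalar. The `EpisodeResult` fields, the weights, and the step-budget notion of efficiency are all assumptions made for illustration; the paper's actual reward is not described here.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Outcome of one spreadsheet episode (hypothetical structure)."""
    task_completed: bool   # did the final sheet match the target?
    formulas_correct: int  # formulas that evaluate to the expected value
    formulas_total: int    # formulas the agent wrote
    steps_used: int        # tool calls the agent made
    step_budget: int       # maximum tool calls allowed (assumed > 0)


def spreadsheet_reward(r: EpisodeResult,
                       w_task: float = 0.6,
                       w_formula: float = 0.3,
                       w_eff: float = 0.1) -> float:
    """Weighted sum of the three signals; the weights are illustrative only."""
    task = 1.0 if r.task_completed else 0.0
    formula = r.formulas_correct / r.formulas_total if r.formulas_total else 0.0
    # Efficiency: fraction of the step budget left unused, floored at 0.
    eff = max(0.0, 1.0 - r.steps_used / r.step_budget)
    return w_task * task + w_formula * formula + w_eff * eff
```

A weighted sum is the simplest possible choice. In practice one might instead gate the formula and efficiency terms on task completion so that an agent cannot collect reward for tidy but useless work.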
SkillOpt: Treat Skills as Trainable State
Yang et al. (May 2026) took this insight and went meta with SkillOpt. Current approaches to skill acquisition treat it as either prompt engineering (write better skill descriptions) or retrieval augmentation (find better skill examples from a database). Both are heuristic: neither optimizes the skill library against an explicit objective.
SkillOpt instead treats the entire skill library as trainable state, optimized with the same discipline as weight-space optimization in neural networks. The skill library plays the role of the policy, and the reward is simply task success. Each time the agent attempts a task, it gets feedback on which skills helped and which didn’t. Over time, the skill library evolves to maximize task success across the distribution of tasks the agent encounters.
This is reinforcement learning at the meta-level — RL applied to the process of skill acquisition itself. It represents a paradigm shift from “how do we prompt the model to use its skills?” to “how do we train the model to acquire better skills?”
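One way to picture "skills as trainable state" is a toy bandit-style library in which each skill carries a score that is nudged toward the observed task outcome. This is only a sketch of the general idea, not SkillOpt's algorithm: the `SkillLibrary` class, the epsilon-greedy selection, and the learning rate are all hypothetical choices.

```python
import random


class SkillLibrary:
    """Toy skill library whose per-skill scores are updated from task success."""

    def __init__(self, skills, lr=0.1, seed=0):
        self.scores = {name: 0.5 for name in skills}  # neutral starting score
        self.lr = lr
        self.rng = random.Random(seed)

    def select(self, k=2, epsilon=0.1):
        """Pick k skills: usually the highest-scoring, occasionally random ones."""
        names = list(self.scores)
        if self.rng.random() < epsilon:
            return self.rng.sample(names, k)
        return sorted(names, key=self.scores.get, reverse=True)[:k]

    def update(self, used, success):
        """Move each used skill's score toward the observed outcome (0 or 1)."""
        target = 1.0 if success else 0.0
        for name in used:
            self.scores[name] += self.lr * (target - self.scores[name])
```

A training loop would alternate `select`, a task attempt, and `update`. Note that this toy version credits every used skill equally, which sidesteps the hard part: working out which skill actually helped.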
Ratchet: The Sobering Reality Check
Zhang et al. (May 2026) delivered a finding that puts the brakes on fully autonomous skill evolution — for now. Ratchet implements a four-step skill management loop: write, retrieve, curate, retire. The study compared LLM-authored skills against human-curated skills in an agentic setting. The results were stark: LLM-authored skills delivered +0.0pp improvement over baselines — literally zero gain — while human-curated skills delivered +16.2pp improvement.
This finding suggests that current post-training methods don’t adequately prepare models to evaluate their own outputs. The model can write skills, but it can’t judge whether those skills are any good. However, this isn’t a dead end. The 16.2pp gap between LLM-authored and human-curated skills is effectively a specification for what post-training needs to fix. If future post-training methods can close that gap by teaching models to evaluate their own outputs, fully autonomous agent skill evolution becomes achievable.
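The four-step loop (write, retrieve, curate, retire) can be sketched as a small in-memory store. Everything below is a hypothetical illustration, not Ratchet's implementation: the `Skill` fields, the substring retrieval, and the retirement thresholds are assumptions. The `judge` argument to `curate` marks the step where, per the result above, the choice between a human and a model evaluator makes the difference.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    body: str
    approved: bool = False
    uses: int = 0
    wins: int = 0


@dataclass
class SkillStore:
    skills: dict = field(default_factory=dict)

    def write(self, name, body):
        """Step 1: add a candidate skill (unapproved until curated)."""
        self.skills[name] = Skill(name, body)

    def retrieve(self, query):
        """Step 2: return approved skills whose name or body mentions the query."""
        q = query.lower()
        return [s for s in self.skills.values()
                if s.approved and (q in s.name.lower() or q in s.body.lower())]

    def curate(self, judge):
        """Step 3: let a judge (human or model) approve or reject each stored skill."""
        for s in self.skills.values():
            s.approved = bool(judge(s))

    def retire(self, min_uses=5, min_win_rate=0.2):
        """Step 4: drop skills that were tried often enough and rarely helped."""
        stale = [n for n, s in self.skills.items()
                 if s.uses >= max(min_uses, 1) and s.wins / s.uses < min_win_rate]
        for n in stale:
            del self.skills[n]
        return stale
```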
> The trajectory: Spreadsheet-RL shows narrow RL working → SkillOpt shows how to generalize → Ratchet reveals the gap (self-evaluation) that needs closing.
—
Training Stability: The Practical Toolkit Gets Serious
Muon Optimizer Gets a Rigorous Theory
Mustafi et al. (May 2026) provided the rigorous theoretical foundation for the Muon optimizer. By identifying the regularized Muon update as a mirror descent step under a smoothed nuclear norm, the authors connected optimizer design to fundamental probability geometry. Muon isn’t just empirically effective — it has principled mathematical reasons for being so. For post-training practitioners, optimizer choice directly affects convergence speed and solution quality.
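For orientation, the plain (unregularized) Muon direction and a generic mirror descent step can be written side by side. The first line is standard linear algebra: maximizing a linear function over the spectral-norm ball yields the orthogonal factor of the momentum matrix, and the optimal value is its nuclear norm. The notation ($M_t$, $G_t$, $\eta$, $\phi$) is chosen here for illustration, and the paper's specific smoothed nuclear-norm potential is not reproduced.

```latex
% Plain Muon direction: orthogonalize the momentum M_t = U \Sigma V^\top
O_t \in \arg\max_{\|O\|_{\mathrm{op}} \le 1} \langle M_t, O \rangle,
\qquad O_t = U V^\top,
\qquad \langle M_t, U V^\top \rangle = \|M_t\|_{*}

W_{t+1} = W_t - \eta\, O_t

% Generic mirror descent step with potential \phi and Bregman divergence D_\phi
W_{t+1} = \arg\min_{W} \Big\{ \eta \langle G_t, W \rangle + D_\phi(W, W_t) \Big\}
```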
Training-Free Looped Transformers
Chen et al. (May 2026) introduced a lightweight inference-time wrapper that loops a mid-stack block of layers of a frozen checkpoint — no additional training required. The key finding is that naive block reapplication degrades performance, but proper loop application can improve reasoning depth. This is a post-training approach without training: you get the benefits of deeper computation without fine-tuning cost.
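The wrapper idea can be illustrated with a toy layer stack in which a contiguous mid-stack block is applied several times while the prefix and suffix run once. Note that this sketch shows only the naive block reapplication, which the paper reports as degrading performance; whatever adjustments make looping work well are not reproduced here. The function name and parameters are hypothetical, and the "layers" are plain callables standing in for frozen transformer blocks.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_looped(layers: Sequence[Layer], x: Vector,
               loop_start: int, loop_end: int, n_loops: int = 2) -> Vector:
    """Run a frozen stack, applying layers[loop_start:loop_end] n_loops times."""
    if not (0 <= loop_start < loop_end <= len(layers)) or n_loops < 1:
        raise ValueError("invalid loop range or loop count")
    for layer in layers[:loop_start]:      # prefix: run once
        x = layer(x)
    for _ in range(n_loops):               # mid-stack block: run repeatedly
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:        # suffix: run once
        x = layer(x)
    return x
```

With `n_loops=1` the function reduces to an ordinary forward pass, which makes it easy to compare looped and unlooped outputs on the same checkpoint.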
—
Reward Design: The Highest-Leverage Skill
If post-training is fundamentally about optimizing against reward signals, then reward design emerges as the highest-leverage skill in AI research.
Geo-Align (Li et al., May 2026) introduced metric geometry rewards for video alignment, an approach applicable wherever geometric ground truth exists. FM-CGM (Komanduri & Wu, May 2026) suggests using causal structure as a reward signal: rewarding models for making causally correct predictions, not just statistically plausible ones. That is a fundamentally different alignment target from maximizing likelihood.
—
Shannon Scaling Law: A Unifying Theory
“LLMs as Noisy Channels” (Ouyang et al., May 2026) proposed the Shannon Scaling Law, modeling LLM training as information transmission over a noisy channel. The framework explains non-monotonic phenomena like catastrophic overtraining — where performance deteriorates despite increased compute — and provides principled guidance for when to stop training.
> Why it matters: This is the most ambitious theoretical framework in the post-training corpus. If LLM training is fundamentally an information transmission problem, post-training is about optimizing the channel — not just adding more compute.
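For readers who want the reference point behind the channel analogy, the classical Shannon-Hartley capacity formula is given below. How the paper maps compute, data, or model size onto these quantities is not described here, so this should be read as background and not as the Shannon Scaling Law itself.

```latex
% Shannon-Hartley capacity of a band-limited channel with additive Gaussian noise
C = B \log_2\!\left(1 + \frac{S}{N}\right)
% C: capacity (bits/s), B: bandwidth (Hz), S/N: signal-to-noise power ratio
```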
—
The Year Ahead: Integration, Not Discovery
The research surveyed across these 51 papers identifies failure modes — clipping bottlenecks, self-evaluation gaps, optimizer choice — and provides concrete fixes. The next year’s challenge is integration: baking these fixes into standardized post-training pipelines that produce predictably aligned models at scale.
The open question remains: Can post-training be made reliable enough for high-stakes deployment?
Post-training isn’t a finishing step anymore. It’s the main event. And the research community has handed us the tools. Now we need to build the assembly line.
—
This article is part of the Frontier AI Research Digest backfill series, covering 51 papers on RL & Post-Training from May 2025 – May 2026.
