{"id":20,"date":"2025-11-22T09:00:00","date_gmt":"2025-11-22T14:00:00","guid":{"rendered":"https:\/\/monizesairesearch.com\/index.php\/2025\/11\/22\/backfill-07-rl-post-training\/"},"modified":"2026-05-26T01:48:41","modified_gmt":"2026-05-26T05:48:41","slug":"backfill-07-rl-post-training","status":"publish","type":"post","link":"https:\/\/monizesairesearch.com\/index.php\/2025\/11\/22\/backfill-07-rl-post-training\/","title":{"rendered":"Post-Training Is the New Pre-Training: Why RL Training Loops Are Reshaping AI in 2026"},"content":{"rendered":"<p><strong>51 papers analyzed<\/strong> | May 2025 \u2013 May 2026<\/p>\n<p>&#8212;<\/p>\n<p>If you think the biggest AI breakthroughs come from bigger models trained on more data, you&#8217;re looking in the wrong place.<\/p>\n<p>The most important AI research of 2025\u20132026 didn&#8217;t happen during pre-training. It happened <em>after<\/em> \u2014 in the post-training phase where models learn to actually be useful, to follow instructions, to reason reliably, and to act as agents. And the tool driving this revolution? 
Reinforcement learning, applied in ways that go far beyond the chat-based RLHF the industry broadly adopted in 2023.<\/p>\n<p>Fifty-one papers surveyed across this frontier paint a clear picture: <strong>post-training is the new pre-training.<\/strong> It&#8217;s where bias is introduced, where reasoning style is determined, where agents learn skills, and where the difference between a mediocre model and a transformative one is made.<\/p>\n<p>Let&#8217;s dive into what the research actually says.<\/p>\n<p>&#8212;<\/p>\n<h2>The Big Bomb: Bias Comes from Post-Training, Not Pre-Training<\/h2>\n<p>The year&#8217;s most consequential finding comes from Bladon &#038; Bent (May 2026): <strong>geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt.<\/strong><\/p>\n<p>This work \u2014 covered across multiple topics in the Frontier AI Research Digest \u2014 is perhaps the year&#8217;s single most important post-training result. By testing base model vs. chat model pairs from seven different labs, the researchers isolated exactly where bias enters the pipeline. The answer is unambiguous: pre-trained base models are relatively neutral. It&#8217;s the RLHF and instruction-tuning stages \u2014 the very steps designed to make models &#8220;helpful and harmless&#8221; \u2014 that systematically inject geopolitical bias.<\/p>\n<p>The mechanism matters here. The authors found that the bias isn&#8217;t uniform; it&#8217;s amplified by the language of the prompt itself. Different phrasings produce different bias profiles from the same model, suggesting that the post-training process encodes not just preferences but specific cultural and political framings embedded in the human annotations and reward signals used during alignment.<\/p>\n<p>This is a profound reframing of how we think about AI safety. 
For years, the AI safety community has worried about data curation during pre-training \u2014 filtering toxic content, balancing representation, scrubbing personally identifiable information. But this research suggests the real problem is downstream. It&#8217;s in the reward signals, the preference datasets, and the human annotations that shape post-training. The cure \u2014 better reward design, more representative preference data, more careful construction of alignment targets \u2014 lies in how we train, not what we train on.<\/p>\n<p><strong>Key implication:<\/strong> If you care about building aligned AI, audit your post-training pipeline \u2014 not just your pre-training corpus.<\/p>\n<p>&#8212;<\/p>\n<h2>RL for Skill Acquisition: Training Agents, Not Prompting Them<\/h2>\n<p>The second major theme of the year was the growing convergence of reinforcement learning and agentic skill learning. Three papers define the trajectory and collectively point toward a future where agents are trained end-to-end for specific capabilities rather than prompted into competence.<\/p>\n<h3>Spreadsheet-RL: Domain-Specific Training Beats General Prompting<\/h3>\n<p>Chi et al. (May 2026) tackled a deceptively hard problem: getting LLMs to handle real spreadsheet tasks. Spreadsheets in the wild involve multi-step operations, conditional logic, error recovery, and complex formula construction \u2014 precisely the kind of task where prompted general-purpose LLMs fall apart.<\/p>\n<p>Their solution was straightforward in concept but powerful in execution: train spreadsheet agents using reinforcement learning with task-specific reward signals. Rather than relying on clever prompting over general LLMs \u2014 which prior work showed struggles with real-world spreadsheet complexity \u2014 Spreadsheet-RL trained agents end-to-end on actual spreadsheet tasks, optimizing for task completion, formula correctness, and operational efficiency.<\/p>\n<p>The results were decisive. 
RL-trained agents significantly outperformed prompted baselines across the board. For multi-step spreadsheet operations requiring planning, the RL agents showed dramatically better performance. For error recovery \u2014 one of the hardest challenges in spreadsheet automation \u2014 the RL-trained agents could self-correct in ways prompted models simply could not.<\/p>\n<h3>SkillOpt: Treat Skills as Trainable State<\/h3>\n<p>Yang et al. (May 2026) took this insight and went meta with SkillOpt. Current approaches to skill acquisition treat it as either prompt engineering (write better skill descriptions) or retrieval augmentation (find better skill examples from a database). Both are heuristic approaches \u2014 they don&#8217;t optimize for anything.<\/p>\n<p>SkillOpt instead treats the entire skill library as a trainable policy \u2014 state that is optimized with the same discipline as weight-space optimization in neural networks. The reward function is simple: task success. The agent&#8217;s skill library is the policy being optimized. Each time the agent attempts a task, it gets feedback on which skills helped and which didn&#8217;t. Over time, the skill library evolves to maximize task success across the distribution of tasks the agent encounters.<\/p>\n<p>This is reinforcement learning at the meta-level \u2014 RL applied to the process of skill acquisition itself. It represents a paradigm shift from &#8220;how do we prompt the model to use its skills?&#8221; to &#8220;how do we train the model to acquire better skills?&#8221;<\/p>\n<h3>Ratchet: The Sobering Reality Check<\/h3>\n<p>Zhang et al. (May 2026) delivered a finding that puts the brakes on fully autonomous skill evolution \u2014 for now. Ratchet implements a four-step skill management loop: write, retrieve, curate, retire. The study compared LLM-authored skills against human-curated skills in an agentic setting. 
The results were stark: LLM-authored skills delivered <strong>+0.0pp improvement<\/strong> over baselines \u2014 literally zero gain \u2014 while human-curated skills delivered <strong>+16.2pp improvement<\/strong>.<\/p>\n<p>This finding suggests that current post-training methods don&#8217;t adequately prepare models to evaluate their own outputs. The model can write skills, but it can&#8217;t judge whether those skills are any good. However, this isn&#8217;t a dead end. The 16.2pp gap between LLM-authored and human-curated skills is effectively a specification for what post-training needs to fix. If future post-training methods can close that gap by teaching models to evaluate their own outputs, fully autonomous agent skill evolution becomes achievable.<\/p>\n<p><strong>The trajectory:<\/strong> Spreadsheet-RL shows narrow RL working \u2192 SkillOpt shows how to generalize \u2192 Ratchet reveals the gap (self-evaluation) that needs closing.<\/p>\n<p>&#8212;<\/p>\n<h2>Training Stability: The Practical Toolkit Gets Serious<\/h2>\n<h3>Muon Optimizer Gets a Rigorous Theory<\/h3>\n<p>Mustafi et al. (May 2026) provided a rigorous theoretical foundation for the Muon optimizer. By identifying the regularized Muon update as a mirror descent step under a smoothed nuclear norm, the authors connected optimizer design to fundamental probability geometry. Muon isn&#8217;t just empirically effective \u2014 it has principled mathematical reasons for being so. For post-training practitioners, optimizer choice directly affects convergence speed and solution quality.<\/p>\n<h3>Training-Free Looped Transformers<\/h3>\n<p>Chen et al. (May 2026) introduced a lightweight inference-time wrapper that loops a mid-stack block of layers of a frozen checkpoint \u2014 no additional training required. The key finding is that naive block reapplication degrades performance, but proper loop application can improve reasoning depth. 
This is a post-training approach without training: you get the benefits of deeper computation without the cost of fine-tuning.<\/p>\n<p>&#8212;<\/p>\n<h2>Reward Design: The Highest-Leverage Skill<\/h2>\n<p>If post-training is fundamentally about optimizing against reward signals, then reward design emerges as the highest-leverage skill in AI research.<\/p>\n<p><strong>Geo-Align<\/strong> (Li et al., May 2026) introduced metric-geometry rewards for video alignment \u2014 applicable wherever geometric ground truth exists. <strong>FM-CGM<\/strong> (Komanduri &#038; Wu, May 2026) suggests using causal structure as a reward signal: rewarding models for making causally correct predictions, not just statistically plausible ones \u2014 a fundamentally different alignment target from maximizing likelihood.<\/p>\n<p>&#8212;<\/p>\n<h2>Shannon Scaling Law: A Unifying Theory<\/h2>\n<p><strong>&#8220;LLMs as Noisy Channels&#8221;<\/strong> (Ouyang et al., May 2026) proposed the Shannon Scaling Law, modeling LLM training as information transmission over a noisy channel. The framework explains non-monotonic phenomena like catastrophic overtraining \u2014 where performance deteriorates despite increased compute \u2014 and provides principled guidance for when to stop training.<\/p>\n<p><strong>Why it matters:<\/strong> This is the most ambitious theoretical framework in the post-training corpus. If LLM training is fundamentally an information transmission problem, post-training is about optimizing the channel \u2014 not just adding more compute.<\/p>\n<p>&#8212;<\/p>\n<h2>The Year Ahead: Integration, Not Discovery<\/h2>\n<p>The research surveyed across these 51 papers identifies failure modes \u2014 clipping bottlenecks, self-evaluation gaps, optimizer choice \u2014 and provides concrete fixes. 
The next year&#8217;s challenge is integration: baking these fixes into standardized post-training pipelines that produce predictably aligned models at scale.<\/p>\n<p>The open question remains: <strong>Can post-training be made reliable enough for high-stakes deployment?<\/strong><\/p>\n<p>Post-training isn&#8217;t a finishing step anymore. It&#8217;s the main event. And the research community has handed us the tools. Now we need to build the assembly line.<\/p>\n<p>&#8212;<\/p>\n<p><em>This article is part of the Frontier AI Research Digest backfill series, covering 51 papers on RL &#038; Post-Training from May 2025 \u2013 May 2026.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>51 papers analyzed | May 2025 \u2013 May 2026 &#8212; If you think the biggest AI breakthroughs come from bigger models trained on more data, you&#8217;re looking in the wrong place. The most important AI research of 2025\u20132026 didn&#8217;t happen during pre-training. It happened after \u2014 in the post-training phase where models learn to actually 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":19,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[],"class_list":["post-20","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-topic-07"],"_links":{"self":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/20","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/comments?post=20"}],"version-history":[{"count":1,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/20\/revisions"}],"predecessor-version":[{"id":38,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/20\/revisions\/38"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/media\/19"}],"wp:attachment":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/media?parent=20"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/categories?post=20"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/tags?post=20"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}