Week 22, 2026 — Reasoning & Reinforcement Learning for LLMs

Test-time compute and reasoning methods dominated this week’s research, with breakthroughs in self-verification, efficient sampling, and working memory mechanisms.

Self-Trained Verification Unlocks Both Test-Time and Training-Time Gains

Self-Trained Verification (STV) by Chen Henry Wu and Aditi Raghunathan addresses the central bottleneck in LLM self-improvement: the verifier. The key insight is that a model that cannot catch its own errors unaided can catch them when shown the reference solution. By training the verifier to imitate this more informed version of itself, STV doubles accuracy on hard math and lifts scientific reasoning from 1.5% to 21%. The follow-up Verifier-in-the-Loop (ViL) training, which uses STV’s feedback inside verification-refinement loops, yields a further 33% pass@1 gain on top of an RL-converged generator. Paper

Entropy-Guided Decision Point Sampling

Felix Zhou and colleagues identify a critical flaw in existing test-time sampling methods: uniformly random “cut” positions mostly rewrite local details rather than revisiting true decision points. Their Entropy-Cut Metropolis-Hastings uses next-token entropy as a proxy for key reasoning decisions (e.g., the choice of proof strategy or algorithm). They prove that mixing time scales with the number of decisions rather than the number of tokens, and report that the method outperforms RL-trained models on MATH500, HumanEval, and AIME26. Paper
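To make the entropy-as-proxy idea concrete, here is a minimal sketch of picking a cut position with probability proportional to next-token entropy. It covers only the position choice, not the Metropolis-Hastings acceptance step, and it is not the authors' implementation: the function names, the proportional weighting, and the toy distributions are all assumptions made for illustration.

```python
import math
import random

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)

def cut_distribution(step_probs):
    """Turn per-position entropies into a distribution over cut positions.

    Assumption: cut probability is proportional to entropy. The paper may
    use a different weighting; this is only the simplest choice.
    """
    entropies = [token_entropy(p) for p in step_probs]
    total = sum(entropies)
    if total == 0.0:
        # Every position is deterministic, so fall back to a uniform cut.
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in entropies]

def sample_cut(step_probs, rng=random):
    """Sample one cut index, favoring high-entropy positions."""
    weights = cut_distribution(step_probs)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Hypothetical next-token distributions for a four-token trace.
toy = [
    [1.0, 0.0, 0.0],     # forced token: never worth revisiting
    [0.5, 0.5, 0.0],     # two-way decision
    [0.98, 0.01, 0.01],  # near-deterministic local detail
    [1/3, 1/3, 1/3],     # three-way decision: most likely cut
]
weights = cut_distribution(toy)
```

Under this weighting, a uniform sampler would spend half its cuts on positions 0 and 2, while the entropy-weighted one concentrates almost all of its mass on the two genuine decisions.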

Reasoning in Memory: Working Memory Without Generated Thoughts

Reasoning in Memory (RiM) by Lukas Aichberger and Sepp Hochreiter replaces autoregressive reasoning chains with fixed memory blocks processed in a single forward pass. Fixed sequences of special tokens unlock working-memory capacity without generating intermediate thoughts — decoupling internal computation from external communication. A two-stage curriculum first grounds memory blocks by predicting explicit reasoning steps, then discards step-level supervision. Across model families and sizes, RiM matches or exceeds existing latent reasoning methods. Paper

CoSPlay: Cooperative Code-UT Self-Play Without Ground Truth

CoSPlay by Zhangyi Hu et al. achieves test-time code scaling without ground-truth unit tests by having a pool of candidate programs and a pool of unit tests co-evolve through cooperative self-play. A bidirectional pass-count matrix iteratively prunes weak code candidates and unreliable tests. On Qwen2.5-7B-Instruct, it lifts Best-of-N (BoN) accuracy from 22.1% to 33.2% and unit test accuracy from 14.6% to 78.3%, matching the RLVR-trained CURE-7B without any training. Paper
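The mutual pruning over a pass-count matrix can be sketched roughly as follows. This is a toy heuristic under stated assumptions, not CoSPlay's actual procedure: the function name, the one-drop-per-round schedule, and the rule "a test is more reliable the more surviving candidates pass it" are all illustrative choices, and the last of these would be fooled by a trivially passing test.

```python
def prune_pools(passes, keep_codes, keep_tests):
    """Alternately drop the weakest code candidate and the weakest test.

    passes[i][j] is True when code candidate i passes test j.
    Returns the indices of the surviving candidates and tests.
    """
    codes = list(range(len(passes)))
    tests = list(range(len(passes[0])))
    while len(codes) > keep_codes or len(tests) > keep_tests:
        if len(codes) > keep_codes:
            # Score each candidate by how many surviving tests it passes.
            score = {i: sum(passes[i][j] for j in tests) for i in codes}
            # Lowest score goes first; ties drop the highest index.
            codes.remove(min(codes, key=lambda i: (score[i], -i)))
        if len(tests) > keep_tests:
            # Score each test by how many surviving candidates pass it.
            score = {j: sum(passes[i][j] for i in codes) for j in tests}
            tests.remove(min(tests, key=lambda j: (score[j], -j)))
    return codes, tests

# Hypothetical pass matrix: candidates 0 and 1 agree with tests 0-2,
# while test 3 is passed only by the two weaker candidates.
passes = [
    [True,  True,  True,  False],
    [True,  True,  True,  False],
    [False, True,  False, True],
    [False, False, False, True],
]
surviving_codes, surviving_tests = prune_pools(passes, keep_codes=2, keep_tests=2)
```

Because each pool is scored only against the survivors of the other, removing a weak candidate lowers the score of the tests that only it passed, which is what lets the two pools clean each other up without ground truth.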

ARES: Automated Rubric Synthesis for Scalable RL

ARES by Xiaoyuan Li et al. automatically constructs rubric-annotated RL data at scale from raw pretraining documents. It converts source knowledge into question-answer pairs with instance-specific weighted rubrics, generating 100K instances across ten domains. Rubric-based RL with ARES outperforms continual pretraining, SFT, and binary-reward RL, with the largest gains on multi-dimensional open-ended tasks. Paper
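One way to see why a weighted rubric gives a denser signal than a binary reward is the small sketch below. The scoring rule (weighted fraction of satisfied criteria), the function name, and the example rubric are assumptions for illustration; the summary above does not specify how ARES aggregates rubric items.

```python
def rubric_reward(rubric, judgments):
    """Weighted fraction of rubric criteria that a response satisfies.

    rubric: list of (criterion, weight) pairs for one question.
    judgments: dict mapping criterion -> bool, e.g. from a grader model.
    Criteria missing from judgments count as not satisfied.
    """
    total = sum(weight for _, weight in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(weight for name, weight in rubric if judgments.get(name, False))
    return earned / total

# Hypothetical instance-specific rubric for one open-ended question.
rubric = [
    ("states the correct mechanism", 3.0),
    ("cites the relevant constraint", 2.0),
    ("gives units", 1.0),
]
judgments = {"states the correct mechanism": True, "gives units": True}

reward = rubric_reward(rubric, judgments)  # partial credit: 4.0 of 6.0
binary = 1.0 if all(judgments.get(name, False) for name, _ in rubric) else 0.0
```

Here the response earns partial credit under the rubric but zero under an all-or-nothing check, which is the kind of gap that matters most on multi-dimensional open-ended tasks.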

Additional papers:
Metacognition as Reward (MaR) — Qwen3.5-9B + MaR surpasses GPT-OSS-120B on average
HPO: Hysteretic Policy Optimization — Addresses sparse-reward GRPO failure modes
In-Context Reward Adaptation — Transformers adapting rewards to unseen preferences
CCOPD: Canonical-Context On-Policy Distillation — 32% improvement on multi-turn fragmented conversations

Key insight: The next frontier in reasoning is not bigger models — it’s better verification and more efficient test-time computation.
