{"id":48,"date":"2026-05-31T13:27:52","date_gmt":"2026-05-31T17:27:52","guid":{"rendered":"https:\/\/monizesairesearch.com\/index.php\/2026\/05\/31\/weekly-research-digest-2\/"},"modified":"2026-05-31T20:25:47","modified_gmt":"2026-06-01T00:25:47","slug":"weekly-research-digest-2","status":"publish","type":"post","link":"https:\/\/monizesairesearch.com\/index.php\/2026\/05\/31\/weekly-research-digest-2\/","title":{"rendered":"Week 22, 2026 \u2014 Reasoning &#038; Reinforcement Learning for LLMs"},"content":{"rendered":"<p>Test-time compute and reasoning methods dominated this week&#8217;s research, with breakthroughs in self-verification, efficient sampling, and working memory mechanisms.<\/p>\n<h2>Self-Trained Verification Unlocks Both Test-Time and Training-Time Gains<\/h2>\n<p><strong>Self-Trained Verification (STV)<\/strong> by Chen Henry Wu and Aditi Raghunathan addresses the central bottleneck in LLM self-improvement: the verifier. The key insight is that while a model cannot catch its own errors alone, it <em>can<\/em> when shown the reference solution. By training the verifier to imitate a more informed version of itself, STV doubles accuracy on hard math and lifts scientific reasoning from 1.5% to 21%. The follow-up <strong>Verifier-in-the-Loop (ViL)<\/strong> training \u2014 using STV&#8217;s feedback inside verification-refinement loops \u2014 yields a further 33% pass@1 gain on top of an RL-converged generator. <a href=\"https:\/\/arxiv.org\/abs\/2605.30290v1\">Paper<\/a><\/p>\n<h2>Entropy-Guided Decision Point Sampling<\/h2>\n<p>Felix Zhou and colleagues identified a critical flaw in existing test-time sampling methods: uniformly random &#8220;cut&#8221; positions mostly rewrite local details rather than revisiting true decision points. Their <strong>Entropy-Cut Metropolis-Hastings<\/strong> uses next-token entropy as a proxy to identify key reasoning decisions (e.g., choice of proof strategy or algorithm). 
They prove mixing time scales with the number of decisions, not tokens \u2014 and empirically outperform RL-trained models across MATH500, HumanEval, and AIME26. <a href=\"https:\/\/arxiv.org\/abs\/2605.30327v1\">Paper<\/a><\/p>\n<h2>Reasoning in Memory: Working Memory Without Generated Thoughts<\/h2>\n<p><strong>Reasoning in Memory (RiM)<\/strong> by Lukas Aichberger and Sepp Hochreiter replaces autoregressive reasoning chains with fixed memory blocks processed in a single forward pass. Fixed sequences of special tokens unlock working-memory capacity without generating intermediate thoughts \u2014 decoupling internal computation from external communication. A two-stage curriculum first grounds memory blocks by predicting explicit reasoning steps, then discards step-level supervision. Across model families and sizes, RiM matches or exceeds existing latent reasoning methods. <a href=\"https:\/\/arxiv.org\/abs\/2605.30343v1\">Paper<\/a><\/p>\n<h2>CoSPlay: Cooperative Code-UT Self-Play Without Ground Truth<\/h2>\n<p><strong>CoSPlay<\/strong> by Zhangyi Hu et al. achieves test-time code scaling without ground-truth unit tests by having code and test pools co-evolve through cooperative self-play. A bidirectional pass-count matrix iteratively prunes weak codes and unreliable tests. On Qwen2.5-7B-Instruct, it lifts BoN from 22.1% to 33.2% and unit test accuracy from 14.6% to 78.3%, matching the RLVR-trained CURE-7B without any training. <a href=\"https:\/\/arxiv.org\/abs\/2605.23491v1\">Paper<\/a><\/p>\n<h2>ARES: Automated Rubric Synthesis for Scalable RL<\/h2>\n<p><strong>ARES<\/strong> by Xiaoyuan Li et al. automatically constructs rubric-annotated RL data at scale from raw pretraining documents. Converting source knowledge into question-answer pairs with instance-specific weighted rubrics, ARES generates 100K instances across ten domains. 
Rubric-based RL with ARES outperforms continual pretraining, SFT, and binary-reward RL \u2014 with the largest gains on multi-dimensional open-ended tasks. <a href=\"https:\/\/arxiv.org\/abs\/2605.23454v1\">Paper<\/a><\/p>\n<p><strong>Additional papers:<\/strong><br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.23384v1\">Metacognition as Reward (MaR)<\/a> \u2014 Qwen3.5-9B + MaR surpasses GPT-OSS-120B on average<br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.30201v1\">HPO: Hysteretic Policy Optimization<\/a> \u2014 Addresses sparse-reward GRPO failure modes<br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.30323v1\">In-Context Reward Adaptation<\/a> \u2014 Transformers adapting rewards to unseen preferences<br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.30251v1\">CCOPD: Canonical-Context On-Policy Distillation<\/a> \u2014 32% improvement on multi-turn fragmented conversations<\/p>\n<p><strong>Key insight:<\/strong> The next frontier in reasoning is not bigger models \u2014 it&#8217;s better verification and more efficient test-time computation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Test-time compute and reasoning methods dominated this week&#8217;s research, with breakthroughs in self-verification, efficient sampling, and working memory mechanisms. Self-Trained Verification Unlocks Both Test-Time and Training-Time Gains Self-Trained Verification (STV) by Chen Henry Wu and Aditi Raghunathan addresses the central bottleneck in LLM self-improvement: the verifier. 
The key insight is that while a model cannot [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":93,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,16],"tags":[],"class_list":["post-48","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-topic-02","category-weekly-digest"],"_links":{"self":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/48","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/comments?post=48"}],"version-history":[{"count":2,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/48\/revisions"}],"predecessor-version":[{"id":77,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/48\/revisions\/77"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/media\/93"}],"wp:attachment":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/media?parent=48"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/categories?post=48"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/tags?post=48"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}