{"id":47,"date":"2026-05-31T13:27:50","date_gmt":"2026-05-31T17:27:50","guid":{"rendered":"https:\/\/monizesairesearch.com\/index.php\/2026\/05\/31\/weekly-research-digest\/"},"modified":"2026-05-31T20:25:40","modified_gmt":"2026-06-01T00:25:40","slug":"weekly-research-digest","status":"publish","type":"post","link":"https:\/\/monizesairesearch.com\/index.php\/2026\/05\/31\/weekly-research-digest\/","title":{"rendered":"Week 22, 2026 \u2014 LLM Training &#038; Scaling Laws"},"content":{"rendered":"<p>This week brought transformative advances in understanding how large language models scale and train \u2014 from a unified theory of scaling failures to practical recipes for MoE hyperparameter transfer and data mixture auditing.<\/p>\n<h2>Shannon Scaling Law Unifies Training Phenomena<\/h2>\n<p>Xu Ouyang and colleagues proposed the <strong>Shannon Scaling Law<\/strong>, treating LLM training as information transmission over a noisy channel via the Shannon-Hartley theorem. This framework explains <em>catastrophic overtraining<\/em> and <em>quantization-induced degradation<\/em> \u2014 non-monotonic phenomena where more compute hurts performance. The key insight: scaling model size or data without preserving signal-to-noise ratio (SNR) amplifies noise, inducing a transition from monotonic improvement to U-shaped degradation. The theory achieves R\u00b2=0.847 predicting unseen 12B models from \u22646.9B fits \u2014 something prior monotonic laws cannot do. <a href=\"https:\/\/arxiv.org\/abs\/2605.23901v1\">Paper<\/a><\/p>\n<h2>Tune Dense Once, Transfer to All MoE<\/h2>\n<p><strong>Complete-muE<\/strong> by Hongwu Peng et al. solves the long-standing challenge of hyperparameter transfer from dense to Mixture-of-Experts models. The two-bridge system maps between dense FFN and any MoE variant (sparse, shared, group-balanced hybrids) via active-width \u00b5P and first-order SDE correction. 
The practical recipe: tune hyperparameters once on a single dense model, and they transfer near-optimally to all MoE configurations, eliminating expensive architecture-specific hyperparameter search. <a href=\"https:\/\/arxiv.org\/abs\/2605.23893v1\">Paper<\/a><\/p>\n<h2>LLMSurgeon: Reverse-Engineering Training Data DNA<\/h2>\n<p><strong>LLMSurgeon<\/strong> from Yaxin Luo et al. performs post-hoc auditing of an LLM&#8217;s pretraining data mixture using only generated text. It casts Data Mixture Surgery as an inverse problem under label-shift assumptions, using calibrated soft confusion matrices to recover latent domain priors. With the companion <strong>LLMScan<\/strong> evaluation suite built from transparent open-source LLMs, LLMSurgeon recovers domain mixtures with high fidelity \u2014 a crucial capability for accountability in an era where data composition is rarely disclosed. <a href=\"https:\/\/arxiv.org\/abs\/2605.30348v1\">Paper<\/a><\/p>\n<h2>Training-Free Looped Transformers<\/h2>\n<p>Lizhang Chen showed that looping a contiguous mid-stack block of layers at inference time, with no additional training, can improve performance. Treating block reapplication as damped refinement of an ODE Euler step, the method boosts Qwen3-4B-Instruct by 2.64 percentage points on MMLU-Pro. It works across dense, MoE, and MLA+MoE model families. <a href=\"https:\/\/arxiv.org\/abs\/2605.23872v1\">Paper<\/a><\/p>\n<h2>Distillation Without a Strong Teacher<\/h2>\n<p>Taiming Lu and Zhuang Liu challenged the core assumption of knowledge distillation: that stronger teachers produce better students. They found that small, undertrained teachers <em>improve<\/em> larger students, while pushing the teacher further can saturate or reverse gains. Distillation improves generalization (OOD and downstream) more readily than in-domain fitting \u2014 a counterintuitive but practically significant finding. 
<a href=\"https:\/\/arxiv.org\/abs\/2605.23857v1\">Paper<\/a><\/p>\n<p><strong>Additional papers:<\/strong><br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.23591v1\">Asymmetric Scaling Laws from Sparse Features<\/a> \u2014 Double-descent peak near interpolation threshold<br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.30334v1\">Demystifying Data Organization for Enhanced LLM Training<\/a> \u2014 STR and SAW ordering methods<br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.30288v1\">MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection<\/a> \u2014 Half the tokens, full performance<\/p>\n<p><strong>Why this matters:<\/strong> This week&#8217;s papers represent a maturation of LLM training science \u2014 moving from heuristic recipes to principled frameworks grounded in information theory, inverse problems, and physics-aware optimization.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This week brought transformative advances in understanding how large language models scale and train \u2014 from a unified theory of scaling failures to practical recipes for MoE hyperparameter transfer and data mixture auditing. 
Shannon Scaling Law Unifies Training Phenomena Xu Ouyang and colleagues proposed the Shannon Scaling Law, treating LLM training as information transmission over [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":92,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,16],"tags":[],"class_list":["post-47","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-topic-01","category-weekly-digest"],"_links":{"self":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/47","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/comments?post=47"}],"version-history":[{"count":2,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/47\/revisions"}],"predecessor-version":[{"id":75,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/47\/revisions\/75"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/media\/92"}],"wp:attachment":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/media?parent=47"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/categories?post=47"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/tags?post=47"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}