Week 22, 2026 — Agentic Systems & Skills

This week's agent research brought notable advances in skill optimization, long-horizon memory management, and production-scale automated code review.

SkillOpt: Training Agent Skills Like Neural Network Weights

SkillOpt by Yifan Yang et al. introduces what the authors describe as the first systematic, controllable text-space optimizer for agent skills. An optimizer model turns scored rollouts into bounded add/delete/replace edits on skill documents and accepts a change only when the validation score strictly improves. For stable training, it adds a textual learning-rate budget, a rejected-edit buffer, and epoch-wise slow/meta updates. Results: SkillOpt beats all compared methods across the 52 evaluated cells. On GPT-5.5, it lifts accuracy over the no-skill baseline by +23.5 points in direct chat and +24.8 points inside Codex. Learned skills transfer across model scales and execution environments. Paper
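To make the accept/reject loop concrete, here is a minimal Python sketch under stated assumptions: the skill is a list of lines, `propose` stands in for the optimizer model, `validate` stands in for the validation scorer, and `lr_budget` caps how many edits a single step may apply. All names are hypothetical, and the epoch-wise slow/meta updates are not modeled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # target line index in the skill document
    text: str = ""  # new line text for "add" and "replace"


def apply_edits(skill: list[str], edits: list[Edit]) -> list[str]:
    """Apply line edits from the highest index down so lower indices stay valid."""
    out = list(skill)
    for e in sorted(edits, key=lambda e: e.index, reverse=True):
        if e.op == "add":
            out.insert(e.index, e.text)
        elif e.op == "delete" and 0 <= e.index < len(out):
            del out[e.index]
        elif e.op == "replace" and 0 <= e.index < len(out):
            out[e.index] = e.text
    return out


def optimize_skill(
    skill: list[str],
    propose: Callable[[list[str], list[Edit]], list[Edit]],  # hypothetical optimizer-model call
    validate: Callable[[list[str]], float],                  # hypothetical validation scorer
    steps: int,
    lr_budget: int,  # textual "learning rate": max edits applied per step
) -> tuple[list[str], float]:
    rejected: list[Edit] = []  # rejected-edit buffer, shown back to the proposer
    best = validate(skill)
    for _ in range(steps):
        edits = propose(skill, rejected)[:lr_budget]
        candidate = apply_edits(skill, edits)
        score = validate(candidate)
        if score > best:  # accept only a strict improvement
            skill, best = candidate, score
        else:
            rejected.extend(edits)
    return skill, best


# Toy usage: the score counts lines that mention "check".
def toy_propose(s: list[str], rejected: list[Edit]) -> list[Edit]:
    return [Edit("add", len(s), "Always check outputs.")]


skill, best = optimize_skill(
    ["Read the task."], toy_propose,
    lambda s: sum("check" in line for line in s),
    steps=3, lr_budget=1,
)
print(best)  # 3
```

Because a candidate is kept only on strict improvement, the validation score never decreases from step to step; rejected edits are fed back so the proposer can avoid repeating them.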

Meta-Cognitive Memory Policy Optimization

MMPO by Ziyan Liu et al. tackles a fundamental problem: memory-augmented agents degrade as their summaries progressively discard task-relevant information. Instead of outcome-based RL rewards, MMPO uses Belief Entropy, a self-supervised proxy for the agent's epistemic uncertainty given its current memory, as fine-grained supervision. It maintains 97.1% performance even at 1.75M-token contexts, which the authors report as a new state of the art for long-horizon agent memory. Paper
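Belief Entropy can be illustrated with a simple estimator. The sketch below assumes the agent's belief is approximated by sampling several answers conditioned on the current memory and taking the Shannon entropy of their empirical distribution; the paper's actual estimator and reward shaping may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def belief_entropy(sampled_answers: list[str]) -> float:
    """Shannon entropy (nats) of the empirical distribution over sampled answers."""
    n = len(sampled_answers)
    if n == 0:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def memory_update_reward(answers_before: list[str], answers_after: list[str]) -> float:
    """Hypothetical dense reward: how much a memory update reduced belief entropy."""
    return belief_entropy(answers_before) - belief_entropy(answers_after)


# Answers sampled before and after a (hypothetical) memory rewrite.
before = ["Paris", "Lyon", "Paris", "Nice"]
after = ["Paris", "Paris", "Paris", "Lyon"]
print(round(memory_update_reward(before, after), 3))  # 0.477
```

A memory update that makes sampled answers agree more yields a positive reward, so each summarization step can be scored without waiting for the final task outcome.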

RADAR: Meta’s 535K+ Automated Code Review Pipeline

RADAR (Risk Aware Diff Auto Review) at Meta has processed 535K+ diffs and landed 331K+ of them without human review. Its multi-stage funnel first classifies diffs by authorship, then applies eligibility gates, static heuristics, a learned Diff Risk Score, LLM-based review, and deterministic validation. Key metrics: RADAR-landed diffs show a revert rate one third that of non-RADAR diffs and a production incident rate one fiftieth as high, and median time to close improved by a reported 330%. With AI-driven code volume growing 105.9% year over year, automated review at this scale is becoming essential. Paper
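The funnel's short-circuit structure can be sketched as follows. The stage order mirrors the description above, but the fields, authorship classes, and thresholds are hypothetical placeholders, not Meta's actual configuration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    HUMAN_REVIEW = "human_review"
    AUTO_LAND = "auto_land"


@dataclass
class Diff:
    authorship: str              # e.g. "human", "codemod", "ai_agent" (hypothetical classes)
    lines_changed: int
    touches_sensitive_path: bool
    risk_score: float            # output of a learned risk model in [0, 1]
    llm_approves: bool           # verdict of an LLM reviewer
    checks_pass: bool            # deterministic validation (build, tests, lint)


# Hypothetical policy values for illustration only.
ELIGIBLE_AUTHORSHIP = {"codemod", "ai_agent"}
MAX_LINES = 200
RISK_THRESHOLD = 0.2


def route(diff: Diff) -> Route:
    """Any stage can send a diff to humans; auto-landing requires passing every stage."""
    if diff.authorship not in ELIGIBLE_AUTHORSHIP:                     # authorship + eligibility gate
        return Route.HUMAN_REVIEW
    if diff.touches_sensitive_path or diff.lines_changed > MAX_LINES:  # static heuristics
        return Route.HUMAN_REVIEW
    if diff.risk_score >= RISK_THRESHOLD:                              # learned risk score
        return Route.HUMAN_REVIEW
    if not diff.llm_approves:                                          # LLM-based review
        return Route.HUMAN_REVIEW
    if not diff.checks_pass:                                           # deterministic validation
        return Route.HUMAN_REVIEW
    return Route.AUTO_LAND


safe = Diff("codemod", 12, False, 0.05, True, True)
print(route(safe))  # Route.AUTO_LAND
```

Ordering cheap, deterministic gates before the risk model and the LLM reviewer keeps the expensive stages reserved for diffs that are already plausible candidates for auto-landing.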

PushBench: Agents Don’t Know When to Stop

PushBench by Yuandao Cai et al. measures Quantitative Goal Persistence: whether an agent keeps working until a verified count is complete. Claude Code and Codex CLI solve many 50-artifact tasks but solve only 3 of 9 tasks at 100 artifacts. A state-tracking controller reaches 69–78% success while eliminating duplicate artifacts. Quantitative goals stress a different reliability dimension than local task competence. Paper
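One plausible shape for a state-tracking controller is an explicit ledger of verified artifacts that gates termination. This sketch assumes `produce` wraps an agent call that can see the completed set and `verify` checks a single artifact; both are hypothetical, and this is not the PushBench authors' controller.

```python
from typing import Callable, Optional


def run_until_count(
    target: int,
    produce: Callable[[frozenset[str]], Optional[str]],  # hypothetical agent step; sees completed ids
    verify: Callable[[str], bool],                        # hypothetical per-artifact check
    max_attempts: int,
) -> list[str]:
    """Keep requesting artifacts until `target` distinct verified ones exist or attempts run out."""
    completed: list[str] = []  # ledger in completion order
    seen: set[str] = set()     # fast duplicate check
    attempts = 0
    while len(completed) < target and attempts < max_attempts:
        attempts += 1
        artifact = produce(frozenset(seen))
        if artifact is None or artifact in seen:
            continue  # nothing produced, or a duplicate: do not count it
        if verify(artifact):
            seen.add(artifact)
            completed.append(artifact)
    return completed


stream = iter(["a1", "a2", "a1", "bad", "a3", "a4"])
result = run_until_count(3, lambda done: next(stream), lambda a: a != "bad", max_attempts=10)
print(result)  # ['a1', 'a2', 'a3']
```

The design choice here is that stopping depends on the ledger's verified count rather than on the agent declaring it is finished; `max_attempts` bounds cost if the agent stalls.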

Loong: RL-Optimized Document Translation Agent

Loong by Yutong Wang et al. pairs a 3E memory module (Essence, Exemplar, Entity) with RL-optimized context selection for long-document translation. It reports average gains of +13.0 points across English↔Chinese, English↔German, and English↔French, with strong generalization to new domains and robustness to contextual noise. Paper
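A minimal data-structure sketch of a 3E memory is below. The field layout is an assumption, and the word-overlap ranking is a simple heuristic substituted for Loong's RL-optimized context selection.

```python
from dataclasses import dataclass, field


@dataclass
class Memory3E:
    essence: str = ""  # running summary of the document translated so far
    exemplars: list[tuple[str, str]] = field(default_factory=list)  # (source, translation) pairs
    entities: dict[str, str] = field(default_factory=dict)          # term -> agreed translation

    def select_context(self, segment: str, k: int = 2) -> dict:
        """Heuristic stand-in for learned context selection: rank exemplars by word overlap."""
        words = set(segment.lower().split())
        ranked = sorted(
            self.exemplars,
            key=lambda ex: len(words & set(ex[0].lower().split())),
            reverse=True,
        )
        # Keep only glossary entries whose term appears in the current segment.
        glossary = {t: tr for t, tr in self.entities.items() if t.lower() in segment.lower()}
        return {"essence": self.essence, "exemplars": ranked[:k], "entities": glossary}


mem = Memory3E(
    essence="A travel story set in Berlin.",
    exemplars=[("the red car", "das rote Auto"), ("a blue sky", "ein blauer Himmel")],
    entities={"Berlin": "Berlin", "Brandenburg Gate": "Brandenburger Tor"},
)
ctx = mem.select_context("The red car stopped in Berlin", k=1)
```

Separating the three memory types lets each be filtered differently: the essence is always included, while exemplars and entities are narrowed to what the current segment needs.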

Additional papers:
GRASP: Plan-Guided Graph Retrieval — New SOTA on STaRK benchmarks (+11.9 Hit@1)
Contextual Belief Management (CBM) — RL reduces belief-tracking failures by 70.9%
Proactive Agents with TGL Triggers — 4-83x faster than LLM-as-trigger, 14x better F1
Unifying Temporal and Structural Credit Assignment — Block coordinate descent for multi-agent prompt optimization
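Block coordinate descent over prompts can be illustrated generically: treat each agent's prompt as one block and optimize one block at a time while holding the others fixed. The candidate lists and `score` function are hypothetical, and this sketch does not model the paper's temporal and structural credit assignment.

```python
from typing import Callable, Sequence


def block_coordinate_descent(
    prompts: list[str],
    candidates: Sequence[Sequence[str]],  # candidates[i]: alternative prompts for agent i
    score: Callable[[list[str]], float],  # hypothetical end-to-end system score
    sweeps: int = 2,
) -> tuple[list[str], float]:
    """Optimize one agent's prompt (one block) at a time, holding the others fixed."""
    current = list(prompts)
    best = score(current)
    for _ in range(sweeps):
        for i, options in enumerate(candidates):
            for option in options:
                trial = current[:i] + [option] + current[i + 1:]
                s = score(trial)
                if s > best:
                    current, best = trial, s
    return current, best


# Toy score: number of agents using a "v2" prompt.
final, best = block_coordinate_descent(
    ["planner v1", "coder v1"],
    [["planner v1", "planner v2"], ["coder v1", "coder v2"]],
    lambda ps: sum(p.endswith("v2") for p in ps),
)
print(final, best)  # ['planner v2', 'coder v2'] 2
```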

Key insight: SkillOpt suggests that agent skills can be trained with rigor comparable to neural network weights, opening a path to systematic, reproducible agent improvement rather than ad hoc prompt engineering.
