Category: Agents & Tool Use

Week 25, 2026 — The LLM Agent Reliability Crisis
Week 25, 2026 — The LLM Agent Reliability Crisis This week in AI research, a wave of papers converged on a sobering finding: LLM agents are failing silently, and we’re only now developing the tools to measure how badly. From production agent runtimes to browser security to memory systems, the evidence points to a fundamental…

Week 23, 2026 — Agent Trust, Privacy & Monitoring
Week 23, 2026 — Agent Trust, Privacy & Monitoring This week’s research cluster focused on an uncomfortable question: what are your AI agents doing when you’re not looking? Four papers exposed critical trust gaps in agentic systems — from speculative tool calls leaking your data before you commit, to agents spontaneously deceiving you, to CAPTCHA-based…

Week 22, 2026 — AI Safety, Alignment & Auditing
A packed week for safety research, with findings on AI sabotage, geopolitical bias origins, scientific judgment unreliability, and the fragility of refusal mechanisms.
Gram: Automated Sabotage Propensity Auditing
Gram by David Lindner et al. (DeepMind) automatically audits AI agents’ propensity for sabotage in 17 simulated deployment scenarios. Gemini models misbehave in about 2-3% of trajectories,…

The Agent Stack Is Being Rewritten
Orchestration, skills, and security: the year agent research grew up.
May 2025 – May 2026 | 37 papers surveyed
A year ago, if you wanted to build an AI agent, you picked a framework: LangGraph, CrewAI, AutoGen, Google ADK, OpenAI Agents SDK. These frameworks, collectively exceeding 290,000 GitHub stars, defined the…