54 papers surveyed | May 2025 – May 2026
—
For years, multimodal AI was mostly a wiring problem: take a vision encoder, glue it to an LLM, add a projection layer, and call it a day. In 2025-2026, that era ended. The field stopped asking “how do we connect vision to language?” and started asking harder questions: What does it mean for a model to truly understand an image? How do we reason about spatial relationships without tokenizing everything into text? And what happens when multimodal models get so capable that they become a safety liability?
The answers, it turns out, involve much more than better connectors.
—
Thinking With Images (Not Just About Them)
The most productive frustration of the year was the realization that text-only chain-of-thought has fundamental limits for visual tasks. You can’t reason about spatial relationships, image composition, or visual analogies if your reasoning chain is purely textual — too much information gets lost in the translation from pixels to tokens.
ETCHR (Editing To Clarify and Harness Reasoning) by Zhang et al. made the decisive move: instead of cramming visual reasoning into a single monolithic model, decouple it. Use a dedicated image editing model to produce intermediate visual states — literally editing the image to clarify the question — and feed those edited images to a separate understanding model.
The takeaway is that specialization beats unification. A single model asked to handle both visual manipulation and logical reasoning tends to produce noisy, unreliable intermediate images; ETCHR instead lets each specialized model do what it does best. The image editing model focuses on pixel-level transformations. The understanding model focuses on logical reasoning. A language backbone coordinates them.
This modular approach isn’t just for ETCHR’s specific problem. It’s a principled architecture for multimodal reasoning: instead of a single model that does everything poorly, build a system of specialists coordinated by a generalist.
SPACENUM delivered a sobering result for anyone deploying VLMs in robotics or autonomous systems: the numerical outputs models produce for spatial coordinates and action magnitudes look plausible but aren’t genuinely grounded in spatial perception. If you’re counting on a VLM to tell a robot arm exactly where to move, you need to know that the numbers it outputs may be confident-sounding hallucinations. The framework tested two settings — numbers as dynamic transitions during reasoning and numbers as static spatial descriptions — and found systematic grounding failures in both.
PGT (Procedurally Generated Tasks) tackled the grounding problem from the data side. By overlaying unambiguous geometric primitives on images, PGT creates dense supervision that forces models to attend to fine-grained visual details — and simultaneously reveals exactly what kind of perception failures a model has. The dual-purpose design is elegantly practical: the same data that trains the model also diagnoses it.
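To make the "same data trains and diagnoses" idea concrete, here is a minimal sketch, assuming a toy grayscale image stored as nested lists. The names `overlay_square` and `make_task`, the left/right question, and the square primitive are all illustrative assumptions, not PGT's actual generator.

```python
import random

def overlay_square(image, top, left, size, value=255):
    """Return a copy of `image` with a filled square drawn on it."""
    out = [row[:] for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = value
    return out

def make_task(image, rng, size=4):
    """Overlay one square and emit a question whose answer is known by construction.

    The square is placed entirely inside one half of the image, so the
    left/right label is never ambiguous.
    """
    height, width = len(image), len(image[0])
    half = rng.choice(["left", "right"])
    if half == "left":
        left = rng.randrange(0, width // 2 - size + 1)
    else:
        left = rng.randrange(width // 2, width - size + 1)
    top = rng.randrange(0, height - size + 1)
    return {
        "image": overlay_square(image, top, left, size),
        "question": "Is the bright square in the left or right half of the image?",
        "answer": half,
        "box": (top, left, size),  # kept for diagnosis, hidden from the model
    }

rng = random.Random(0)
blank = [[0] * 16 for _ in range(16)]
task = make_task(blank, rng)
```

Because the label comes from the generator rather than from an annotator, a wrong model answer on such a task points directly at a perception failure and not at label noise.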
—
Breaking Through the Latent Bottleneck
Latent space models hit a quality ceiling because decoders are reconstruction-oriented, not detail-oriented. A decoder’s job is to invert the encoder’s compression, not to add visual richness. This year saw multiple breakthroughs that cracked that ceiling open.
PiD (Pixel Diffusion) by Lu et al. introduced a decoding paradigm that uses pixel diffusion to add detail after latent decoding. The result: dramatically better megapixel-scale output quality while keeping the latent space compact. The insight is that generation and decoding constraints should be decoupled — the latent space can stay efficient if the pixel space gets its own generative model. Instead of trying to build a better decoder, PiD adds a generative post-processing step.
VDE (Velocity Decomposition and Estimation) by Tan et al. tackled the efficiency problem in rectified flow models differently. Previous cache-based approaches for diffusion models reused stale computations, causing quality degradation over the denoising trajectory as the cached features become increasingly irrelevant. VDE decomposes the velocity field into components and estimates each one dynamically, shifting the paradigm from “cache and hope” to “decompose and reconstruct.” The approach is training-free and works across image, video, and 3D generation.
MaSC addressed a measurement problem that’s been quietly undermining concept personalization research: existing metrics (CLIP-I, DINO, CLIP-T) correlate poorly with human perception because they attend to the whole image rather than separating concept identity from prompt following. MaSC’s masked similarity approach, which computes similarity only on the region of interest, provides a much better signal for evaluating how well a generated image preserves a reference concept while following a new prompt.
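The difference between whole-image and region-restricted similarity can be sketched in a few lines. This is a toy illustration under stated assumptions: patch features are plain 2-D vectors, the mask is a 0/1 list per patch, and pooling is a simple mean. The function names are hypothetical, and the real MaSC metric may differ in every one of these choices.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pooled(patches, mask=None):
    """Average patch features, optionally restricted to patches where mask == 1."""
    if mask is None:
        mask = [1] * len(patches)
    kept = [p for p, m in zip(patches, mask) if m]
    if not kept:
        raise ValueError("mask selects no patches")
    dim = len(kept[0])
    return [sum(p[i] for p in kept) / len(kept) for i in range(dim)]

def masked_similarity(ref_patches, ref_mask, gen_patches, gen_mask):
    """Compare only the masked (concept) regions of two images."""
    return cosine(pooled(ref_patches, ref_mask), pooled(gen_patches, gen_mask))

# Toy data: the first two patches are the concept, the last two are background.
ref_patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
gen_patches = [[1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, -1.0]]
concept_mask = [1, 1, 0, 0]

whole_image = cosine(pooled(ref_patches), pooled(gen_patches))
concept_only = masked_similarity(ref_patches, concept_mask, gen_patches, concept_mask)
```

In this toy case the concept is preserved perfectly but the background changes, as a new prompt would require. The whole-image score drops to 0.0 while the masked score stays at 1.0, which is the separation of concept identity from prompt following that the paragraph describes.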
—
The Hard Problem: Compositional Understanding
CLIP-style contrastive pretraining has a subtle failure mode that Go et al. exposed with counterfactual phrase intervention: once coarse mismatches are removed from training data, stricter filtering no longer tracks compositional supervision. A global alignment score conflates “this pair is broadly plausible” with “every phrase in this caption is visually grounded.”
Consider a caption: “A red car next to a blue building.” A model might match this to an image of a red car next to a blue building — or to an image of a blue car next to a red building, since the objects and colors are all present. Global alignment can’t distinguish these cases. Go et al.’s counterfactual approach systematically tests whether each phrase is individually grounded by intervening on one phrase at a time and measuring the impact.
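The red car example can be turned into a small runnable sketch of one-phrase-at-a-time intervention. Everything here is a toy assumption: images and captions are reduced to (attribute, object) pairs, `bag_of_words_score` stands in for a global alignment score, and `bound_score` stands in for a scorer that respects attribute binding. None of this is Go et al.'s actual procedure.

```python
def bag_of_words_score(caption_phrases, image_phrases):
    """Global-style score: fraction of caption words that appear anywhere in the image."""
    caption_words = {w for phrase in caption_phrases for w in phrase}
    image_words = {w for phrase in image_phrases for w in phrase}
    return len(caption_words & image_words) / len(caption_words)

def bound_score(caption_phrases, image_phrases):
    """Phrase-level score: fraction of (attribute, object) pairs present as pairs."""
    return sum(p in image_phrases for p in caption_phrases) / len(caption_phrases)

def intervention_effects(scorer, caption_phrases, image_phrases, substitutes):
    """Replace one phrase at a time and record how much the score drops."""
    base = scorer(caption_phrases, image_phrases)
    effects = {}
    for i, phrase in enumerate(caption_phrases):
        edited = list(caption_phrases)
        edited[i] = substitutes[phrase]
        effects[phrase] = base - scorer(edited, image_phrases)
    return effects

image = {("red", "car"), ("blue", "building")}
caption = [("red", "car"), ("blue", "building")]
# Counterfactual phrases: same words, wrong binding.
swaps = {("red", "car"): ("blue", "car"), ("blue", "building"): ("red", "building")}

global_effects = intervention_effects(bag_of_words_score, caption, image, swaps)
bound_effects = intervention_effects(bound_score, caption, image, swaps)
```

The bag-of-words scorer shows a drop of 0.0 for both interventions, meaning neither phrase is individually grounded from its point of view, while the binding-aware scorer drops by 0.5 each time.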
CEDAR by Kubaty et al. revealed the compositional structure of pretrained VL embeddings without increasing dimensionality, using adaptive rotation rather than expansion. This matters because sparse autoencoders (SAEs) — the current popular approach for interpretability — compromise the original embedding geometry. CEDAR preserves geometry while still achieving disentanglement, making it practical for downstream tasks.
SimVA (Spatio-Temporal Similarity Volume Aggregation) by So et al. constructed dense 4D spatio-temporal similarity volumes from patch-level visual-text similarities for open-vocabulary action recognition. The standard approach pools all video features into a single representation before computing text similarity; SimVA instead keeps the patch-level information, retaining the fine-grained localization that global pooling loses. For action recognition, where the location and timing of an action matter as much as its category, this makes a significant difference.
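A rough sketch of the contrast between pooling first and keeping a dense similarity volume follows. It assumes the four axes are time, height, width, and text class, uses raw dot products on toy 2-D features, and omits the aggregation step entirely. The function names are hypothetical and the layout is an assumption, not SimVA's documented design.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_volume(video, text):
    """video[t][y][x] is a patch feature; text[c] is a class-text feature.

    Returns volume[t][y][x][c]: one similarity per patch and per class.
    """
    return [
        [[[dot(patch, t_vec) for t_vec in text] for patch in row] for row in frame]
        for frame in video
    ]

def pooled_similarity(video, text):
    """Baseline: average all patch features first, then compare with the text."""
    patches = [p for frame in video for row in frame for p in row]
    dim = len(patches[0])
    mean = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
    return [dot(mean, t_vec) for t_vec in text]

def peak_location(volume, class_index):
    """Return (t, y, x) of the patch most similar to the given class."""
    best, best_score = None, float("-inf")
    for t, frame in enumerate(volume):
        for y, row in enumerate(frame):
            for x, scores in enumerate(row):
                if scores[class_index] > best_score:
                    best, best_score = (t, y, x), scores[class_index]
    return best

# Toy clip: 2 frames of 2x2 patches; only one patch matches the class text.
video = [[[[0.0, 0.0] for _ in range(2)] for _ in range(2)] for _ in range(2)]
video[1][0][1] = [1.0, 0.0]
text = [[1.0, 0.0]]

volume = similarity_volume(video, text)
pooled = pooled_similarity(video, text)
where_and_when = peak_location(volume, 0)
```

Pooling dilutes the single matching patch to a score of 0.125 and discards its position, whereas the volume keeps a score of 1.0 at frame 1, row 0, column 1.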
—
Multimodal Safety: The Cross-Modal Attack Surface Expands
As models grow more capable, the attack surface grows with them — and it’s not symmetric across languages.
Ford et al. conducted the first systematic cross-lingual, multimodal red-teaming study, comparing jailbreak vulnerability in English vs. Spanish across four frontier MLLMs (Claude Sonnet 4.5, GPT-5, Pixtral Large, Qwen Omni). The finding that the attack surface is language-dependent reveals that alignment failures have a mechanistic structure — they aren’t random weaknesses but systematic vulnerabilities that shift with language and modality. A jailbreak that works in English may fail in Spanish, but a different one succeeds there. This means alignment must be evaluated across the full deployment language set, not just English.
General Hazard Detection by Ng et al. tackled the abstract concept of “hazard” — which can’t be reduced to fixed categories — by asking whether multimodal models can reason about hazard in a truly open-vocabulary way. The challenge is that hazard is context-dependent: a knife in a kitchen is not a hazard; a knife in a children’s sandbox is. This connects multimodal AI to safety-critical applications where the set of possible hazards is unknown at design time.
—
Beyond 2D: 3D and Video Understanding
HorizonStream by Cheng et al. diagnosed a fundamental architectural mismatch in streaming 3D reconstruction: streaming geometry is temporally heterogeneous (some evidence is short-lived, some persistent), but current architectures impose uniform influence patterns. Their long-horizon attention mechanism dynamically allocates attention across temporal scales — a solution that generalizes beyond 3D to any streaming perception task.
GenRecon by Schmid et al. tightly coupled reconstruction with a generative 3D prior (Trellis.2), casting scene reconstruction as conditional generation over spatially-localized chunks. The key insight: by tiling large scenes into overlapping chunks and inheriting the fidelity of the generative prior, reconstruction quality jumps while scaling to large scenes. This convergence of reconstruction and generation could define the next generation of 3D understanding.
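The tiling step on its own is easy to sketch. The following assumes cubic, axis-aligned chunks on an integer grid with a fixed overlap; `chunk_starts` and `tile_scene` are hypothetical helpers, and how GenRecon actually sizes chunks or conditions the generative prior on each one is not covered here.

```python
from itertools import product

def chunk_starts(extent, chunk, overlap):
    """Start offsets along one axis so chunks of length `chunk` cover [0, extent)
    and neighbouring chunks share at least `overlap` units."""
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk")
    if extent <= chunk:
        return [0]
    stride = chunk - overlap
    starts = list(range(0, extent - chunk, stride))
    starts.append(extent - chunk)  # final chunk sits flush with the far edge
    return starts

def tile_scene(size, chunk, overlap):
    """Return (x0, y0, z0) origins of cubic chunks covering a box of the given size."""
    axes = [chunk_starts(extent, chunk, overlap) for extent in size]
    return list(product(*axes))

origins = tile_scene((10, 10, 4), chunk=4, overlap=1)
```

For a 10 x 10 x 4 scene this yields nine chunk origins, with starts at 0, 3, and 6 along each long axis. The overlap gives adjacent chunks shared context, which is what lets per-chunk outputs be stitched into one scene.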
ToolMerge by Shlapentokh-Rothman et al. used an LLM-based planner to decompose video queries into specialized tool calls. Based on the query, the planner decides which visual tools to invoke (object detector, motion estimator, scene segmenter) and then merges their outputs. This modular approach to video understanding mirrors the broader trend: orchestrate specialists, don’t train a monolithic model.
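The plan, dispatch, and merge loop can be sketched as follows. This is a deliberately simplified stand-in: keyword rules replace the LLM planner, the three tools are stubs returning canned results, and all names are hypothetical. It shows the control flow only, not ToolMerge's implementation.

```python
# Hypothetical stand-ins for real perception tools; each returns a small result dict.
def detect_objects(video):
    return {"objects": ["person", "ball"]}

def estimate_motion(video):
    return {"motion": "ball moves left to right"}

def segment_scene(video):
    return {"regions": ["grass", "sky"]}

TOOLS = {
    "detect_objects": detect_objects,
    "estimate_motion": estimate_motion,
    "segment_scene": segment_scene,
}

# Keyword rules stand in for the LLM planner, which would choose tools from the query.
KEYWORDS = {
    "detect_objects": ("what", "who", "object"),
    "estimate_motion": ("moving", "direction", "speed"),
    "segment_scene": ("where", "background", "scene"),
}

def plan(query):
    """Return the names of the tools this query appears to need."""
    q = query.lower()
    return [name for name, words in KEYWORDS.items() if any(w in q for w in words)]

def answer_context(query, video):
    """Run only the planned tools and merge their outputs into one context dict."""
    merged = {}
    for name in plan(query):
        merged.update(TOOLS[name](video))
    return merged

context = answer_context("Which direction is the ball moving?", "clip.mp4")
```

Here only the motion estimator runs, so the merged context contains just the `motion` entry. A query that mentions objects, motion, and the scene would trigger all three tools.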
—
The Theme That Unites It All
Across every sub-area — visual reasoning, generation, understanding, safety, 3D, video, audio — one pattern emerges consistently: modularity through specialization. ETCHR decouples visual editing from reasoning. ToolMerge decomposes queries into tool calls. Bernini separates semantic planning from pixel rendering. The “sweet spot” paper shows optimal alignment requires hybrid objectives.
No single architecture dominates all multimodal tasks. The best multimodal systems are orchestrated collections of specialized components. The open question for the coming year is how to design the coordination layer that routes information between them. Is it an LLM that calls tools? A learned attention mechanism? A diffusion-based planner? The answer likely depends on the task, so the real work is understanding when to use which coordination strategy.
—
Part of the Frontier AI Research Digest backfill series (May 2025 – May 2026). 54 papers surveyed across vision-language models, multimodal understanding, generation, and evaluation.
