The Year AI Learned to Plan Before It Rendered: Video & Image Generation’s Architecture Shift

37 papers surveyed | May 2025 – May 2026

Video and image generation research hit a turning point this year. The field moved past the era of “more compute, better pixels” and discovered something more fundamental: the hardest problems in visual generation aren’t about rendering — they’re about planning what to render.

The central insight across these 37 papers is that the most successful systems separate semantic planning from pixel rendering, letting language models handle the “what” and diffusion models handle the “how.” This division of labor, which feels obvious in retrospect, produced better video, better long-form content, more controllable generation, and the first credible path toward physical realism.

Here’s what the year’s most important work reveals about where visual generation is heading.

Semantic Planning Meets Pixel Generation

The year’s defining architectural insight comes from “Bernini: Latent Semantic Planning for Video Diffusion” (Bernini Team, May 2026). Multimodal LLMs and diffusion models have each reached remarkable maturity in complementary domains — MLLMs excel at reasoning over heterogeneous inputs with strong semantic grounding, while diffusion models synthesize pixels with photorealistic fidelity. Bernini’s contribution is recognizing that these strengths are different and should remain so. By giving MLLMs responsibility for semantic planning — deciding what should happen, when, and in what order — and reserving diffusion models for pixel rendering — executing the plan with visual fidelity — the system produces videos that are both semantically coherent and visually compelling.

> Why this matters: Rendering and planning require fundamentally different capabilities, and forcing a single model to do both creates unnecessary compromises. If the split holds up, later video generation systems are likely to adopt it.

“DrawVideo: Generating Long Video from Storyboard Keyframe Sketches” (Xu et al., May 2026) extended this paradigm to long-form video by decomposing long videos into independently controllable shots. Each shot is defined by a black-and-white sketch (layout and pose), an appearance prompt (visual style), and motion metadata (camera movement, object motion). Sketch-guided generation gives users fine-grained control over pose, composition, and layout — exactly the kind of controllability that text-only prompts can’t provide. The decomposition into shots mirrors how professional animators work: blocking out keyframes first, filling in motion later.
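
The three per-shot inputs described above map naturally onto a small data structure. The sketch below is only an illustration of that decomposition, not DrawVideo's actual interface: the class names, field names, `duration_s`, and `render_jobs` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MotionMetadata:
    camera_movement: str   # e.g. "static", "pan_left" (hypothetical vocabulary)
    object_motion: str     # free-text description of how subjects move


@dataclass
class Shot:
    sketch_path: str         # black-and-white sketch: layout and pose
    appearance_prompt: str   # visual style
    motion: MotionMetadata   # camera movement and object motion
    duration_s: float = 4.0  # hypothetical field, not mentioned in the paper summary


@dataclass
class Storyboard:
    shots: List[Shot] = field(default_factory=list)

    def total_duration(self) -> float:
        return sum(shot.duration_s for shot in self.shots)


def render_jobs(board: Storyboard) -> List[Dict[str, object]]:
    """Turn a storyboard into one independent render job per shot."""
    return [
        {
            "index": i,
            "sketch": shot.sketch_path,
            "prompt": shot.appearance_prompt,
            "camera": shot.motion.camera_movement,
        }
        for i, shot in enumerate(board.shots)
    ]
```

Because each job carries everything its shot needs, a single shot can be edited or regenerated without touching the others, which is the practical payoff of shot-level decomposition.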

Physical Realism and Motion Consistency

Video generation’s fundamental weakness has always been the same: motion that looks plausible but is physically wrong. Objects float, trajectories bend impossibly, and acceleration follows visual aesthetics rather than physics. Two papers attacked this problem from complementary directions.

“LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation” (Jiang et al., May 2026) extracts motion priors from the unlabeled videos that diffusion models already train on. Rather than relying on external physics simulators or curated physics datasets, LaMo learns what motions are physically plausible by analyzing frame-to-frame changes in unlabeled training data. The latent motion prior provides implicit physics understanding — the model internalizes that objects don’t teleport, that momentum carries through frames, that gravity pulls downward — without explicit physics supervision.

> Key advantage: Self-supervised approaches scale with data. If the motion prior keeps improving as more video is added, this is a path to physically realistic generation without expensive physics simulation at training time.

“Geo-Align: Video Generation Alignment via Metric Geometry Reward” (Li et al., May 2026) tackles the same problem from the reward side. For camera-controlled video generation — where you specify camera paths and want consistent geometry — the metric geometry reward provides alignment signal based on known geometric relationships. Physical scales and camera parameters should be consistent across frames; Geo-Align’s reward function penalizes violations of these constraints. Where LaMo learns from observation, Geo-Align enforces from principle.
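
To make "penalizes violations" concrete, here is a toy reward that scores how closely the camera path recovered from a generated clip follows the commanded path. It is not Geo-Align's reward function, which the summary above doesn't spell out. It assumes a hypothetical pose estimator has already produced per-frame camera positions in the same metric units as the commanded path.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def geometry_reward(commanded: Sequence[Vec3], estimated: Sequence[Vec3]) -> float:
    """Negative mean Euclidean error between commanded and estimated camera positions.

    A reward of 0.0 means the generated video follows the requested path exactly;
    larger deviations give more negative rewards.
    """
    if not commanded or len(commanded) != len(estimated):
        raise ValueError("paths must be non-empty and the same length")
    total_error = sum(math.dist(c, e) for c, e in zip(commanded, estimated))
    return -total_error / len(commanded)
```

A real reward would likely also cover rotation and scale consistency across frames. The point of the sketch is only that the signal comes from known geometry rather than from learned preferences.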

> The likely answer: Self-supervised priors for broad motion coverage, geometric rewards for hard constraints where physics is non-negotiable. Both, not either-or.

Latent-to-Pixel Breakthroughs

Most generation systems use a latent-to-pixel decoder that is reconstruction-oriented — optimized to invert the encoder rather than synthesize new details. This creates a ceiling on generation quality: if the latent space loses information, the decoder can’t recover it.

“PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion” (Lu et al., May 2026) breaks this ceiling by adding a pixel diffusion layer after latent decoding. The key insight is that decoders optimized for reconstruction accuracy don’t know how to add detail — they know how to reconstruct detail that was preserved in the latent space. PiD’s pixel diffusion layer adds megapixel-scale detail efficiently, fundamentally decoupling generation quality from latent space constraints.

“VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation” (Tan et al., May 2026) shifted the acceleration paradigm from “cache and hope” to “decompose and reconstruct.” Previous cache-based approaches for diffusion models reused stale computations, causing quality degradation over the denoising trajectory. VDE decomposes the velocity field into components and estimates each one dynamically, reconstructing the full velocity without stale cache entries. The training-free aspect is crucial: existing models get the speedup without retraining.
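
The difference between reusing a stale value and estimating a fresh one shows up even in a one-dimensional toy. The sketch below is not VDE's velocity decomposition, whose components the summary above doesn't specify. It assumes a made-up velocity `v(t) = 2t`, skips the "model call" on every other Euler step, and compares two ways of filling the gap: reusing the last computed velocity, or linearly extrapolating from the last two.

```python
def velocity(t: float) -> float:
    # Toy stand-in for an expensive model call; a real velocity also depends on x.
    return 2.0 * t


def sample(num_steps: int, mode: str) -> float:
    """Euler-integrate dx/dt = velocity(t) from t=0 to t=1.

    mode: "full"     calls the model at every step,
          "stale"    skips odd steps and reuses the last computed velocity,
          "estimate" skips odd steps and extrapolates from the last two computed velocities.
    """
    if mode not in ("full", "stale", "estimate"):
        raise ValueError(f"unknown mode: {mode}")
    dt = 1.0 / num_steps
    x = 0.0
    history = []  # (t, v) pairs for steps where the model was actually called
    for i in range(num_steps):
        t = i * dt
        if mode == "full" or i % 2 == 0:
            v = velocity(t)
            history.append((t, v))
        elif mode == "stale" or len(history) < 2:
            v = history[-1][1]  # reuse the cached value as-is
        else:
            (t0, v0), (t1, v1) = history[-2], history[-1]
            v = v1 + (v1 - v0) / (t1 - t0) * (t - t1)  # linear extrapolation
        x += v * dt
    return x
```

With ten steps, the extrapolating sampler lands closer to the full sampler than the stale-cache one does, at the same number of model calls. Linear extrapolation is only a stand-in for whatever estimator a real method uses.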

Video Insertion, Editing, and Multi-Shot Consistency

“Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework” (Cao et al., May 2026) tackles a deceptively hard problem: inserting an object into a video when the reference object has a different visual style than the target scene. The dual-stream framework runs video insertion and image style transfer concurrently, with closed-loop feedback ensuring style adaptation and insertion reinforce each other.

“EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation” (Vandersanden et al., May 2026) addresses multi-shot consistency: as shots change, entities should maintain consistent appearance without full-frame memory that entangles persistent entity information with transient scene context. Entity-centric memory — indexing latent patches by entity — separates persistent from transient information.
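
A minimal sketch of what "indexing latent patches by entity" could look like is below. It is an assumption-laden illustration, not EM-Vid's implementation: the `EntityMemory` class, the per-entity cap, and the use of plain float lists as "latent patches" are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Patch = List[float]  # stand-in for a latent patch


class EntityMemory:
    """Stores patches per entity, so transient scene context is never written."""

    def __init__(self, max_patches_per_entity: int = 4) -> None:
        if max_patches_per_entity < 1:
            raise ValueError("max_patches_per_entity must be at least 1")
        self._max = max_patches_per_entity
        self._store: Dict[str, List[Patch]] = defaultdict(list)

    def write(self, entity_id: str, patch: Patch) -> None:
        patches = self._store[entity_id]
        patches.append(patch)
        if len(patches) > self._max:
            del patches[0]  # drop the oldest patch for this entity

    def read(self, entity_ids: Iterable[str]) -> Dict[str, List[Patch]]:
        """Return stored patches only for the entities present in the next shot."""
        return {e: list(self._store[e]) for e in entity_ids if e in self._store}
```

When a new shot starts, only the entities that appear in it are read back. Memory cost then grows with the number of tracked entities rather than with the number of past frames.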

> Why this matters: Multi-shot consistency is one of the hardest open problems in video generation. EM-Vid shows that the solution may be structural rather than computational — better memory organization rather than more memory.

3D and Multi-View Generation

“GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction” (Schmid et al., May 2026) casts 3D scene reconstruction as conditional generation over overlapping chunks, inheriting the fidelity of state-of-the-art generative shape models. By tiling large scenes into chunks and generating each one conditioned on its neighbors, GenRecon scales to large scene extents without giving up the fidelity it inherits from those generative priors.
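
The tiling idea can be sketched in one dimension. The code below is a simplified illustration rather than GenRecon's chunking scheme: real scenes are 3D, and the function names, the fixed overlap, and the "condition on the previous chunk" ordering are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[int, int]  # half-open interval [start, end)


def chunk_ranges(extent: int, chunk: int, overlap: int) -> List[Range]:
    """Tile [0, extent) into chunks of at most `chunk` units that overlap by `overlap`."""
    if extent <= 0:
        raise ValueError("extent must be positive")
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk")
    stride = chunk - overlap
    ranges: List[Range] = []
    start = 0
    while True:
        end = min(start + chunk, extent)
        ranges.append((start, end))
        if end == extent:
            return ranges
        start += stride


def generation_schedule(ranges: List[Range]) -> List[Dict[str, Optional[Range]]]:
    """For each chunk, record the region it shares with the previously generated chunk."""
    schedule: List[Dict[str, Optional[Range]]] = []
    for i, (start, end) in enumerate(ranges):
        shared: Optional[Range] = None
        if i > 0:
            shared = (start, min(ranges[i - 1][1], end))
        schedule.append({"chunk": (start, end), "shared_with_previous": shared})
    return schedule
```

The shared region is what a chunk would be conditioned on, so neighboring chunks agree where they meet. Scene size then affects only the number of chunks, not the size of any single generation call.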

“Good Token Hunting: A Hitchhiker’s Guide to Token Selection for Visual Geometry Transformers” (Zheng et al., May 2026) addresses the quadratic cost of global attention in multi-view 3D transformers by restricting key/value tokens per query. The practical recipes for when and how to prune tokens reflect a maturing understanding: not all tokens are equally important, and the right selection strategy preserves fidelity while cutting cost.
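
One simple way to restrict key/value tokens per query is to keep only the top-k highest-scoring keys. The sketch below shows that rule in plain Python. It is one possible selection strategy chosen for illustration, not a recipe taken from the paper, and `topk_attention` is a hypothetical name.

```python
import math
from typing import List

Vector = List[float]


def topk_attention(queries: List[Vector], keys: List[Vector],
                   values: List[Vector], k: int) -> List[Vector]:
    """Scaled dot-product attention where each query attends to its k best keys only."""
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    scale = 1.0 / math.sqrt(len(keys[0]))
    out_dim = len(values[0])
    outputs: List[Vector] = []
    for q in queries:
        # Note: scoring every key here is for clarity. A practical method must pick
        # tokens more cheaply than this, or the quadratic cost is not avoided.
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        top = max(scores[j] for j in keep)
        weights = [math.exp(scores[j] - top) for j in keep]  # stable softmax
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, keep)) / total
            for d in range(out_dim)
        ])
    return outputs
```

With `k` equal to the number of keys this reduces to ordinary softmax attention, and with `k = 1` each query simply copies the value of its best-matching key. The interesting regime is in between, where the fidelity-versus-cost trade-off is set.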

What’s Next: The Open Questions That Matter

1. Self-supervised priors or geometric rewards for physical realism? LaMo scales with data; Geo-Align explicitly penalizes violations of known geometric constraints. The likely answer: both, with priors for broad coverage and rewards for hard constraints.

2. Does entity-centric memory generalize? If EM-Vid’s approach extends beyond video to images, 3D, and multimodal systems, the memory bottleneck shifts from capacity to structure — from “how much can we store” to “how well do we organize it.”

3. Will training-free acceleration define the next generation? VDE shows that sampling speedups don’t require retraining, and PiD decouples output detail from the latent space by adding a pixel diffusion stage after decoding. If both results hold across modalities, the cost of visual generation drops significantly.

The year’s research tells a clear story: visual generation is moving from brute-force rendering to intelligent planning. The models that win will be the ones that think about what to generate as carefully as how to generate it.

This analysis is part of the Frontier AI Research Digest backfill series (May 2025 – May 2026), surveying 37 papers across video generation, image generation, 3D reconstruction, and evaluation methods.
