Week 22, 2026 — Vision & Multimodal Systems

Vision-language models made strides in high-resolution perception, 3D reasoning, video efficiency, and unified digital human generation.

CVSearch: Cognitive Visual Search for High-Resolution MLLMs

CVSearch by Liupeng Li et al. addresses the coverage-efficiency dilemma in high-resolution image perception for MLLMs. It schedules search strategies dynamically: it first tries expert-assisted search, and only when the expert fails does it trigger a novel Semantic Guided Adaptive Patching step, which splits the image into semantically consistent regions instead of a rigid grid. This eliminates object fragmentation and achieves SOTA accuracy on HR benchmarks with substantially improved efficiency. The method is training-free. Paper
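The expert-first, patch-on-failure scheduling described above can be sketched as a simple fallback. This is a minimal illustration, not the paper's implementation: `schedule_search`, `toy_expert`, and `toy_patching` are hypothetical names, and the boxes are made-up values.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def schedule_search(
    expert_search: Callable[[str], Optional[Box]],
    adaptive_patching: Callable[[str], List[Box]],
    query: str,
) -> Tuple[str, List[Box]]:
    """Try the cheap expert first; fall back to patching only if it fails."""
    hit = expert_search(query)
    if hit is not None:
        return "expert", [hit]
    # Expert failed: pay for the broader, semantically guided search.
    return "adaptive_patching", adaptive_patching(query)


# Toy stand-ins (assumptions): a lookup-table "expert" and a fixed patch list.
def toy_expert(query: str) -> Optional[Box]:
    known = {"stop sign": (40, 60, 120, 140)}
    return known.get(query)


def toy_patching(query: str) -> List[Box]:
    return [(0, 0, 512, 384), (512, 0, 1024, 384)]


route_a, regions_a = schedule_search(toy_expert, toy_patching, "stop sign")
route_b, regions_b = schedule_search(toy_expert, toy_patching, "license plate")
print(route_a, regions_a)
print(route_b, regions_b)
```

The point of the design is that the expensive path runs only for queries the expert cannot localize, which is where the efficiency gain would come from.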

GASP: Teaching VLMs 3D Without 3D QA Data

GASP (Geometric-Aware Spatial Priors) by Chun-Hsiao Yeh et al. argues that genuine spatial understanding should emerge from fundamental geometric priors, not VQA supervision. A small correspondence head is applied as deep supervision across all layers and trained with contrastive losses on point correspondences and depth consistency from video scenes. Results: internal correspondence accuracy rises from below 5% to over 70%, translating to +18.2% on All-Angles Bench and +29.0% on VSI-Bench. No 3D VQA data is used. Paper
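To make "contrastive losses on point correspondences" concrete, here is a minimal sketch using an InfoNCE-style loss. The summary above does not specify GASP's exact loss, so InfoNCE, the cosine similarity, and the temperature of 0.07 are all assumptions chosen for illustration; the feature vectors are toy values.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    feats_a: List[Sequence[float]],
    feats_b: List[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """feats_a[i] and feats_b[i] describe the same scene point in two frames.

    Each anchor should score highest against its own match; every other
    point in the second frame acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(feats_a):
        logits = [cosine(anchor, cand) / temperature for cand in feats_b]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(feats_a)


frame_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(frame_a, [[0.9, 0.1], [0.1, 0.9]])   # correct matches
shuffled = info_nce(frame_a, [[0.1, 0.9], [0.9, 0.1]])  # swapped matches
print(aligned < shuffled)
```

The loss is near zero when matched points have similar features and grows when the pairing is wrong, which is the signal a correspondence head would be trained on.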

VideoMLA: 92.7% KV Cache Reduction for Video Diffusion

VideoMLA by Hidir Yesiltepe et al. is the first study of Multi-Head Latent Attention (MLA) for video diffusion. Although video attention is not inherently low-rank, the MLA bottleneck becomes the effective rank during training, which preserves quality. The method reduces per-token KV memory by 92.7%, matches short-horizon baselines, achieves the best score at long horizons, and improves throughput by 1.23x on a B200. Paper
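For a sense of scale, a 92.7% reduction means the cache keeps about 7.3% of its original size, roughly a 13.7x shrink. The arithmetic below is a rough sketch: the head count, head dimension, and latent width are hypothetical values picked to land near the reported figure, not VideoMLA's actual configuration, and it ignores any extra positional key that some MLA variants also cache.

```python
def kv_floats_per_token_mha(num_heads: int, head_dim: int) -> int:
    # Standard attention caches one key and one value vector per head.
    return 2 * num_heads * head_dim


def reduction(baseline: int, compressed: int) -> float:
    # Fraction of per-token cache memory saved.
    return 1.0 - compressed / baseline


def compression_factor(reduction_fraction: float) -> float:
    # How many times smaller the cache becomes.
    return 1.0 / (1.0 - reduction_fraction)


baseline = kv_floats_per_token_mha(num_heads=16, head_dim=128)  # hypothetical dims
latent_dim = 299  # hypothetical shared latent width cached instead of K and V

saved = reduction(baseline, latent_dim)
factor = compression_factor(0.927)
print(f"baseline floats per token: {baseline}")
print(f"saved: {saved:.1%}, reported 92.7% implies a {factor:.1f}x smaller cache")
```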

GPIC: 28 Trillion Pixels of Permissively Licensed Images

Stanford Vision Lab (Li Fei-Fei’s group) released GPIC, a corpus of ~28 trillion pixels across 100M training images, all permissively licensed for both research and commercial use. It includes a reference pixel-space flow matching baseline and a standardized benchmarking protocol. Available on HuggingFace. Paper
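A quick back-of-the-envelope check on those two headline numbers gives the average image size. This only divides the approximate figures quoted above; the square-image equivalent is an illustration, not a claim about the actual resolution distribution.

```python
import math

total_pixels = 28 * 10**12   # ~28 trillion pixels (approximate, from the summary)
num_images = 100 * 10**6     # 100M training images

avg_pixels = total_pixels // num_images
square_side = math.sqrt(avg_pixels)  # side length of an equivalent square image

print(f"average pixels per image: {avg_pixels:,}")
print(f"equivalent square image: about {square_side:.0f} x {square_side:.0f}")
```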

Archon: Unified 7-Modality Digital Human Generation

Archon by Chong Bao et al. unifies seven modalities, spanning text, audio, motion, and visual signals, in a single pretrained autoregressive framework for holistic avatar generation. A memory-efficient semantic video reparameterization achieves a 4x token reduction for high-fidelity talking videos. “Thinking in Modality” decomposes ambiguous cross-modal tasks into stepwise chains. Paper

ETCHR: Image Editing as Reasoning Assistant

ETCHR (Editing To Clarify and Harness Reasoning) decouples image editing from understanding. A dedicated editor, trained via reasoning-trajectory SFT and VLM-derived rewards, plugs into any open- or closed-source MLLM without additional training. It raises Pass@1 by +4.82 on Qwen3-VL-8B and +5.47 on Gemini-3.1-Flash-Lite. Paper

Additional papers:
Reinforcement Learning with Robust Rubric Rewards (RLR³) — +4.7 points over Qwen3-VL-30B base model
Debiased Negative Mining for VLM OOD Detection — New SOTA for post-hoc OOD detection with VLMs
PhyGenHOI — Physically accurate 4D human-object interaction generation
Before the Shutter — 3D aesthetic portrait photography planning

Key insight: Vision is becoming 3D-aware, long-video capable, and more efficient. The separation of roles — editing vs. understanding, expert search vs. exhaustive search — is proving to be a winning paradigm.
