{"id":49,"date":"2026-05-31T13:27:54","date_gmt":"2026-05-31T17:27:54","guid":{"rendered":"https:\/\/monizesairesearch.com\/index.php\/2026\/05\/31\/weekly-research-digest-3\/"},"modified":"2026-05-31T20:26:08","modified_gmt":"2026-06-01T00:26:08","slug":"weekly-research-digest-3","status":"publish","type":"post","link":"https:\/\/monizesairesearch.com\/index.php\/2026\/05\/31\/weekly-research-digest-3\/","title":{"rendered":"Week 22, 2026 \u2014 Vision &#038; Multimodal Systems"},"content":{"rendered":"<p>Vision-language models made strides in high-resolution perception, 3D reasoning, video efficiency, and unified digital human generation.<\/p>\n<h2>CVSearch: Cognitive Visual Search for High-Resolution MLLMs<\/h2>\n<p><strong>CVSearch<\/strong> by Liupeng Li et al. addresses the coverage-efficiency dilemma in high-resolution image perception for MLLMs. It dynamically schedules search strategies: first trying expert-assisted search, and <em>only<\/em> triggering a novel Semantic Guided Adaptive Patching (semantically consistent regions instead of rigid grids) when the expert fails. This eliminates object fragmentation, achieving SOTA accuracy on HR benchmarks with substantially improved efficiency. The method is training-free. <a href=\"https:\/\/arxiv.org\/abs\/2605.23655v1\">Paper<\/a><\/p>\n<h2>GASP: Teaching VLMs 3D Without 3D QA Data<\/h2>\n<p><strong>GASP (Geometric-Aware Spatial Priors)<\/strong> by Chun-Hsiao Yeh et al. argues that genuine spatial understanding should emerge from fundamental geometric priors, not VQA supervision. A small correspondence head is applied as deep supervision across all layers and trained with contrastive losses on point correspondences and depth consistency from video scenes. Results: internal correspondence accuracy rises from below 5% to over 70%, which translates to +18.2% on All-Angles Bench and +29.0% on VSI-Bench. No 3D VQA data is used. 
<a href=\"https:\/\/arxiv.org\/abs\/2605.30231v1\">Paper<\/a><\/p>\n<h2>VideoMLA: 92.7% KV Cache Reduction for Video Diffusion<\/h2>\n<p><strong>VideoMLA<\/strong> by Hidir Yesiltepe et al. is the first study of Multi-Head Latent Attention (MLA) for video diffusion. Although video attention is not low-rank, the MLA bottleneck becomes the effective rank during training, preserving quality. It reduces per-token KV memory by 92.7%, matches short-horizon baselines, achieves the best score at long horizons, and improves throughput by 1.23x on a B200. <a href=\"https:\/\/arxiv.org\/abs\/2605.30351v1\">Paper<\/a><\/p>\n<h2>GPIC: 28 Trillion Pixels of Permissively Licensed Images<\/h2>\n<p>Stanford Vision Lab (Li Fei-Fei&#8217;s group) released <strong>GPIC<\/strong> \u2014 a corpus of ~28 trillion pixels across 100M training images, all permissively licensed for both research and commercial use. It includes a reference pixel-space flow matching baseline and a standardized benchmarking protocol. The corpus is available on Hugging Face. <a href=\"https:\/\/arxiv.org\/abs\/2605.30341v1\">Paper<\/a><\/p>\n<h2>Archon: Unified 7-Modality Digital Human Generation<\/h2>\n<p><strong>Archon<\/strong> by Chong Bao et al. unifies seven modalities (text, audio, motion, visual) in a single pretrained autoregressive framework for holistic avatar generation. A memory-efficient semantic video reparameterization achieves a 4x token reduction for high-fidelity talking videos. &#8220;Thinking in Modality&#8221; decomposes ambiguous cross-modal tasks into stepwise chains. <a href=\"https:\/\/arxiv.org\/abs\/2605.30311v1\">Paper<\/a><\/p>\n<h2>ETCHR: Image Editing as Reasoning Assistant<\/h2>\n<p><strong>ETCHR<\/strong> (Editing To Clarify and Harness Reasoning) decouples image editing from understanding. A dedicated editor, trained via reasoning trajectory SFT and VLM-derived rewards, plugs into any open- or closed-source MLLM without additional training. It raises Pass@1 by +4.82 on Qwen3-VL-8B and +5.47 on Gemini-3.1-Flash-Lite. 
<a href=\"https:\/\/arxiv.org\/abs\/2605.23897v1\">Paper<\/a><\/p>\n<p><strong>Additional papers:<\/strong><br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.30244v1\">Reinforcement Learning with Robust Rubric Rewards (RLR\u00b3)<\/a> \u2014 +4.7 points over Qwen3-VL-30B base model<br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.23797v1\">Debiased Negative Mining for VLM OOD Detection<\/a> \u2014 New SOTA for post-hoc OOD detection with VLMs<br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.30268v1\">PhyGenHOI<\/a> \u2014 Physically accurate 4D human-object interaction generation<br \/>\n&#8211; <a href=\"https:\/\/arxiv.org\/abs\/2605.30318v1\">Before the Shutter<\/a> \u2014 3D aesthetic portrait photography planning<\/p>\n<p><strong>Key insight:<\/strong> Vision is becoming 3D-aware, long-video capable, and more efficient. The separation of roles \u2014 editing vs. understanding, expert search vs. exhaustive search \u2014 is proving to be a winning paradigm.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Vision-language models made strides in high-resolution perception, 3D reasoning, video efficiency, and unified digital human generation. CVSearch: Cognitive Visual Search for High-Resolution MLLMs CVSearch by Liupeng Li et al. addresses the coverage-efficiency dilemma in high-resolution image perception for MLLMs. 
It dynamically schedules search strategies: first trying expert-assisted search, and only triggering a novel Semantic Guided [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":94,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,16],"tags":[],"class_list":["post-49","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-topic-03","category-weekly-digest"],"_links":{"self":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/49","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/comments?post=49"}],"version-history":[{"count":2,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/49\/revisions"}],"predecessor-version":[{"id":79,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/posts\/49\/revisions\/79"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/media\/94"}],"wp:attachment":[{"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/media?parent=49"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/categories?post=49"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monizesairesearch.com\/index.php\/wp-json\/wp\/v2\/tags?post=49"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}