MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
- Reuben Tan, Microsoft
The video introduces MindJourney, a framework for improving the spatial reasoning of Vision-Language Models (VLMs). VLMs excel at interpreting single images but struggle to infer the underlying three-dimensional world. Given a spatial reasoning question, MindJourney lets the VLM “imagine” moving through the scene: the VLM proposes trajectories in a simulated imagination space, and a world model generates novel views along these paths, expanding the available observations beyond the original single image. With this richer 3D context, the VLM can answer questions it previously found challenging.
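The loop described above can be sketched roughly as follows. This is a minimal illustration, not the MindJourney implementation: every name here (`imagine_and_answer`, `toy_propose`, `toy_render`, `toy_answer`) is a hypothetical placeholder, views are plain strings standing in for images, and the toy functions stand in for a real VLM and world model.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-ins: a real system would use image arrays and camera motions.
View = str
Action = str


def imagine_and_answer(
    question: str,
    image: View,
    propose: Callable[[str, View], Sequence[Sequence[Action]]],
    render: Callable[[View, Action], View],
    answer: Callable[[str, List[View]], str],
) -> str:
    """Expand one image into several imagined views, then answer the question."""
    views: List[View] = [image]
    # Assumed role of the VLM: propose candidate trajectories for this question.
    for trajectory in propose(question, image):
        current = image
        for action in trajectory:
            # Assumed role of the world model: generate the next view on the path.
            current = render(current, action)
            views.append(current)
    # The VLM answers using the original image plus all imagined views.
    return answer(question, views)


# Toy placeholders so the sketch runs end to end.
def toy_propose(question: str, image: View) -> List[List[Action]]:
    return [["forward"], ["turn_left", "forward"]]


def toy_render(view: View, action: Action) -> View:
    return f"{view}>{action}"


def toy_answer(question: str, views: List[View]) -> str:
    return f"answered from {len(views)} views"


result = imagine_and_answer(
    "Is the chair to the left of the table?",
    "img0",
    toy_propose,
    toy_render,
    toy_answer,
)
print(result)
```

The three callables are passed in as parameters so that the control flow stays separate from whichever VLM and world model are actually used. How trajectories are selected or scored is not covered in the description and is left out of the sketch.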