True spatial intelligence for multimodal agents transcends low-level geometric perception, evolving from knowing where things are to understanding what they're for. While existing benchmarks, such as VSI-Bench, effectively evaluate this foundational geometric stage, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1700 questions derived from diverse, egocentric indoor video scans. SFI-Bench is designed to systematically evaluate two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, inferring object affordances and context-dependent utility. Its tasks, including conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenge a model's capacity to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to integrate spatial memory with functional and external knowledge, highlighting a critical bottleneck. SFI-Bench thus provides an essential tool for measuring and driving progress towards more cognitively capable and truly grounded multimodal agents.
- † Mila, Université de Montréal
- ‡ New York University
- ** Work done while at Apple
