Wednesday, January 14, 2026

NarrativeTrack: Evaluating Video Language Models Beyond the Frame


Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, whereas video-specific MLLMs capture temporal context yet hallucinate entities' contexts. These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.
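To make the three CRP dimensions concrete, here is a minimal Python sketch of what an entity-centric probe could look like, assuming a simple record of temporally grounded entity appearances. The `EntityTrack` and `probe` names, the appearance tuples, and the toy string-matching scorer are illustrative assumptions, not the paper's actual pipeline or scoring protocol.

```python
# Hypothetical sketch (not the authors' code): representing CRP's three
# dimensions -- entity existence, entity changes, entity ambiguity -- as
# probes over temporally grounded entity records.
from dataclasses import dataclass, field
from enum import Enum


class CRPDimension(Enum):
    EXISTENCE = "entity_existence"   # is the entity present at a given time?
    CHANGES = "entity_changes"       # how does its context evolve over time?
    AMBIGUITY = "entity_ambiguity"   # can look-alike entities be told apart?


@dataclass
class EntityTrack:
    """A temporally grounded entity: identity plus per-segment context."""
    entity_id: str
    # (start_sec, end_sec, context description) for each appearance
    appearances: list[tuple[float, float, str]] = field(default_factory=list)

    def exists_at(self, t: float) -> bool:
        return any(s <= t <= e for s, e, _ in self.appearances)

    def context_at(self, t: float) -> str | None:
        for s, e, ctx in self.appearances:
            if s <= t <= e:
                return ctx
        return None


def probe(track: EntityTrack, dim: CRPDimension, t: float, answer: str) -> bool:
    """Score a model's answer against the entity record (toy string match)."""
    if dim is CRPDimension.EXISTENCE:
        return (answer.lower() == "yes") == track.exists_at(t)
    if dim is CRPDimension.CHANGES:
        return (track.context_at(t) or "") in answer
    # AMBIGUITY: the answer must name the correct entity identity
    return track.entity_id in answer


if __name__ == "__main__":
    chef = EntityTrack("chef_1", [(0.0, 12.0, "chopping onions"),
                                  (30.0, 45.0, "plating the dish")])
    print(probe(chef, CRPDimension.EXISTENCE, 20.0, "no"))   # True: absent at 20s
    print(probe(chef, CRPDimension.CHANGES, 35.0,
                "the chef is plating the dish"))             # True: context matches
```

The point of the sketch is the progression the abstract describes: existence probes only require temporal persistence, change probes require tracking an entity's evolving context, and ambiguity probes require fine-grained identity discrimination among similar entities.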
