VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

November 30, 2025


This paper was accepted at the Learning from Evaluating the Evolving LLM Lifecycle workshop at NeurIPS 2025.

Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of 11 state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content, with the refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
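To make the 34% figure concrete, the following is a minimal sketch of how such a share could be computed: among samples whose joint image-text label is predicted incorrectly, count those where the image-only and text-only labels were both predicted correctly. The `Sample` record, its field names, and the toy data are assumptions made for illustration; they are not taken from the paper or its benchmark.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical per-sample record; field names are illustrative only.
    image_correct: bool  # model classified the image alone correctly
    text_correct: bool   # model classified the text alone correctly
    joint_correct: bool  # model classified the image-text pair correctly


def compositional_error_share(samples):
    """Share of joint errors that occur although both unimodal labels were correct."""
    joint_errors = [s for s in samples if not s.joint_correct]
    if not joint_errors:
        return 0.0  # no joint errors, so the share is defined as zero here
    both_unimodal_ok = sum(
        1 for s in joint_errors if s.image_correct and s.text_correct
    )
    return both_unimodal_ok / len(joint_errors)


# Toy data, invented for this example: three joint errors, one of which
# happens despite correct image-only and text-only predictions.
toy = [
    Sample(image_correct=True, text_correct=True, joint_correct=False),
    Sample(image_correct=True, text_correct=False, joint_correct=False),
    Sample(image_correct=True, text_correct=True, joint_correct=True),
    Sample(image_correct=False, text_correct=False, joint_correct=False),
]
share = compositional_error_share(toy)
print(f"{share:.2%} of joint errors occur with both modalities classified correctly")
```

A high share under this definition would point to a failure of combining the two modalities, since it excludes errors that can be traced to misreading either input alone.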

