Wednesday, March 25, 2026

SafetyPairs: Isolating Safety-Critical Image Features with Counterfactual Image Generation


This paper was accepted at the Principled Design for Reliable AI — Interpretability, Robustness, and Security Across Modalities Workshop at ICLR 2026.

What exactly makes a particular image unsafe? Systematically distinguishing between benign and problematic images is a challenging problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to a given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that highlights weaknesses in vision-language models' abilities to distinguish between subtly different images. Beyond evaluation, we find our pipeline serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.
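To make the pipeline concrete, here is a minimal sketch of the counterfactual-pair generation loop described above. The `edit_fn` and `judge_fn` callables are hypothetical stand-ins for an instruction-following image-editing model and a policy-conditioned safety judge; they are not the paper's actual models or prompts, and the filtering logic is an assumption about how label flips might be verified.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SafetyPair:
    safe_image: bytes      # original, policy-compliant image
    unsafe_image: bytes    # counterfactual with the safety-critical edit
    policy: str            # safety policy the pair isolates
    edit_instruction: str  # targeted change applied to flip the label

def make_safety_pair(
    image: bytes,
    policy: str,
    edit_instruction: str,
    edit_fn: Callable[[bytes, str], bytes],  # hypothetical image-editing model
    judge_fn: Callable[[bytes, str], bool],  # hypothetical judge: True if safe under policy
) -> Optional[SafetyPair]:
    """Apply a targeted edit that flips the image's safety label under
    `policy`, leaving safety-irrelevant details unchanged."""
    edited = edit_fn(image, edit_instruction)

    # Keep the pair only if the edit actually flipped the label:
    # the original must be judged safe and the counterfactual unsafe.
    if judge_fn(image, policy) and not judge_fn(edited, policy):
        return SafetyPair(image, edited, policy, edit_instruction)
    return None  # edit failed to flip the label; discard the candidate
```

Because everything except the targeted edit is held fixed, any change in a model's prediction across a pair can be attributed to the safety-relevant feature alone, which is what makes the pairs useful both for fine-grained evaluation and as training augmentation.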
