arXiv:2407.14412

DEAL: Disentangle and Localize Concept-level Explanations for VLMs

Published on Jul 19, 2024

Abstract

Large pre-trained Vision-Language Models (VLMs) have become ubiquitous foundations for other models and downstream tasks. Although powerful, our empirical results reveal that such models may fail to identify fine-grained concepts: their explanations with respect to fine-grained concepts are entangled and mislocalized. To address this issue, we propose to DisEntAngle and Localize (DEAL) the concept-level explanations for VLMs without human annotations. The key idea is to encourage the concept-level explanations to be distinct from one another while remaining consistent with the category-level explanation. We conduct extensive experiments and ablation studies on a wide range of benchmark datasets and vision-language models. Our empirical results demonstrate that the proposed method significantly improves the model's concept-level explanations in terms of disentanglability and localizability. Surprisingly, the improved explainability also reduces the model's reliance on spurious correlations, which in turn benefits prediction accuracy.
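
The abstract describes two complementary objectives: making per-concept explanations distinct from one another, and keeping them consistent with the category-level explanation. The sketch below shows one plausible way to express these as losses over Grad-CAM-style heatmaps; the function names, tensor shapes, and exact loss forms are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def disentangle_loss(concept_maps: torch.Tensor) -> torch.Tensor:
    """Encourage per-concept explanation maps to be distinct.

    concept_maps: (K, H, W) tensor, one heatmap per concept.
    Returns the mean pairwise cosine similarity between distinct
    maps, which the training loop would minimize.
    """
    k = concept_maps.shape[0]
    flat = F.normalize(concept_maps.reshape(k, -1), dim=-1)   # unit-norm maps
    sim = flat @ flat.T                                       # (K, K) cosine sims
    off_diag = sim * (1.0 - torch.eye(k, device=sim.device))  # zero the diagonal
    return off_diag.abs().sum() / (k * (k - 1))

def consistency_loss(concept_maps: torch.Tensor,
                     category_map: torch.Tensor) -> torch.Tensor:
    """Keep the aggregate of concept explanations consistent with
    the category-level explanation.

    category_map: (H, W) heatmap for the whole category.
    """
    combined = concept_maps.sum(dim=0)  # pooled concept evidence
    return F.mse_loss(F.normalize(combined.flatten(), dim=-1),
                      F.normalize(category_map.flatten(), dim=-1))

# Hypothetical training objective combining both regularizers with the
# task loss; lambda_d and lambda_c are illustrative weights.
# loss = task_loss + lambda_d * disentangle_loss(maps) \
#                  + lambda_c * consistency_loss(maps, cat_map)
```

Under this reading, minimizing the off-diagonal similarity pushes concept explanations apart (disentanglement), while the consistency term ties their union back to where the category-level evidence lies (localization).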
