<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="description" content="Causal Graphical Models for Vision-Language Compositional Understanding">
<meta name="keywords" content="Vision-and-Language, Compositionality, Retrieval">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Causal Graphical Models for Vision-Language Compositional Understanding</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="static/css/bulma.min.css">
<link rel="stylesheet" href="static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="static/css/bulma-slider.min.css">
<link rel="stylesheet" href="static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="static/css/index.css">
<link rel="icon" href="static/images/favicon.png">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="static/js/fontawesome.all.min.js"></script>
<script src="static/js/bulma-carousel.min.js"></script>
<script src="static/js/bulma-slider.min.js"></script>
<script src="static/js/index.js"></script>
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300&display=swap" rel="stylesheet">
<style>
body {
  font-family: 'Roboto', sans-serif;
  background-color: #e8f5e9;
  color: #333;
  line-height: 1.6;
}
.jumbotron {
  background: linear-gradient(135deg, #388e3c, #66bb6a);
  color: white;
  padding: 2rem 1rem;
  margin-bottom: 1rem;
  border-radius: 0.3rem;
}
.display-4 {
  font-size: 3rem;
  font-weight: 700;
}
.lead {
  font-size: 1rem;
  font-weight: 300;
  color: white;
}
.section {
  padding: 1.5rem 0;
}
.section-title {
  border-bottom: 2px solid #2e7d32;
  margin-bottom: 1rem;
  padding-bottom: 0.5rem;
  color: #1b5e20;
}
.qualitative-img {
  max-width: 100%;
  border-radius: 8px;
  transition: transform 0.3s ease-in-out;
}
.qualitative-img:hover {
  transform: scale(1.05);
}
.bibtex-block {
  background-color: #c8e6c9;
  padding: 1rem;
  border-radius: 0.25rem;
  overflow-x: auto;
  font-family: monospace;
}
.footer {
  text-align: center;
  padding: 1rem 0;
  background-color: #a5d6a7;
}
.lead a {
  color: white;
  text-decoration: none;
}
.lead a:hover {
  text-decoration: underline;
}
.author-link {
  font-family: monospace;
  font-style: italic;
  margin: 0 10px;
}
.iclr-space {
  margin: 10px 0;
  font-size: 20px;
  color: #333;
}
title {
  font-weight: bold;
}
.button-container {
  display: flex;
  justify-content: center;
  gap: 10px;
  margin-top: 20px;
}
.icon-button {
  background-color: #333;
  color: white;
  border: none;
  padding: 10px 20px;
  border-radius: 20px;
  display: flex;
  align-items: center;
  gap: 5px;
  cursor: pointer;
}
.icon {
  height: 20px;
}
.section-content {
  max-width: 800px;
  margin: 0 auto;
}
.init-content {
  max-width: 700px;
  margin: 0 auto;
}
</style>
</head>
<body>
<div class="jumbotron text-center">
<img src="static/images/logo.png" alt="ICLR 2025 Logo" class="img-fluid mb-3" style="max-height: 80px;">
<h1 class="display-4">Causal Graphical Models for Vision-Language Compositional Understanding</h1>
<p class="lead">
<span class="iclr-space" style="margin-bottom: 2rem;">ICLR 2025<br></span>
<span class="author-link"><a href="https://github.com/FiorenzoParascandolo1" target="_blank">Fiorenzo Parascandolo</a></span>
<span class="author-link"><a href="https://nicholasmoratelli.github.io" target="_blank">Nicholas Moratelli</a></span>
<span class="author-link"><a href="https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=144" target="_blank">Enver Sangineto</a></span>
<span class="author-link"><a href="https://www.lorenzobaraldi.com/" target="_blank">Lorenzo Baraldi</a></span>
<span class="author-link"><a href="https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=1" target="_blank">Rita Cucchiara</a></span> <br>
University of Modena and Reggio Emilia <br>
</p>
<div class="button-container">
<span class="link-block">
<a href="https://github.com/aimagelab/COGT" class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<span class="link-block">
<a href="https://arxiv.org/pdf/2412.09353" class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<span class="link-block">
<a class="external-link button is-normal is-rounded is-dark">
🤗 Models
</a>
</span>
</div>
</div>
<div class="container section">
<div class="init-content">
<p><i>
This paper introduces <u><b>COGT</b></u>, a novel approach for enhancing the compositional understanding of Vision-Language Models
by modeling the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM).
</i></p>
</div>
</div>
<div class="container section">
<div class="section-content">
<h1 class="section-title">Abstract</h1>
<p>
Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional
properties of human language, usually modeling an image caption as a "bag of words". In this paper, we model
the dependency relations among textual and visual tokens using a <i><b>Causal Graphical Model (CGM)</b></i>, built with a
<i><b>dependency parser</b></i>, and we train a decoder conditioned on the VLM visual encoder. Unlike standard
autoregressive or parallel predictions, our decoder's generative process is partially ordered, following the CGM
structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence,
discarding spurious correlations. Through extensive experiments on five compositional benchmarks, we show that
our method outperforms all state-of-the-art compositional approaches by a large margin and also improves over
methods trained on much larger datasets.
</p>
</div>
</div>
<div class="container section">
<div class="section-content">
<h1 class="section-title">Method</h1>
<h3 class="section-title" style="font-size: 1.5em; margin-top: 2rem;">Causal Graphical Model (CGM) Construction</h3>
<div style="text-align: center; margin-top: 2rem; margin-bottom: 2rem;">
<img src="static/images/method.png" alt="Overview of the Causal Graphical Model construction" class="qualitative-img">
</div>
<p>
We use an off-the-shelf <i>dependency parser</i>, which creates a syntactic tree from a given textual sentence. Specifically, given a caption, the dependency parser automatically builds a <i>Dependency Tree</i> (DT), in which each node is associated with a caption word and each edge represents a syntactic dependency relation between two words.
The DT, together with the visual features extracted from the image by a frozen visual encoder, is used to build a CGM, which describes the dependency relations among image patches and textual tokens. Our token prediction strategy is based on the dependency relations contained in this CGM.
The rationale behind this approach is illustrated in the figure using the caption "A brown bird has a small yellow head": for instance, in the resulting DT, the adjective "brown" depends on the noun "bird".
</p>
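<p>
As a concrete illustration, the snippet below prints the dependency edges of the example caption. This is only a minimal sketch under assumptions: spaCy and the <code>en_core_web_sm</code> model are stand-ins for an off-the-shelf parser, not necessarily the one used in the paper.
</p>
<div class="bibtex-block">
<pre>
# Illustrative sketch: extracting the Dependency Tree (DT) edges of a caption.
# spaCy is used here as one possible off-the-shelf dependency parser.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A brown bird has a small yellow head")

# Each (head, child) pair is a syntactic dependency edge of the DT.
for token in doc:
    if token.head is not token:  # the root points to itself; skip it
        print(f"{token.head.text} -> {token.text} ({token.dep_})")
# e.g. "bird -> brown (amod)": the adjective "brown" depends on the noun "bird"
</pre>
</div>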
<h3 class="section-title" style="font-size: 1.5em; margin-top: 2rem;">Dependency Guided Attention for Token Prediction</h3>
<div style="text-align: center; margin-top: 2rem; margin-bottom: 2rem;">
<img src="static/images/architecture.png" alt="High-level architecture of the decoder with Dependency Guided Attention" class="qualitative-img">
</div>
<p>
The figure presents the high-level architecture of our decoder. Each block of \(\mathcal{D}\) is composed of two layers.
In the first layer, we compute the self-attention of each masked embedding \(\mathbf{m}_j\) with itself, jointly with the attention of \(\mathbf{m}_j\) with all the visible embeddings \(\mathbf{v}_{i_1}, ..., \mathbf{v}_{i_k}\), where
\[\mathbf{PA}(W_j) = \{ W_{i_1}, ..., W_{i_k}, S_j, Z_1, ..., Z_m \}.\]
Note that there is no attention between \(\mathbf{m}_{j_1}\) and \(\mathbf{m}_{j_2}\), with \(j_1 \neq j_2\).
In the same layer, we compute the self-attention of each visible embedding \(\mathbf{v}_j\) with itself, jointly with the attention of \(\mathbf{v}_j\) with \(\mathbf{v}_{i_1}, ..., \mathbf{v}_{i_k}\).
Note that there is no information leak, since \(\mathbf{m}_j\), later used for the final prediction, has no direct or indirect access to \(\mathbf{v}_j\).
We call this <em>Dependency Guided Attention</em> to differentiate it from standard self-attention.
In the second layer of each block of \(\mathcal{D}\), both the masked (\(\mathbf{m}_j\)) and the visible (\(\mathbf{v}_j\)) embeddings attend to the visual features in \(\mathcal{Z}\) using cross-attention, thereby implementing the dependence between \(W_j\) and \(Z_1, ..., Z_m\).
Finally, after the last block of \(\mathcal{D}\), we discard the visible-token embeddings and feed each masked-token final embedding to a linear layer that computes a posterior distribution over the vocabulary of textual terms.
</p>
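<p>
The sketch below shows how such a Dependency Guided Attention mask over the textual embeddings could be built from the parent sets. It is a minimal illustration under assumptions, not the authors' code: it only covers the first (textual) layer, it omits \(S_j\) and the cross-attention to \(\mathcal{Z}\), and it assumes that the textual parents of each word are read directly off the Dependency Tree.
</p>
<div class="bibtex-block">
<pre>
# Illustrative sketch of a Dependency Guided Attention mask (assumption-based,
# textual part only; S_j and the visual cross-attention are omitted).
import numpy as np

def dga_mask(parents):
    """Boolean (2n x 2n) mask: rows/cols 0..n-1 are visible embeddings v_j,
    rows/cols n..2n-1 are masked embeddings m_j; mask[q, k] = True means
    query q may attend to key k."""
    n = len(parents)
    mask = np.zeros((2 * n, 2 * n), dtype=bool)
    for j in range(n):
        mask[j, j] = True          # v_j attends to itself
        mask[n + j, n + j] = True  # m_j attends to itself
        for i in parents[j]:       # textual parents W_{i_1}, ..., W_{i_k} of W_j
            mask[j, i] = True      # v_j attends to its visible parents
            mask[n + j, i] = True  # m_j attends to its visible parents
    # No attention between two masked embeddings, and m_j never reaches v_j:
    # the token to be predicted is never leaked.
    return mask

# Toy parent sets roughly following the DT of
# "A brown bird has a small yellow head"
# (0: A, 1: brown, 2: bird, 3: has, 4: a, 5: small, 6: yellow, 7: head)
parents = {0: [2], 1: [2], 2: [3], 3: [], 4: [7], 5: [7], 6: [7], 7: [3]}
print(dga_mask(parents).astype(int))
</pre>
</div>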
</div>
</div>
<div class="container section">
<div class="section-content">
<h1 class="section-title">Qualitative Results</h1>
<div style="text-align: center; margin-top: 2rem;">
<img src="static/images/sugar_crepe.png" alt="Qualitative Result 1" class="qualitative-img">
<p class="caption">Qualitative results on sample images of SugarCrepe.</p>
</div>
<div style="text-align: center; margin-top: 2rem;">
<img src="static/images/color_swap.png" alt="Qualitative Result 2" class="qualitative-img">
<p class="caption">Qualitative results on sample images of ColorSwap.</p>
</div>
</div>
</div>
<div class="container section">
<h1 class="section-title">BibTeX</h1>
<div class="bibtex-block">
<pre>
@InProceedings{parascandolo2024causal,
  title={Causal Graphical Models for Vision-Language Compositional Understanding},
  author={Parascandolo, Fiorenzo and Moratelli, Nicholas and Sangineto, Enver and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of The Thirteenth International Conference on Learning Representations, ICLR},
  year={2025}
}
</pre>
</div>
</div>
<footer class="footer">
<p>© 2025 Causal Graphical Models for Vision-Language Compositional Understanding</p>
</footer>
<script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.9.3/dist/umd/popper.min.js"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"></script>
</body>
</html>