Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

[GitHub Code] | [Paper]

This repository contains the official model weights for the paper "Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality", accepted as a long paper at ACL 2026.

📌 Introduction

MACCO (MAsked Compositional Concept MOdeling) is a framework for improving compositional understanding in vision-language models (VLMs) such as CLIP. It addresses the "bag-of-words" limitation, in which a model matches the words in a caption but ignores how they are combined, by masking compositional concepts in one modality and reconstructing them conditioned on the full context from the other modality. This lets the model capture and align cross-modal compositional structures, such as object relations and attribute-object bindings, more effectively than standard contrastive training.
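
To make the "bag-of-words" limitation concrete, the toy sketch below compares two captions that use exactly the same words but bind attributes to different objects, and then masks one concept in a caption. It is only an illustration: the `[MASK]` token, the `mask_concept` helper, and the example captions are assumptions made for this sketch and are not taken from the paper or the MACCO code.

```python
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Count the words of a caption while ignoring their order."""
    return Counter(caption.lower().split())


def mask_concept(caption: str, concept: str, mask_token: str = "[MASK]") -> str:
    """Replace every word of `concept` found in `caption` with `mask_token`.

    Hypothetical toy helper: real concept masking would operate on tokens
    or image regions, not on whitespace-separated words.
    """
    concept_words = set(concept.lower().split())
    return " ".join(
        mask_token if word.lower() in concept_words else word
        for word in caption.split()
    )


caption_a = "a black dog chasing a white cat"
caption_b = "a white dog chasing a black cat"

# Same multiset of words, different attribute-object bindings.
same_bag = bag_of_words(caption_a) == bag_of_words(caption_b)

# Hide one concept; a model would have to recover it from the paired image.
masked = mask_concept(caption_a, "black dog")

print(same_bag)
print(masked)
```

A representation that only tracks which words occur cannot tell `caption_a` from `caption_b`, which is the failure mode the masked reconstruction objective is meant to counter.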

💻 Usage

You can load these checkpoints with the open_clip library (installable with `pip install open_clip_torch`).

import open_clip
import torch

# Path to the downloaded .pt file (e.g., 'MACCO-CLIP-ViT-B-32.pt')
pretrained_path = 'path/to/MACCO-CLIP-ViT-B-32.pt'
device = "cuda" if torch.cuda.is_available() else "cpu"

# Create model and load the MACCO weights
model, _, image_preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', 
    pretrained=pretrained_path, 
    device=device
)
model = model.eval()

print("MACCO model loaded successfully!")
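
Once the model is loaded, one simple sanity check for compositionality is to score an image against a correct caption and a hard negative that reuses the same words. The sketch below is an assumption about typical usage, not a procedure from the paper: `cosine`, `pick_caption`, and `encode_with_open_clip` are hypothetical helper names, and the scoring is plain cosine similarity over embeddings. The `open_clip` calls are imported lazily inside the helper, so the scoring part can be tried with synthetic vectors before any checkpoint is downloaded.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pick_caption(image_emb: Sequence[float], caption_embs: Sequence[Sequence[float]]) -> int:
    """Return the index of the caption embedding closest to the image embedding."""
    scores = [cosine(image_emb, emb) for emb in caption_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def encode_with_open_clip(
    model, image_preprocess, image_path: str, captions: Sequence[str], device: str = "cpu"
) -> Tuple[List[float], List[List[float]]]:
    """Embed one image and several captions with an open_clip model.

    Defined for illustration only; it is not called in this sketch.
    """
    import open_clip
    import torch
    from PIL import Image

    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    image = image_preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    tokens = tokenizer(list(captions)).to(device)
    with torch.no_grad():
        image_emb = model.encode_image(image)[0].tolist()
        caption_embs = model.encode_text(tokens).tolist()
    return image_emb, caption_embs


# Synthetic embeddings stand in for real model outputs.
image_emb = [1.0, 0.0]
caption_embs = [[0.9, 0.1], [0.1, 0.9]]
best = pick_caption(image_emb, caption_embs)
print(best)
```

With real weights, pass the `model` and `image_preprocess` objects from the loading snippet to `encode_with_open_clip`, together with an image path and a list such as `[caption, hard_negative]`, then feed the returned embeddings to `pick_caption`.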

🖋️ Citation

If you find this work useful for your research, please consider citing:

@misc{li2026crossmodalmaskedcompositionalconcept,
      title={Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality}, 
      author={Wei Li and Zhen Huang and Xinmei Tian},
      year={2026},
      eprint={2606.13288},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.13288}, 
}

Model tree for hiker-lw/MACCO

Base model: openai/clip-vit-base-patch16
