CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs
The teacher MLLM (Qwen1.5-4B + SigLIP-so400m), trained with LLaVA-style visual instruction tuning. It serves as the distillation teacher for CompoDistill-2B and can be passed directly to scripts/train/dpt.sh or dft.sh via --pretrained_teacher_model_path.
Released with the paper CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs (arXiv:2510.12184). Training and evaluation code: https://github.com/ptkjw1997/CompoDistill
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoImageProcessor

repo = "JiwanKim/CompoDistill-Teacher-4B"

# trust_remote_code=True loads the custom model class shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False)
image_processor = AutoImageProcessor.from_pretrained(repo)

# Ask a question about a local image; chat() is provided by the repo's remote code.
image = Image.open("example.jpg")
print(model.chat("What is happening in this image?", tokenizer,
                 image=image, image_processor=image_processor))
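To use the checkpoint as a distillation teacher rather than for inference, one option is to download it locally and pass that directory to the training script. The sketch below is illustrative only: `download_teacher` and `build_train_command` are hypothetical helpers, not part of the CompoDistill repository. It assumes the scripts are launched with `bash` and that `--pretrained_teacher_model_path` accepts a local directory. Any other arguments the scripts require are not shown here and would go in `extra_args`.

```python
import shlex


def download_teacher(repo_id="JiwanKim/CompoDistill-Teacher-4B"):
    """Fetch the checkpoint into the local Hugging Face cache and return its directory."""
    # Imported lazily so the command helper below works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id)


def build_train_command(script, teacher_path, extra_args=()):
    """Assemble the launch command as an argument list (hypothetical helper)."""
    # Assumption: the script is run with bash and takes the teacher path via this flag.
    return ["bash", script, "--pretrained_teacher_model_path", teacher_path, *extra_args]


if __name__ == "__main__":
    # Placeholder path; call download_teacher() instead to fetch the real weights.
    cmd = build_train_command("scripts/train/dpt.sh", "/path/to/CompoDistill-Teacher-4B")
    print(shlex.join(cmd))
```

Building the command as a list and printing it with `shlex.join` lets you inspect the exact invocation before handing it to `subprocess.run` or pasting it into a shell.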
@article{kim2025compodistill,
  title={CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs},
  author={Kim, Jiwan and Kim, Kibum and Seo, Sangwoo and Park, Chanyoung},
  journal={arXiv preprint arXiv:2510.12184},
  year={2025}
}