AKI Model Card
AKI is the official checkpoint for the paper "Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs".
AKI is a multimodal foundation model that unlocks the causal attention in the LLM into modality-mutual attention (MMA), enabling the earlier modality (images) to incorporate information from the later modality (text). This addresses vision-language misalignment without introducing additional parameters or increasing training time.
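As a rough illustration of the MMA idea described above (not the paper's actual implementation; the token layout and masking convention here are assumptions), a causal attention mask can be "unlocked" so that image positions may also attend to the later text positions:

```python
# Illustrative sketch of a modality-mutual attention (MMA) mask, based only on the
# description above: keep the usual causal mask, but let image positions (the earlier
# modality) also attend to the later text positions. The [image tokens][text tokens]
# layout and boolean-mask convention are assumptions, not the paper's implementation.
import torch


def mma_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Return a boolean attention mask where True means "query may attend to key"."""
    n = num_image_tokens + num_text_tokens
    # Standard causal (lower-triangular) mask over the whole sequence.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # "Unlock" the image rows: image queries may also see the later text keys.
    mask[:num_image_tokens, num_image_tokens:] = True
    return mask


# Example: 4 image tokens followed by 6 text tokens.
print(mma_mask(4, 6).int())
```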
Model Details
Model Descriptions
- Vision Encoder: google/siglip-so400m-patch14-384
- Vision-Language Connector: Perceiver Resampler
- Language Decoder (LLM): microsoft/Phi-3.5-mini-instruct
- Pretraining Datasets: Blip3-kale and Blip3-OCR-200m
- SFT Datasets: VQAv2, GQA, VSR, OCRVQA, A-OKVQA, ScienceQA, RefCOCO, RefCOCOg, RefCOCO+, VisualGnome, LLaVA-150k
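The components above are wired in the usual encoder, resampler, decoder pipeline. The sketch below is schematic only; the class and attribute names are illustrative placeholders and do not correspond to the actual modules in the AKI codebase:

```python
# Schematic of how the listed components fit together; the class and attribute names
# are illustrative placeholders, not the actual modules in the AKI codebase.
import torch
import torch.nn as nn


class AKIPipelineSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, resampler: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. SigLIP-SO400M, patch14, 384px input
        self.resampler = resampler            # Perceiver Resampler: patch features -> fixed-size visual tokens
        self.llm = llm                        # e.g. Phi-3.5-mini-instruct decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_features = self.vision_encoder(pixel_values)  # (B, num_patches, d_vision)
        visual_tokens = self.resampler(patch_features)      # (B, num_latents, d_llm)
        # Visual tokens precede the text embeddings; the decoder then runs with
        # modality-mutual attention rather than a purely causal mask.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```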
Model Sources
How to Use
Input Format
Given the nature of the training data, the AKI model is best suited for prompts using the chat format as follows:
```
<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>
<|user|>
<image>
Describe the scene of this image.
<|end|>
<|assistant|>
The image captures a beautiful autumn day in a park, with a pathway covered in a vibrant carpet of fallen leaves. The leaves are in various shades of red, orange, yellow, and brown, creating a warm and colorful atmosphere. The path is lined with trees displaying beautiful autumn foliage, adding to the picturesque setting. ...
```
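For reference, this chat-format prompt can be assembled with plain string formatting. The sketch below is illustrative; the `<image>` placeholder marks where the model's own preprocessing injects visual tokens, so follow the notebook for the exact handling:

```python
# Assemble the chat-format prompt shown above with plain string formatting.
# The <image> placeholder marks where the model's preprocessing injects visual tokens;
# its exact handling is model-specific, so treat this as an illustrative sketch.
SYSTEM_MESSAGE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(question: str) -> str:
    return (
        f"<|system|>\n{SYSTEM_MESSAGE}<|end|>\n"
        f"<|user|>\n<image>\n{question}\n<|end|>\n"
        "<|assistant|>\n"
    )


print(build_prompt("Describe the scene of this image."))
```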
Inference Example
Please refer to the notebook for zero-shot inference. To build a local demo website, please refer to local_demo.py.
For the training scripts, please refer to the GitHub repo.
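As a rough outline of what the zero-shot notebook does, the checkpoint can typically be loaded through Hugging Face `transformers` with `trust_remote_code=True`. The Auto classes, the `pixel_values` keyword, the dtype, and the repo id below are assumptions rather than the confirmed API, so treat this as a sketch and follow the notebook for the exact calls:

```python
# Hedged sketch of zero-shot inference. The Auto classes, the pixel_values keyword,
# the dtype, and the generation settings are assumptions about the remote code;
# the official notebook is authoritative.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "<repo-id-or-local-path-of-this-checkpoint>"  # replace with the actual id

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]

prompt = build_prompt("Describe the scene of this image.")  # see the snippet above
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        pixel_values=pixel_values,  # keyword name is an assumption of the remote code
        max_new_tokens=256,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```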
Evaluation Results
Main Comparisons with the Same Configurations (Table 1)
| Method | MME<sup>P</sup> | MME<sup>C</sup> | MMB | SEED<sup>I</sup> | LLaVA<sup>W</sup> | MMMU | MathV<sup>mini</sup> | POPE | MM-Vet | RealWorldQA | CV-Bench<sup>2D</sup> | CV-Bench<sup>3D</sup> |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (I&T)<sub>PT</sub> + (I&T)<sub>SFT</sub> | 1226.3 | 258.2 | 64.9 | 64.1 | 47.0 | 31.1 | 24.2 | 79.8 | 24.3 | 50.6 | 45.2 | 54.3 |
| CCA [Xing et al., 2024] | 1212.7 | 243.6 | 67.4 | 65.3 | 54.0 | 34.6 | 25.6 | 81.9 | 29.0 | 52.7 | 56.0 | 62.8 |
| (w/o T&I)<sub>PT</sub> | 1046.3 | 226.4 | 31.7 | 45.1 | 38.1 | 27.2 | 23.8 | 65.0 | 17.2 | 40.1 | 53.2 | 54.8 |
| (w/o I&T)<sub>PT</sub> | 1013.2 | 208.6 | 32.0 | 43.3 | 37.9 | 27.7 | 22.4 | 70.4 | 20.6 | 39.5 | 55.4 | 53.0 |
| (w/o T&I)<sub>SFT</sub> | 1194.8 | 289.3 | 58.5 | 61.1 | 40.2 | 28.0 | 21.9 | 79.0 | 22.8 | 47.8 | 41.4 | 63.0 |
| (w/o I&T)<sub>SFT</sub> | 1166.2 | 264.3 | 58.4 | 60.8 | 36.9 | 26.7 | 23.1 | 76.8 | 20.4 | 46.9 | 43.3 | 61.2 |
| DOT (Ours) | 1267.8 | 251.4 | 43.8 | 54.7 | 47.5 | 30.7 | 25.6 | 82.7 | 25.0 | 50.5 | 52.2 | 58.1 |
| MMA (Ours) | 1363.7 | 315.4 | 71.8 | 67.1 | 59.6 | 37.3 | 26.4 | 82.7 | 30.2 | 52.3 | 57.8 | 64.1 |
| Improvements | 10.9% | 29.5% | 4.3% | 2.8% | 10.4% | 7.8% | 3.1% | 1% | 4.1% | - | 3.2% | 2.1% |
AKI-4B (Table 2)
| Model | MME<sup>P</sup> | MME<sup>C</sup> | MMB | SEED<sup>I</sup> | LLaVA<sup>W</sup> | MMMU | MathV<sup>mini</sup> | POPE | MM-Vet | RealWorldQA | CV-Bench<sup>2D</sup> | CV-Bench<sup>3D</sup> |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AKI-4B | 1491.9 | 362.9 | 73.1 | 69.4 | 74.6 | 38.7 | 32.1 | 86.9 | 40.8 | 58.9 | 62.1 | 71.8 |
Ethical Considerations
Note: This section is mainly adapted from the xgen-mm model cards.
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety.
License
Our code and weights are released under the CC-BY-NC 4.0 license.
The copyrights of the pre-training and fine-tuning data remain with their original owners.
Citations
```bibtex
@misc{wywang2025AKI,
      title={Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs},
      author={Wei-Yao Wang and Zhao Wang and Helen Suzuki and Yoshiyuki Kobayashi},
      year={2025},
      eprint={2503.02597},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.02597},
}
```