AKI Model Card

AKI is the official checkpoint for the paper "Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs". AKI is a multimodal foundation model that unlocks the causal attention in the LLM into modality-mutual attention (MMA), which enables the earlier modality (images) to incorporate information from the later modality (text). This addresses vision-language misalignment without introducing additional parameters or increasing training time.
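As a rough illustration of the idea (not the paper's implementation), the sketch below builds an attention mask in PyTorch in which the image tokens, which precede the text in the sequence, are additionally allowed to attend to the text tokens, while the text keeps its usual causal mask. The token counts and the mask convention (True = may attend) are assumed here purely for illustration.

```python
import torch

def modality_mutual_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Illustrative attention mask (True = may attend).

    Plain causal attention is a lower-triangular mask over the whole
    sequence, so image tokens (which come first) can never see the text.
    Modality-mutual attention (MMA) additionally lets the image tokens
    attend to the text tokens, while the text tokens stay causal.
    """
    n = num_image_tokens + num_text_tokens
    mask = torch.ones(n, n).tril().bool()          # causal baseline
    mask[:num_image_tokens, num_image_tokens:] = True  # unlock image -> text attention
    return mask

print(modality_mutual_mask(3, 4).int())
```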

Model Details

Model Description

Model Sources

How to Use

Input Format

Given the nature of the training data, the AKI model is best suited for prompts in the following chat format:

<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>
<|user|>
<image>
Describe the scene of this image.
<|end|>
<|assistant|>

The image captures a beautiful autumn day in a park, with a pathway covered in a vibrant carpet of fallen leaves. The leaves are in various shades of red, orange, yellow, and brown, creating a warm and colorful atmosphere. The path is lined with trees displaying beautiful autumn foliage, adding to the picturesque setting. ...
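For programmatic use, the same template can be assembled as a plain string. The helper below only illustrates the format shown above (the <|system|>/<|user|>/<|assistant|> markers and the <image> placeholder); it is not part of the released code.

```python
SYSTEM = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions.")

def build_prompt(question: str, system: str = SYSTEM) -> str:
    # Assemble the chat template shown above; <image> marks where the image
    # features are inserted, and the reply is generated after <|assistant|>.
    return (f"<|system|>\n{system}<|end|>\n"
            f"<|user|>\n<image>\n{question}\n<|end|>\n"
            f"<|assistant|>\n")

print(build_prompt("Describe the scene of this image."))
```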

Inference Example

Please refer to the notebook for zero-shot inference. To build a local demo website, please refer to local_demo.py.

For the training scripts, please refer to the GitHub repo.
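If you prefer a quick script over the notebook, a sketch along the following lines may work, assuming the checkpoint can be loaded through transformers with trust_remote_code and exposes a processor that handles both image and text inputs. The repo id, image path, and generation settings below are placeholders; the notebook and local_demo.py are the authoritative references.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "Sony/AKI-4B"  # placeholder repo id; replace with the actual checkpoint name

# Assumes the checkpoint ships custom modeling code usable via trust_remote_code.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, trust_remote_code=True
).eval()

prompt = ("<|system|>\nA chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>\n"
          "<|user|>\n<image>\nDescribe the scene of this image.\n<|end|>\n<|assistant|>\n")
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```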

Evaluation Results

Main Comparisons with the Same Configurations (Table 1)

| Method | MME-P | MME-C | MMB | SEED-I | LLaVA-W | MMMU | MathV-mini | POPE | MM-Vet | RealWorldQA | CV-Bench-2D | CV-Bench-3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (I&T)PT + (I&T)SFT | 1226.3 | 258.2 | 64.9 | 64.1 | 47.0 | 31.1 | 24.2 | 79.8 | 24.3 | 50.6 | 45.2 | 54.3 |
| CCA [Xing et al., 2024] | 1212.7 | 243.6 | 67.4 | 65.3 | 54.0 | 34.6 | 25.6 | 81.9 | 29.0 | 52.7 | 56.0 | 62.8 |
| (w/o T&I)PT | 1046.3 | 226.4 | 31.7 | 45.1 | 38.1 | 27.2 | 23.8 | 65.0 | 17.2 | 40.1 | 53.2 | 54.8 |
| (w/o I&T)PT | 1013.2 | 208.6 | 32.0 | 43.3 | 37.9 | 27.7 | 22.4 | 70.4 | 20.6 | 39.5 | 55.4 | 53.0 |
| (w/o T&I)SFT | 1194.8 | 289.3 | 58.5 | 61.1 | 40.2 | 28.0 | 21.9 | 79.0 | 22.8 | 47.8 | 41.4 | 63.0 |
| (w/o I&T)SFT | 1166.2 | 264.3 | 58.4 | 60.8 | 36.9 | 26.7 | 23.1 | 76.8 | 20.4 | 46.9 | 43.3 | 61.2 |
| DOT (Ours) | 1267.8 | 251.4 | 43.8 | 54.7 | 47.5 | 30.7 | 25.6 | 82.7 | 25.0 | 50.5 | 52.2 | 58.1 |
| MMA (Ours) | 1363.7 | 315.4 | 71.8 | 67.1 | 59.6 | 37.3 | 26.4 | 82.7 | 30.2 | 52.3 | 57.8 | 64.1 |
| Improvements | 10.9% | 29.5% | 4.3% | 2.8% | 10.4% | 7.8% | 3.1% | 1% | 4.1% | - | 3.2% | 2.1% |

AKI-4B (Table 2)

| Model | MME-P | MME-C | MMB | SEED-I | LLaVA-W | MMMU | MathV-mini | POPE | MM-Vet | RealWorldQA | CV-Bench-2D | CV-Bench-3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AKI-4B | 1491.9 | 362.9 | 73.1 | 69.4 | 74.6 | 38.7 | 32.1 | 86.9 | 40.8 | 58.9 | 62.1 | 71.8 |

Ethical Considerations

Note: This section is largely adapted from the xgen-mm model cards.

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety.

License

Our code and weights are released under the CC-BY-NC 4.0 license.

The copyrights of the pre-training and fine-tuning data remain with the original data owners.

Citations

@misc{wywang2025AKI,
      title={Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs}, 
      author={Wei-Yao Wang and Zhao Wang and Helen Suzuki and Yoshiyuki Kobayashi},
      year={2025},
      eprint={2503.02597},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.02597}, 
}