AKI Model Card

AKI is the official checkpoint for the paper "Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs". AKI is a multimodal foundation model that unlocks the causal attention in the LLM into modality-mutual attention (MMA), which enables the earlier modality (images) to incorporate information from the later modality (text). This addresses vision-language misalignment without introducing additional parameters or increasing training time.
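As a rough illustration of the idea (not the paper's implementation), the sketch below builds an attention mask in PyTorch in which the image tokens, which precede the text in the sequence, are additionally allowed to attend to the text tokens, while the text keeps its usual causal mask. The token counts and the mask convention (True = may attend) are assumed here purely for illustration.

```python
import torch

def modality_mutual_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Illustrative attention mask (True = may attend).

    Plain causal attention is a lower-triangular mask over the whole
    sequence, so image tokens (which come first) can never see the text.
    Modality-mutual attention (MMA) additionally lets the image tokens
    attend to the text tokens, while the text tokens stay causal.
    """
    n = num_image_tokens + num_text_tokens
    mask = torch.ones(n, n).tril().bool()          # causal baseline
    mask[:num_image_tokens, num_image_tokens:] = True  # unlock image -> text attention
    return mask

print(modality_mutual_mask(3, 4).int())
```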

Model Details

Model Description

Model Sources

How to Use

Input Format

Given the nature of the training data, the AKI model is best suited for prompts in the following chat format:

<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>
<|user|>
<image>
Describe the scene of this image.
<|end|>
<|assistant|>

The image captures a beautiful autumn day in a park, with a pathway covered in a vibrant carpet of fallen leaves. The leaves are in various shades of red, orange, yellow, and brown, creating a warm and colorful atmosphere. The path is lined with trees displaying beautiful autumn foliage, adding to the picturesque setting. ...
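For programmatic use, the same template can be assembled as a plain string. The helper below only illustrates the format shown above (the <|system|>/<|user|>/<|assistant|> markers and the <image> placeholder); it is not part of the released code.

```python
SYSTEM = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions.")

def build_prompt(question: str, system: str = SYSTEM) -> str:
    # Assemble the chat template shown above; <image> marks where the image
    # features are inserted, and the reply is generated after <|assistant|>.
    return (f"<|system|>\n{system}<|end|>\n"
            f"<|user|>\n<image>\n{question}\n<|end|>\n"
            f"<|assistant|>\n")

print(build_prompt("Describe the scene of this image."))
```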

Inference Example

Please refer to the notebook for zero-shot inference. To build a local demo website, please refer to local_demo.py.

For the training scripts, please refer to the GitHub repo.
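If you prefer a quick script over the notebook, a sketch along the following lines may work, assuming the checkpoint can be loaded through transformers with trust_remote_code and exposes a processor that handles both image and text inputs. The repo id, image path, and generation settings below are placeholders; the notebook and local_demo.py are the authoritative references.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "Sony/AKI-4B"  # placeholder repo id; replace with the actual checkpoint name

# Assumes the checkpoint ships custom modeling code usable via trust_remote_code.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, trust_remote_code=True
).eval()

prompt = ("<|system|>\nA chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>\n"
          "<|user|>\n<image>\nDescribe the scene of this image.\n<|end|>\n<|assistant|>\n")
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```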

Evaluation Results

Main Comparisons with the Same Configurations (Table 1)

| Method | MME-P | MME-C | MMB | SEED-I | LLaVA-W | MMMU | MathV-mini | POPE | MM-Vet | RealWorldQA | CV-Bench-2D | CV-Bench-3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (I&T)PT + (I&T)SFT | 1226.3 | 258.2 | 64.9 | 64.1 | 47.0 | 31.1 | 24.2 | 79.8 | 24.3 | 50.6 | 45.2 | 54.3 |
| CCA [Xing et al., 2024] | 1212.7 | 243.6 | 67.4 | 65.3 | 54.0 | 34.6 | 25.6 | 81.9 | 29.0 | 52.7 | 56.0 | 62.8 |
| (w/o T&I)PT | 1046.3 | 226.4 | 31.7 | 45.1 | 38.1 | 27.2 | 23.8 | 65.0 | 17.2 | 40.1 | 53.2 | 54.8 |
| (w/o I&T)PT | 1013.2 | 208.6 | 32.0 | 43.3 | 37.9 | 27.7 | 22.4 | 70.4 | 20.6 | 39.5 | 55.4 | 53.0 |
| (w/o T&I)SFT | 1194.8 | 289.3 | 58.5 | 61.1 | 40.2 | 28.0 | 21.9 | 79.0 | 22.8 | 47.8 | 41.4 | 63.0 |
| (w/o I&T)SFT | 1166.2 | 264.3 | 58.4 | 60.8 | 36.9 | 26.7 | 23.1 | 76.8 | 20.4 | 46.9 | 43.3 | 61.2 |
| DOT (Ours) | 1267.8 | 251.4 | 43.8 | 54.7 | 47.5 | 30.7 | 25.6 | 82.7 | 25.0 | 50.5 | 52.2 | 58.1 |
| MMA (Ours) | 1363.7 | 315.4 | 71.8 | 67.1 | 59.6 | 37.3 | 26.4 | 82.7 | 30.2 | 52.3 | 57.8 | 64.1 |
| Improvements | 10.9% | 29.5% | 4.3% | 2.8% | 10.4% | 7.8% | 3.1% | 1% | 4.1% | - | 3.2% | 2.1% |

AKI-4B (Table 2)

| Model | MME-P | MME-C | MMB | SEED-I | LLaVA-W | MMMU | MathV-mini | POPE | MM-Vet | RealWorldQA | CV-Bench-2D | CV-Bench-3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AKI-4B | 1491.9 | 362.9 | 73.1 | 69.4 | 74.6 | 38.7 | 32.1 | 86.9 | 40.8 | 58.9 | 62.1 | 71.8 |

Ethical Considerations

Note: This section is largely adapted from the xgen-mm model cards.

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety.

License

Our code and weights are released under the CC-BY-NC 4.0 license.

The copyrights of the pre-training and fine-tuning data remain with the original data owners.

Citations

@misc{wywang2025AKI,
      title={Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs}, 
      author={Wei-Yao Wang and Zhao Wang and Helen Suzuki and Yoshiyuki Kobayashi},
      year={2025},
      eprint={2503.02597},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.02597}, 
}