AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
[Project Page] [Paper] [Demo] [Model]
Authors: Fei Zhao*, Taotian Pang*, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai
News and Updates
- [5/24] 🔥 We released AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability. Check out the paper and demo.
Model Zoo
| Model | LLM | Vision Backbone | Pre-training | Instruct-tuning |
|---|---|---|---|---|
| AlignGPT-7B | Vicuna 7B | CLIP ViT-L/14 | aligngpt-7b-pretrain | aligngpt-7b |
| AlignGPT-13B | Vicuna 13B | CLIP ViT-L/14 | aligngpt-13b-pretrain | aligngpt-13b |
| AlignGPT-LLaMA2 | LLaMA-2-7B-Chat | CLIP ViT-L/14 | To be released | To be released |
| AlignGPT-LLaMA3 | LLaMA-3-8B-Base | CLIP ViT-L/14 | To be released | To be released |
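The released checkpoints above are hosted as Hugging Face Hub repositories. As a minimal sketch of fetching one locally with `huggingface_hub` (the `repo_id` below is an assumption; use the exact id linked in the table):

```python
# Minimal sketch: download an AlignGPT checkpoint from the Hugging Face Hub.
# NOTE: the repo_id is hypothetical -- replace it with the id linked in the
# Model Zoo table above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="aligngpt/aligngpt-7b",         # hypothetical org/name
    local_dir="./checkpoints/aligngpt-7b",  # where the weight files are stored
)
print(f"Checkpoint files downloaded to: {local_path}")
```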
Performance
| Model | VQAv2 | GQA | VizWiz | SQA | T-VQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AlignGPT-7B | 79.1 | 62.9 | 54.2 | 68.5 | 58.4 | 86.0 | 1527.4 | 67.3 | 59.9 | 66.5 | 68.4 | 30.8 |
| AlignGPT-13B | 80.0 | 63.6 | 56.4 | 70.3 | 60.2 | 86.2 | 1572.0 | 69.5 | 63.7 | 67.8 | 75.2 | 35.6 |
Citation
If you find AlignGPT useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{zhao2024aligngpt,
      title={AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability},
      author={Fei Zhao and Taotian Pang and Chunhui Li and Zhen Wu and Junjie Guo and Shangyu Xing and Xinyu Dai},
      year={2024},
      eprint={2405.14129},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
License
The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on it should not be used outside of research purposes.