ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
ALLaVA is a project that provides a large-scale GPT4V-synthesized dataset for training large vision-language models (LVLMs).
Paper • Demo • GitHub
ALLaVA-4V Dataset
ALLaVA-Phi3-mini-128k • ALLaVA-StableLM2-1_6B • ALLaVA-Phi2-2_7B
Benchmark Results
Our models ALLaVA-Phi3-mini-128k, ALLaVA-StableLM2-1_6B and ALLaVA-Phi2-2_7B achieve competitive results on 17 benchmarks.
Models | Vicuna-80 | GQA | HallusionBench | MME-P | MMVP | TouchStone | TextVQA | MME-C | MathVista | MM-Vet | MMMU-val | SQA (img) | LLaVA (In-the-Wild) | MLLM-Bench | MMB-en | MMB-cn | SEEDBench (img, v1) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Large VLMs | |||||||||||||||||
BLIP-2 | - | - | - | - | - | - | - | - | - | 22.4 | 34.4 | - | - | 3.0* | - | - | 49.7 |
InstructBLIP | - | 49.5 | - | - | - | - | - | - | - | 25.6 | - | - | 58.2 | - | 44.0 | - | - |
Qwen-VL-Chat | - | 57.5 | - | 1487.6 | - | - | 61.5 | 360.7 | - | 31.1 | - | 68.2 | - | - | 60.6 | 56.7 | 65.4 |
LLaVA-1.5-7B | 13.8* | 62.0 | 36.6* | 1504.4* | 24.7* | 594.9* | 58.2 | 324.6* | 25.0* | 31.1 | 35.1* | 66.8 | 65.4 | 23.0* | 64.3 | 58.3 | 66.1 |
LLaVA-1.5-13B | 22.5 | 63.3 | 36.5* | 1531.3 | 38.0* | 617.7* | 61.3 | 295.4 | 28.3* | 35.4 | 34.4* | 71.6 | 72.5 | - | 67.7 | 63.6 | 68.2 |
LVIS-7B | - | 62.6 | - | - | - | - | 58.7 | - | - | 31.5 | - | - | 67.0 | 29.0* | 66.2 | - | - |
LVIS-13B | - | 63.6* | - | - | - | - | 62.5* | - | - | 37.4* | - | - | 71.3* | - | 68.0* | - | - |
ShareGPT4V-7B | 13.8* | 63.3 | 36.0* | 1540.1* | 34.0* | 637.2* | 60.4 | 346.1* | 24.7* | 37.6 | 35.4* | 68.4* | 72.6 | 30.2* | 68.8 | 61.0* | 69.7 |
ShareGPT4V-13B | 17.5* | 64.8 | 39.0* | 1576.1* | 35.3* | 648.7* | 62.2 | 309.3* | 28.8* | 43.1 | 35.6* | 70.0* | 79.9 | 35.5* | 71.2 | 61.7* | 70.8 |
4B-scale Lite VLMs | |||||||||||||||||
MobileVLM-v2 | 5.0* | 61.1 | 30.8* | 1440.5 | 18.7* | 541.0* | 57.5 | 261.8* | 28.3* | 26.1* | 30.8* | 70.0 | 53.2* | 15.7* | 63.2 | 43.2* | 64.5* |
Mipha-3B | 16.2* | 63.9 | 34.3* | 1488.9 | 32.0* | 619.0* | 56.6 | 285.0* | 27.8* | 33.5* | 35.8* | 70.9 | 64.7* | 23.1* | 69.7 | 42.9* | 71.2* |
TinyLLaVA | 15.6* | 62.1 | 37.2* | 1465.5* | 33.3* | 663.5* | 60.3 | 281.1* | 30.3* | 37.5 | 38.4 | 73.0 | 70.8* | 29.8* | 69.7* | 42.8* | 70.4* |
Ours | |||||||||||||||||
ALLaVA-Phi2 | 49.4 | 48.8 | 24.8 | 1316.2 | 36.0 | 632.0 | 49.5 | 301.8 | 27.4 | 32.2 | 35.3 | 67.6 | 69.4 | 43.6 | 64.0 | 40.8 | 65.2 |
ALLaVA-StableLM2 | 38.8 | 49.8 | 25.3 | 1311.7 | 34.0 | 655.2 | 51.7 | 257.9 | 27.7 | 31.7 | 33.3 | 64.7 | 72.0 | 39.3 | 64.6 | 49.8 | 65.7 |
ALLaVA-Phi3 | 56.9 | 52.2 | 48.1 | 1382.3 | 32.7 | 667.8 | 53.0 | 347.1 | 32.9 | 37.8 | 41.1 | 64.0 | 68.5 | 54.8 | 68.1 | 55.3 | 69.0 |
* denotes results from our own evaluation. Bold numbers are the best results among all 4B-scale LVLMs. Details of each benchmark are given in Table 4 of our technical report.
Inference
All models can be loaded from Hugging Face with .from_pretrained().
Check out the example scripts and make sure you have the same outputs as shown in the scripts.
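As a rough illustration of loading with .from_pretrained(), the sketch below builds a Hub repo id and loads a tokenizer and model through the `transformers` Auto classes. The `FreedomIntelligence/<model name>` repo id pattern, the helper names `repo_id` and `load_allava`, and the use of `trust_remote_code=True` are assumptions made for this example; the official example scripts remain the reference for the exact loading arguments and prompt format.

```python
# Sketch only: the repo id pattern and loading arguments are assumptions,
# not copied from the official example scripts.
HF_ORG = "FreedomIntelligence"  # assumed organization name on the Hub

MODEL_NAMES = (
    "ALLaVA-Phi3-mini-128k",
    "ALLaVA-StableLM2-1_6B",
    "ALLaVA-Phi2-2_7B",
)


def repo_id(model_name: str) -> str:
    """Return the assumed Hub repo id for one of the released models."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model: {model_name}")
    return f"{HF_ORG}/{model_name}"


def load_allava(model_name: str):
    """Load tokenizer and model; needs `transformers` and network access."""
    # Imported lazily so that repo_id() works without the dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = repo_id(model_name)
    # trust_remote_code=True is assumed to be required because the model
    # classes are expected to ship inside the model repository.
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
    return tokenizer, model
```

Calling `load_allava("ALLaVA-Phi2-2_7B")` downloads the weights on first use, so it is best run once the environment from the example scripts is set up.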
Training
Data
ALLaVA uses 1.0M samples for pretraining (PT) and 1.5M samples for fine-tuning (FT).
Code
The training code is largely based on LLaVA-v1.5. We wholeheartedly express our gratitude for their invaluable contributions to open-sourcing LVLMs.
Hyperparameters
Global Batch Size | ZeRO Stage | Optimizer | Max LR | Min LR | Scheduler | Weight decay |
---|---|---|---|---|---|---|
256 (PT) / 128 (FT) | 1 | AdamW | 2e-5 | 2e-6 | CosineAnnealingWarmRestarts | 0 |
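To make the Max LR, Min LR and Scheduler columns concrete, here is a minimal pure-Python sketch of the cosine annealing with warm restarts formula, using the 2e-5 and 2e-6 values from the table. The restart period `t_0` and the function name `cosine_warm_restart_lr` are hypothetical; the table does not state the restart period used in training, and no warmup is modeled.

```python
import math

MAX_LR = 2e-5  # "Max LR" from the hyperparameter table
MIN_LR = 2e-6  # "Min LR" from the hyperparameter table


def cosine_warm_restart_lr(step: int, t_0: int,
                           max_lr: float = MAX_LR,
                           min_lr: float = MIN_LR) -> float:
    """Learning rate at `step` for a fixed restart period `t_0`.

    `t_0` is a hypothetical value chosen by the caller; it is not given
    in the table above.
    """
    if t_0 <= 0:
        raise ValueError("t_0 must be positive")
    t_cur = step % t_0  # position inside the current cosine cycle
    cosine = math.cos(math.pi * t_cur / t_0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + cosine)


if __name__ == "__main__":
    # Print a few points of one cycle, assuming t_0 = 1000 steps.
    for s in (0, 250, 500, 750, 999, 1000):
        print(s, cosine_warm_restart_lr(s, t_0=1000))
```

In PyTorch, the same shape should correspond to `torch.optim.AdamW(params, lr=2e-5, weight_decay=0.0)` combined with `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=t_0, eta_min=2e-6)`, again with `t_0` left as an assumption.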
The LM backbone and the projector are trainable, while the vision encoder is kept frozen. These settings are the same for both the PT and FT stages.
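The trainability rule can be sketched as a mapping from parameter names to a trainable flag. The prefixes `vision_encoder.`, `projector.` and `language_model.` are hypothetical names picked for this example; the actual attribute names depend on the training code.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical prefix of the frozen module; real names depend on the code base.
FROZEN_PREFIXES: Tuple[str, ...] = ("vision_encoder.",)


def trainable_flags(param_names: Iterable[str],
                    frozen_prefixes: Tuple[str, ...] = FROZEN_PREFIXES
                    ) -> Dict[str, bool]:
    """Map each parameter name to True (trainable) or False (frozen)."""
    return {name: not name.startswith(frozen_prefixes) for name in param_names}


# Illustrative parameter names only, not taken from a real checkpoint.
example_names = [
    "vision_encoder.layers.0.weight",
    "projector.linear.weight",
    "language_model.layers.0.weight",
]
flags = trainable_flags(example_names)

# With a PyTorch model, the flags could be applied like this:
#   for name, param in model.named_parameters():
#       param.requires_grad = flags[name]
```

Because the same flags apply to both stages, the mapping would only need to be computed once and reused for PT and FT.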
ALLaVA-4V Data
Most of the training data comes from ALLaVA-4V. See here to prepare it for training.
Contributors
Project Leader: Guiming Hardy Chen
Data: Shunian Chen, Junying Chen, Xiangbo Wu
Evaluation: Ruifei Zhang
Deployment: Xiangbo Wu, Zhiyi Zhang
Advising: Zhihong Chen, Benyou Wang
Others: Jianquan Li, Xiang Wan
Citation
If you find our data useful, please consider citing our work! We are FreedomIntelligence from the Shenzhen Research Institute of Big Data and The Chinese University of Hong Kong, Shenzhen.
@article{chen2024allava,
title={ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model},
author={Chen, Guiming Hardy and Chen, Shunian and Zhang, Ruifei and Chen, Junying and Wu, Xiangbo and Zhang, Zhiyi and Chen, Zhihong and Li, Jianquan and Wan, Xiang and Wang, Benyou},
journal={arXiv preprint arXiv:2402.11684},
year={2024}
}