# When Do We Not Need Larger Vision Models?

## Model
This is a LLaVA-v1.5-7B model trained with S2-Wrapper, a simple approach that enables any vision model to perceive high-resolution images by running it on multiple image scales and merging the resulting features. For this model we use image resolutions of up to 1008x1008.
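As a rough illustration (not the official `s2wrapper` implementation), the sketch below shows the multi-scale idea: the input is resized to each scale, split into tiles at the backbone's base resolution, each tile is encoded independently, the tile features are stitched back together, pooled to the base feature grid, and concatenated across scales. The `vision_model` callable, the scale list, and the assumption that the backbone returns only patch tokens are placeholders.

```python
import torch
import torch.nn.functional as F

def multiscale_features(vision_model, image, scales=(336, 672, 1008)):
    """Illustrative multi-scale feature extraction in the spirit of S2-Wrapper.

    Assumes `vision_model` maps a (B, 3, 336, 336) batch to (B, N, C) patch
    features (no CLS token) and that every scale is a multiple of the base.
    """
    base = scales[0]
    outputs = []
    for s in scales:
        x = F.interpolate(image, size=(s, s), mode="bilinear", align_corners=False)
        n = s // base  # number of tiles per side at this scale
        # Split into n*n tiles of the base resolution.
        tiles = x.unfold(2, base, base).unfold(3, base, base)        # (B, 3, n, n, base, base)
        tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, base, base)
        feats = vision_model(tiles)                                   # (B*n*n, N, C)
        b = image.shape[0]
        g = int(feats.shape[1] ** 0.5)                                # patch grid size per tile
        c = feats.shape[-1]
        # Re-assemble tiles into one feature map covering the full image at scale s.
        feats = feats.reshape(b, n, n, g, g, c).permute(0, 5, 1, 3, 2, 4)
        feats = feats.reshape(b, c, n * g, n * g)
        # Pool back to the base grid so all scales share the same token count.
        feats = F.adaptive_avg_pool2d(feats, (g, g))
        outputs.append(feats.flatten(2).transpose(1, 2))              # (B, N, C)
    return torch.cat(outputs, dim=-1)                                 # (B, N, C * len(scales))
```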
## Training

The training pipeline and dataset follow LLaVA-v1.5 exactly. The model is fine-tuned with LoRA, as sketched below.
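For reference, here is a minimal LoRA configuration using the Hugging Face `peft` library. The rank, alpha, and target modules are illustrative values taken from the common LLaVA-v1.5 LoRA recipe, not read from this checkpoint's training script, so treat them as assumptions.

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup (hyperparameters assumed, not confirmed for this checkpoint).
lora_config = LoraConfig(
    r=128,                       # LoRA rank
    lora_alpha=256,              # scaling factor
    lora_dropout=0.05,           # dropout on the LoRA branch
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections of the LLM
    task_type="CAUSAL_LM",
)

# `language_model` would be the LLaMA-2-based LLM backbone of LLaVA:
# peft_model = get_peft_model(language_model, lora_config)
```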
## Benchmarking

| Version | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 7B | full_ft-1e | liuhaotian/llava-v1.5-7b | 78.5 | 50.0 | 58.2 | 36.2 | 25.2 | 64.3 | 65.7 | 31.1 |
| LLaVA-1.5 | 7B | lora-1e | liuhaotian/llava-v1.5-7b-lora | 79.1 | 47.8 | 58.2 | - | - | 66.1 | - | 30.2 |
| LLaVA-1.5-S2 | 7B | lora-1e | this model | 80.0 | 50.1 | 61.0 | 37.7 | 25.3 | 66.2 | 67.9 | 32.4 |
## License
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.