When Do We Not Need Larger Vision Models?

Model

This is a LLaVA-v1.5-7b model trained with S2-Wrapper, a simple approach that enables any pre-trained vision model to perceive high-resolution images by running it on multiple image scales. For this model we use image resolutions of up to 1008x1008.
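To make the idea concrete, below is a minimal sketch of the kind of multi-scale forward pass S2-Wrapper performs; the function name, the assumed (B, H, W, C) backbone output layout, and the pooling choice are illustrative assumptions, not the repository's actual API. With scales (1, 2, 3) on a 336-pixel backbone (LLaVA-v1.5's CLIP ViT-L/14-336), the largest input is 1008x1008, matching this model's setting.

```python
import torch
import torch.nn.functional as F

def s2_multiscale_forward(backbone, image, scales=(1, 2, 3), base_size=336):
    """Illustrative multi-scale forward in the spirit of S2-Wrapper.

    Each scale s resizes the image to s * base_size, splits it into an
    s x s grid of base_size crops, runs the (frozen) backbone on every
    crop, stitches the crop features back together, pools them to the
    base feature-grid size, and concatenates all scales channel-wise.
    """
    features = []
    for s in scales:
        size = base_size * s
        resized = F.interpolate(image, size=(size, size),
                                mode="bilinear", align_corners=False)
        # Split into an s x s grid of base_size crops.
        crops = [
            resized[..., i * base_size:(i + 1) * base_size,
                         j * base_size:(j + 1) * base_size]
            for i in range(s) for j in range(s)
        ]
        # Assumption: backbone returns a (B, H, W, C) feature map per crop.
        crop_feats = [backbone(c) for c in crops]
        # Stitch crop features back into one large feature map.
        rows = [torch.cat(crop_feats[i * s:(i + 1) * s], dim=2) for i in range(s)]
        feat = torch.cat(rows, dim=1)
        # Pool every scale back to the base scale's feature-grid size.
        base_hw = crop_feats[0].shape[1]
        feat = feat.permute(0, 3, 1, 2)            # (B, C, H, W) for pooling
        feat = F.adaptive_avg_pool2d(feat, base_hw)
        features.append(feat.permute(0, 2, 3, 1))  # back to (B, H, W, C)
    # Channel-wise concat across scales: output dim = C * len(scales).
    return torch.cat(features, dim=-1)
```

Because every scale is pooled back to the base grid, the number of output tokens stays the same as a single-scale forward; only the feature dimension grows with the number of scales.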

Training

The training pipeline and dataset exactly follow LLaVA-v1.5. We fine-tune the model with LoRA; a sketch of the adapter setup is given below.
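For reference, here is a minimal sketch of attaching LoRA adapters to the base language model with Hugging Face `peft`. The rank and alpha mirror the public LLaVA-v1.5 LoRA scripts (`lora_r=128`, `lora_alpha=256`), but treat these values and the target-module list as assumptions rather than this model's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# The base LLM of LLaVA-v1.5-7b is Vicuna-7B v1.5 (Llama-2-based).
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Assumed LoRA hyperparameters, following the LLaVA-v1.5 LoRA recipe.
lora_config = LoraConfig(
    r=128,            # adapter rank
    lora_alpha=256,   # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumption)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```

Keeping the base weights frozen and training only the low-rank adapters is what makes the one-epoch schedule in the table below feasible on modest hardware.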

Benchmarking

| Version | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 7B | full_ft-1e | liuhaotian/llava-v1.5-7b | 78.5 | 50.0 | 58.2 | 36.2 | 25.2 | 64.3 | 65.7 | 31.1 |
| LLaVA-1.5 | 7B | lora-1e | liuhaotian/llava-v1.5-7b-lora | 79.1 | 47.8 | 58.2 | - | - | 66.1 | - | 30.2 |
| LLaVA-1.5-S2 | 7B | lora-1e | this model | 80.0 | 50.1 | 61.0 | 37.7 | 25.3 | 66.2 | 67.9 | 32.4 |

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
