# When Do We Not Need Larger Vision Models?

## Model

This is a LLaVA-v1.5-13b model trained with S2-Wrapper, a simple approach to enable any vision model to perceive high-resolution images. We use image resolutions of up to 1008x1008 for this model.
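The core idea of S2-Wrapper is to reuse a pre-trained, fixed-resolution vision encoder at several image scales: the image is interpolated to each scale, split into native-resolution crops, encoded crop by crop, and the per-scale features are pooled back to the base token grid and concatenated channel-wise. The sketch below illustrates this idea in PyTorch under stated assumptions; it is not the official s2wrapper implementation, and the function name, `encoder` interface, and pooling details are illustrative.

```python
import torch
import torch.nn.functional as F


def s2_multiscale_forward(encoder, image, scales=(1, 2, 3), base_size=336):
    """Run a fixed-resolution `encoder` on multiple scales of `image`
    and concatenate the per-scale features channel-wise.

    Assumes `encoder` maps (B, 3, base_size, base_size) pixels to
    (B, N, C) patch tokens, where N is a perfect square.
    """
    batch, channels = image.shape[0], image.shape[1]
    feats = []
    for s in scales:
        # Resize the image to s times the encoder's native resolution.
        scaled = F.interpolate(
            image, size=(base_size * s, base_size * s),
            mode="bilinear", align_corners=False,
        )
        # Split into an s x s grid of native-resolution crops.
        crops = scaled.unfold(2, base_size, base_size).unfold(3, base_size, base_size)
        crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, channels, base_size, base_size)
        out = encoder(crops)  # (B * s * s, N, C)
        n = int(out.shape[1] ** 0.5)
        c = out.shape[2]
        # Stitch crop features back into one large feature map, then pool
        # to the base token grid so every scale yields the same token count.
        out = out.reshape(batch, s, s, n, n, c).permute(0, 5, 1, 3, 2, 4)
        out = out.reshape(batch, c, s * n, s * n)
        out = F.adaptive_avg_pool2d(out, (n, n))
        feats.append(out)
    # (B, N, C * len(scales)): same token count, wider feature dimension.
    return torch.cat(feats, dim=1).flatten(2).transpose(1, 2)
```

Because each scale is pooled back to the base grid before concatenation, the number of visual tokens stays fixed while the feature dimension grows with the number of scales; this is how resolutions up to 1008x1008 can be used without increasing the LLM's sequence length.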

## Training

The training pipeline and dataset follow LLaVA-v1.5 exactly. We fine-tune the model with LoRA; a configuration sketch follows.
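As a rough illustration of what the LoRA stage looks like, the sketch below attaches a Hugging Face `peft` adapter to the Vicuna-13B-v1.5 language backbone that LLaVA-v1.5 builds on; using the backbone directly is a stand-in for the full multimodal model, which the LLaVA training code assembles. The rank, alpha, dropout, and target modules are assumptions for illustration, not values read off this checkpoint; the authoritative hyperparameters are those in the LLaVA-v1.5 LoRA training scripts.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in for the full LLaVA-v1.5 model: the Vicuna-13B-v1.5 backbone
# that the multimodal model is built on.
base_model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")

# Hyperparameters below are illustrative assumptions; see the
# LLaVA-v1.5 training scripts for the actual recipe.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```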

## Benchmarking

| Version | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 13B | full_ft-1e | liuhaotian/llava-v1.5-13b | 80.0 | 53.6 | 61.3 | 36.4 | 27.6 | 67.7 | 68.2 | 36.1 |
| LLaVA-1.5 | 13B | lora-1e | liuhaotian/llava-v1.5-13b-lora | 80.0 | 58.9 | 60.2 | - | - | 68.5 | - | 38.3 |
| LLaVA-1.5-S2 | 13B | lora-1e | this model | 80.9 | 56.0 | 63.1 | 37.4 | 27.8 | 67.9 | 68.9 | 36.4 |

## License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
