
When Do We Not Need Larger Vision Models?

Model

This is a LLaVA-v1.5-7b model trained with S2-Wrapper, a simple mechanism that lets any vision model perceive high-resolution images by extracting and combining features at multiple image scales. This model uses image resolutions of up to 1008x1008.
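
As a rough illustration of how S2-Wrapper wraps a vision tower, the sketch below applies the `s2wrapper` forward function to a CLIP ViT-L/14-336 backbone at scales 1x/2x/3x (336/672/1008 px, matching the 1008x1008 maximum resolution of this model). The backbone, the feature-extraction layer, and the scale list are illustrative assumptions, not the exact configuration used to train this checkpoint.

```python
# Minimal sketch: multi-scale feature extraction with S2-Wrapper.
# Backbone and scales are illustrative assumptions for this checkpoint.
import torch
from transformers import CLIPVisionModel
from s2wrapper import forward as multiscale_forward

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

def forward_features(images):
    # Return patch-token features for a single scale (drop the CLS token).
    outputs = vision_tower(images, output_hidden_states=True)
    return outputs.hidden_states[-2][:, 1:]

images = torch.randn(1, 3, 336, 336)  # base-resolution input

# S2-Wrapper splits the larger scales into base-size crops, runs the same
# backbone on every crop, then merges and concatenates features across scales.
features = multiscale_forward(forward_features, images, scales=[1, 2, 3])
print(features.shape)  # (1, num_patches, hidden_dim * number_of_scales)
```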

Training

The training pipeline and dataset follow LLaVA-v1.5 exactly. The model is fine-tuned with LoRA.
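
For readers unfamiliar with LoRA fine-tuning, the snippet below is a hypothetical illustration using the `peft` library: the base language model name, rank, alpha, and target modules are placeholders and are not restated from the LLaVA-v1.5 recipe.

```python
# Illustrative LoRA setup with peft; hyperparameters are placeholders,
# not the exact values used to train this checkpoint.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

lora_config = LoraConfig(
    r=128,                      # low-rank adapter dimension (placeholder)
    lora_alpha=256,             # scaling factor (placeholder)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are updated during training
```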

Benchmarking

| Version | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 7B | full_ft-1e | liuhaotian/llava-v1.5-7b | 78.5 | 50.0 | 58.2 | 36.2 | 25.2 | 64.3 | 65.7 | 31.1 |
| LLaVA-1.5 | 7B | lora-1e | liuhaotian/llava-v1.5-7b-lora | 79.1 | 47.8 | 58.2 | - | - | 66.1 | - | 30.2 |
| LLaVA-1.5-S2 | 7B | lora-1e | this model | 80.0 | 50.1 | 61.0 | 37.7 | 25.3 | 66.2 | 67.9 | 32.4 |

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
