---
{}
---
[![CODE](https://img.shields.io/badge/GitHub-Repository-blue)](https://github.com/bfshi/scaling_on_scales)
# When Do We Not Need Larger Vision Models?
## Model
This is a LLaVA-v1.5-13B model trained with [S<sup>2</sup>-Wrapper](https://github.com/bfshi/scaling_on_scales), a simple approach that enables any vision model to perceive high-resolution images. This model uses image resolutions of up to 1008x1008.
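As a rough illustration of the S<sup>2</sup> idea (a minimal sketch of multi-scale feature extraction, not the `s2wrapper` API; the scale set, pooling choice, and helper name are assumptions for illustration), the image is resized to several scales, larger scales are split into crops at the encoder's native resolution, and the per-crop features are merged, pooled back to the base grid, and concatenated channel-wise with the base-scale features:

```python
import torch
import torch.nn.functional as F


def s2_multiscale_features(vision_encoder, image, base_size=336, scales=(1, 2, 3)):
    """Sketch of S2-style multi-scale feature extraction (hypothetical helper).

    `vision_encoder` is assumed to map a (B, 3, base_size, base_size) batch to a
    (B, N, C) grid of patch features (e.g. a CLIP ViT without the CLS token).
    """
    outputs = []
    for s in scales:
        size = base_size * s
        # Resize the image to the current scale (1008 = 336 * 3 for this model).
        scaled = F.interpolate(image, size=(size, size), mode="bilinear", align_corners=False)
        # Split the scaled image into s*s crops of the encoder's native resolution.
        crops = [
            scaled[:, :, i * base_size:(i + 1) * base_size, j * base_size:(j + 1) * base_size]
            for i in range(s) for j in range(s)
        ]
        feats = [vision_encoder(c) for c in crops]  # each: (B, N, C)
        n = feats[0].shape[1]
        hw = int(n ** 0.5)  # patch grid side length
        # Re-assemble crop features into one large spatial grid.
        grids = [f.transpose(1, 2).reshape(-1, f.shape[2], hw, hw) for f in feats]
        rows = [torch.cat(grids[i * s:(i + 1) * s], dim=3) for i in range(s)]
        grid = torch.cat(rows, dim=2)  # (B, C, s*hw, s*hw)
        # Pool back to the base-scale grid so all scales align spatially.
        grid = F.adaptive_avg_pool2d(grid, (hw, hw))
        outputs.append(grid.flatten(2).transpose(1, 2))  # (B, N, C)
    # Concatenate along the channel dimension: feature width grows with len(scales).
    return torch.cat(outputs, dim=-1)
```

With a 336-pixel base encoder and scales (1, 2, 3), the largest scale corresponds to the 1008x1008 resolution mentioned above, and the channel dimension of the visual tokens grows with the number of scales, which the multimodal projector has to accommodate.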
## Training
The training pipeline and dataset follow [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA/tree/main) exactly. The model is fine-tuned with LoRA, as sketched below.
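For context, LoRA fine-tuning injects low-rank adapter matrices into the language model's linear layers while the original weights stay frozen. The following is a minimal sketch using Hugging Face PEFT; the rank, alpha, and target modules are illustrative assumptions, not the exact settings of the LLaVA-v1.5 recipe:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative backbone; LLaVA-v1.5-13B builds on the Vicuna-13B language model.
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")

# Hypothetical LoRA hyperparameters, for illustration only.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the model so that only the low-rank adapters receive gradients.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```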
## Benchmarking
| Version | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|----------|----------|-----------|-----------|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 13B | full_ft-1e | [liuhaotian/llava-v1.5-13b](https://huggingface.co/liuhaotian/llava-v1.5-13b) | 80.0 | 53.6 | 61.3 | 36.4 | 27.6 | 67.7 | 68.2 | 36.1 |
| LLaVA-1.5 | 13B | lora-1e | [liuhaotian/llava-v1.5-13b-lora](https://huggingface.co/liuhaotian/llava-v1.5-13b-lora) | 80.0 | 58.9 | 60.2 | - | - | 68.5 | - | 38.3 |
| LLaVA-1.5-S2 | 13B | lora-1e | this model | **80.9** | 56.0 | **63.1** | **37.4** | **27.8** | 67.9 | **68.9** | 36.4 |
## License
Llama 2 is licensed under the LLAMA 2 Community License,
Copyright (c) Meta Platforms, Inc. All Rights Reserved.