# When Do We Not Need Larger Vision Models?

## Model

This is a LLaVA-v1.5-13b model trained with S2-Wrapper, a simple approach to enable any vision model to perceive high-resolution images. We use image resolutions of up to 1008x1008 for this model.
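The core idea of S2-Wrapper is to reuse a pre-trained, fixed-resolution vision encoder at several image scales: the image is interpolated to each scale, split into native-resolution crops, encoded crop by crop, and the per-scale features are pooled back to the base token grid and concatenated channel-wise. The sketch below illustrates this idea in PyTorch under stated assumptions; it is not the official s2wrapper implementation, and the function name, `encoder` interface, and pooling details are illustrative.

```python
import torch
import torch.nn.functional as F


def s2_multiscale_forward(encoder, image, scales=(1, 2, 3), base_size=336):
    """Run a fixed-resolution `encoder` on multiple scales of `image`
    and concatenate the per-scale features channel-wise.

    Assumes `encoder` maps (B, 3, base_size, base_size) pixels to
    (B, N, C) patch tokens, where N is a perfect square.
    """
    batch, channels = image.shape[0], image.shape[1]
    feats = []
    for s in scales:
        # Resize the image to s times the encoder's native resolution.
        scaled = F.interpolate(
            image, size=(base_size * s, base_size * s),
            mode="bilinear", align_corners=False,
        )
        # Split into an s x s grid of native-resolution crops.
        crops = scaled.unfold(2, base_size, base_size).unfold(3, base_size, base_size)
        crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, channels, base_size, base_size)
        out = encoder(crops)  # (B * s * s, N, C)
        n = int(out.shape[1] ** 0.5)
        c = out.shape[2]
        # Stitch crop features back into one large feature map, then pool
        # to the base token grid so every scale yields the same token count.
        out = out.reshape(batch, s, s, n, n, c).permute(0, 5, 1, 3, 2, 4)
        out = out.reshape(batch, c, s * n, s * n)
        out = F.adaptive_avg_pool2d(out, (n, n))
        feats.append(out)
    # (B, N, C * len(scales)): same token count, wider feature dimension.
    return torch.cat(feats, dim=1).flatten(2).transpose(1, 2)
```

Because each scale is pooled back to the base grid before concatenation, the number of visual tokens stays fixed while the feature dimension grows with the number of scales; this is how resolutions up to 1008x1008 can be used without increasing the LLM's sequence length.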

## Training

The training pipeline and dataset follow LLaVA-v1.5 exactly. We fine-tune the model with LoRA; a configuration sketch follows.
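As a rough illustration of what the LoRA stage looks like, the sketch below attaches a Hugging Face `peft` adapter to the Vicuna-13B-v1.5 language backbone that LLaVA-v1.5 builds on; using the backbone directly is a stand-in for the full multimodal model, which the LLaVA training code assembles. The rank, alpha, dropout, and target modules are assumptions for illustration, not values read off this checkpoint; the authoritative hyperparameters are those in the LLaVA-v1.5 LoRA training scripts.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in for the full LLaVA-v1.5 model: the Vicuna-13B-v1.5 backbone
# that the multimodal model is built on.
base_model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")

# Hyperparameters below are illustrative assumptions; see the
# LLaVA-v1.5 training scripts for the actual recipe.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```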

## Benchmarking

| Version | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 13B | full_ft-1e | liuhaotian/llava-v1.5-13b | 80.0 | 53.6 | 61.3 | 36.4 | 27.6 | 67.7 | 68.2 | 36.1 |
| LLaVA-1.5 | 13B | lora-1e | liuhaotian/llava-v1.5-13b-lora | 80.0 | 58.9 | 60.2 | - | - | 68.5 | - | 38.3 |
| LLaVA-1.5-S2 | 13B | lora-1e | this model | 80.9 | 56.0 | 63.1 | 37.4 | 27.8 | 67.9 | 68.9 | 36.4 |

## License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
