metadata

license: apache-2.0
language:
  - en
metrics:
  - accuracy
pipeline_tag: image-text-to-text

Introduction

We use TinyLLaVA Factory to train a very small image-text-to-text model with only 296M parameters.

The goal is to make it possible to run LLaVA-style models on edge devices that have only a few gigabytes of memory.

For the LLM and the vision tower, we use OpenELM-270M-Instruct and facebook/dinov2-small, respectively.

Result

Category	# Samples	TP	FP	TN	FN	Accuracy	Precision	Recall	F1 Score	Yes Ratio
Adversarial	3000	1264	575	925	236	0.7297	0.6873	0.8427	0.7571	0.6130
Popular	3000	1264	301	1199	236	0.8210	0.8077	0.8427	0.8248	0.5217
Random	2910	1264	290	1120	236	0.8192	0.8134	0.8427	0.8278	0.5340
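
For readers who want to check the derived columns, the following is a minimal sketch of how they follow from the TP/FP/TN/FN counts in the table above. It assumes that "yes" is the positive class and that Yes Ratio is the fraction of samples the model answered "yes" to. The names `Confusion` and `binary_metrics` are hypothetical helpers written for this illustration, not part of TinyLLaVA Factory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for a yes/no task ("yes" assumed to be the positive class)."""
    tp: int
    fp: int
    tn: int
    fn: int


def binary_metrics(c: Confusion) -> dict:
    """Derive the table's metric columns from raw confusion counts."""
    samples = c.tp + c.fp + c.tn + c.fn
    predicted_yes = c.tp + c.fp   # every "yes" answer, right or wrong
    actual_yes = c.tp + c.fn      # every sample whose ground truth is "yes"

    # Guard each ratio against an empty denominator.
    precision = c.tp / predicted_yes if predicted_yes else 0.0
    recall = c.tp / actual_yes if actual_yes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "samples": samples,
        "accuracy": (c.tp + c.tn) / samples if samples else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": predicted_yes / samples if samples else 0.0,
    }


# Counts copied from the three rows of the table above.
rows = {
    "Adversarial": Confusion(tp=1264, fp=575, tn=925, fn=236),
    "Popular": Confusion(tp=1264, fp=301, tn=1199, fn=236),
    "Random": Confusion(tp=1264, fp=290, tn=1120, fn=236),
}

for name, counts in rows.items():
    m = binary_metrics(counts)
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Recall is identical across the three rows because TP and FN are the same in each; only the negative samples (FP and TN) differ between categories.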

Samples 5000, Accuracy 27%

Samples 4241, Correct: 1725, Accuracy: 40.64%, IMG-Accuracy: 36.54%