metadata
license: apache-2.0
language:
  - en
metrics:
  - accuracy
pipeline_tag: image-text-to-text

Introduction

We use TinyLLaVA Factory to build a very small image-text-to-text model with only 296M parameters.

The goal is to make it possible to run LLaVA-style models on edge devices with only a few gigabytes of memory.
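As a rough sanity check on the memory target, the sketch below estimates the size of the weights alone from the 296M parameter count. The dtype table and the helper name `weight_memory_mib` are illustrative assumptions, not part of this repository, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
# Back-of-the-envelope estimate of weight storage only.
# Assumption: every parameter is stored in a single dtype.
PARAMS = 296_000_000  # parameter count stated in this model card

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mib(n_params: int, dtype: str) -> float:
    """Return the size of the weights in MiB for the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**20


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_mib(PARAMS, dtype):.0f} MiB")
```

Under these assumptions the weights take a little over 1 GiB in float32 and roughly half that in float16, which leaves headroom on a device with a few gigabytes of memory.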

For the LLM and the vision tower, we chose OpenELM-270M-Instruct and facebook/dinov2-small, respectively.

Results

POPE:

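| Category | # Samples | TP | FP | TN | FN | Accuracy | Precision | Recall | F1 Score | Yes Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| Adversarial | 3000 | 1264 | 575 | 925 | 236 | 0.7297 | 0.6873 | 0.8427 | 0.7571 | 0.6130 |
| Popular | 3000 | 1264 | 301 | 1199 | 236 | 0.8210 | 0.8077 | 0.8427 | 0.8248 | 0.5217 |
| Random | 2910 | 1264 | 290 | 1120 | 236 | 0.8192 | 0.8134 | 0.8427 | 0.8278 | 0.5340 |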
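
The derived POPE columns follow directly from the four confusion-matrix counts (TP, FP, TN, FN). The sketch below recomputes them using the standard definitions. The function name `pope_metrics` is a hypothetical helper, not part of the POPE or TinyLLaVA Factory code, and "Yes Ratio" is assumed to be the fraction of answers predicted as "yes".

```python
def pope_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recompute the derived columns from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumption: share of all answers that were "yes" (TP + FP).
        "yes_ratio": (tp + fp) / total,
    }


# Counts copied from the POPE results reported above: (TP, FP, TN, FN).
ROWS = {
    "Adversarial": (1264, 575, 925, 236),
    "Popular": (1264, 301, 1199, 236),
    "Random": (1264, 290, 1120, 236),
}

if __name__ == "__main__":
    for name, counts in ROWS.items():
        metrics = pope_metrics(*counts)
        print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Recall is identical across the three splits because TP and FN are the same in each row; only the negative samples differ, which is why precision and accuracy drop on the adversarial split.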

TextVQA:

Samples: 5000, Accuracy: 27%

ScienceQA:

Samples: 4241, Correct: 1725, Accuracy: 40.64%, IMG-Accuracy: 36.54%