metadata
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
Introduction
We use the powerful TinyLLaVA Factory to create a very small image-text-to-text model with only 296M parameters.
The goal is to make it possible to run LLaVA-style models on edge devices that have only a few gigabytes of memory.
For the LLM and the vision tower, we choose OpenELM-270M-Instruct and facebook/dinov2-small, respectively.
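
To give a feel for why 296M parameters suits devices with a few gigabytes of memory, here is a rough back-of-the-envelope sketch of the weight footprint at common precisions. It assumes only the weights are counted (activations, KV cache, and runtime overhead are ignored), and the helper name `weight_memory_mib` is hypothetical, not part of TinyLLaVA Factory.

```python
# Rough weight-memory estimate for a 296M-parameter model.
# Assumption: only weights are counted; activations, KV cache,
# and framework overhead are ignored, so real usage will be higher.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

NUM_PARAMS = 296_000_000  # parameter count stated in this model card


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Return the size of the weights in MiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_mib(NUM_PARAMS, dtype):.0f} MiB")
```

Even at full float32 precision the weights alone stay a little above 1 GiB, which leaves headroom for activations on a device with a few gigabytes of RAM.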
Results
POPE:
Category | # Samples | TP | FP | TN | FN | Accuracy | Precision | Recall | F1 Score | Yes Ratio |
---|---|---|---|---|---|---|---|---|---|---|
Adversarial | 3000 | 1264 | 575 | 925 | 236 | 0.7297 | 0.6873 | 0.8427 | 0.7571 | 0.6130 |
Popular | 3000 | 1264 | 301 | 1199 | 236 | 0.8210 | 0.8077 | 0.8427 | 0.8248 | 0.5217 |
Random | 2910 | 1264 | 290 | 1120 | 236 | 0.8192 | 0.8134 | 0.8427 | 0.8278 | 0.5340 |
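
The derived columns in the POPE table follow directly from the TP/FP/TN/FN counts. The sketch below recomputes them, assuming the standard definitions and taking Yes Ratio to be the fraction of samples answered "yes", i.e. (TP + FP) / total, which matches the table values. The function name `pope_metrics` is hypothetical and is not taken from the evaluation scripts.

```python
def pope_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recompute the POPE summary metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumed definition: share of samples where the model answered "yes".
        "yes_ratio": (tp + fp) / total,
    }


# (TP, FP, TN, FN) counts copied from the table above.
ROWS = {
    "Adversarial": (1264, 575, 925, 236),
    "Popular": (1264, 301, 1199, 236),
    "Random": (1264, 290, 1120, 236),
}

if __name__ == "__main__":
    for name, counts in ROWS.items():
        metrics = pope_metrics(*counts)
        print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Recall is identical across the three splits because they share the same positive questions; only the negative samples, and therefore FP and TN, change.
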
Samples: 5000, Accuracy: 27%
Samples: 4241, Correct: 1725, Accuracy: 40.64%, IMG-Accuracy: 36.54%