---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
---
# Introduction
We use the powerful [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) to create a very small image-text-to-text model with only 296M parameters.
The goal is to make it possible to run LLaVA models on edge devices (with only a few gigabytes of memory).
For the LLM and vision tower, we choose [apple/OpenELM-270M-Instruct](https://huggingface.co/apple/OpenELM-270M-Instruct) and [facebook/dinov2-small](https://huggingface.co/facebook/dinov2-small), respectively.
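For inference, a minimal sketch in the style of other TinyLLaVA Factory model cards is shown below; the repo id and the image URL are placeholders, and the remote-code `chat` interface is an assumption based on TinyLLaVA Factory conventions, so adjust them to this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- replace with this model's actual Hub path.
hf_path = "your-username/tinyllava-openelm-270m-dinov2-small"

# TinyLLaVA Factory checkpoints ship their own modeling code,
# so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
model.eval()

prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"  # example image

# chat() comes from the remote code bundled with TinyLLaVA Factory
# checkpoints; it returns the answer text and the generation time.
output_text, generation_time = model.chat(
    prompt=prompt,
    image=image_url,
    tokenizer=tokenizer,
)
print(output_text)
```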
# Results
[POPE](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html#pope):
| Category | # Samples | TP | FP | TN | FN | Accuracy | Precision | Recall | F1 Score | Yes Ratio |
|-------------|------------|------|-----|------|-----|----------|-----------|--------|----------|-----------|
| Adversarial | 3000 | 1264 | 575 | 925 | 236 | 0.7297 | 0.6873 | 0.8427 | 0.7571 | 0.613 |
| Popular | 3000 | 1264 | 301 | 1199 | 236 | 0.8210 | 0.8077 | 0.8427 | 0.8248 | 0.5217 |
| Random | 2910 | 1264 | 290 | 1120 | 236 | 0.8192 | 0.8134 | 0.8427 | 0.8278 | 0.5340 |
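The derived metrics follow directly from the confusion-matrix counts; a short sanity-check snippet reproducing the Adversarial row:

```python
# Recompute POPE metrics from the confusion-matrix counts
# (Adversarial split: TP=1264, FP=575, TN=925, FN=236).
tp, fp, tn, fn = 1264, 575, 925, 236

total = tp + fp + tn + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
yes_ratio = (tp + fp) / total  # fraction of questions answered "yes"

print(f"Accuracy={accuracy:.4f}, Precision={precision:.4f}, "
      f"Recall={recall:.4f}, F1={f1:.4f}, Yes Ratio={yes_ratio:.4f}")
# -> Accuracy=0.7297, Precision=0.6873, Recall=0.8427, F1=0.7571, Yes Ratio=0.6130
```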
[TEXTVQA](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html#textvqa):
Samples: 5000, Accuracy: 27%
[SCIENCEQA](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html#scienceqa):
Samples: 4241, Correct: 1725, Accuracy: 40.64%, IMG-Accuracy: 36.54%