REVERSE-v1.5-7B
Model Summary
REVERSE-v1.5-7B is a novel open-source vision-language model (VLM) that performs both next-token prediction and self-verification / self-correction during generation. Built on top of LLaVA-v1.5-7B, it is fine-tuned on the REVERSE Visual Instruct 1.3M dataset and equipped with a retrospective resampling mechanism that allows it to detect and correct hallucinations as it generates. The model was trained in early March 2025.
Performance
REVERSE achieves state-of-the-art hallucination reduction across a wide range of captioning and open-ended visual question answering benchmarks:
Benchmark | Metric | Best Baseline | REVERSE (τ=0.003) | REVERSE (τ=0.0003) |
---|---|---|---|---|
CHAIR-MSCOCO | CHAIR (↓) | HA-DPO (11.0) | 10.3 | 6.1 |
CHAIR-MSCOCO | CHAIRs (↓) | EOS (38.2) | 37.0 | 13.6 |
AMBER-G | Hallucination (↓) | EOS (5.1) | 6.0 | 4.0 |
AMBER-G | Coverage (↑) | HALVA (53.0) | 52.2 | 26.9 |
MMHal-Bench | Score (↑) | DoLA (2.33) | 2.56 | 3.28 |
MMHal-Bench | Hallucination Rate (↓) | HACL (0.50) | 0.47 | 0.30 |
HaloQuest | Avg. Accuracy (↑) | HALVA (23.9) | 30.7 | 32.3 |
HaloQuest | False Premise Acc. (↑) | HALVA (21.1) | 31.8 | 29.4 |
HaloQuest | Visual Challenging Acc. (↑) | DoLA (40.1) | 31.5 | 18.7 |
HaloQuest | Insufficient Context Acc. (↑) | HALVA (10.7) | 26.9 | 58.8 |
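For readers unfamiliar with the CHAIR metrics in the table, the following is a minimal sketch of how the instance-level and sentence-level variants are commonly defined: the instance-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score (CHAIRs) is the fraction of captions containing at least one such object. The function name `chair_scores` and the toy data are illustrative assumptions, not part of the REVERSE codebase, and the sketch assumes the objects mentioned in each caption have already been extracted as sets; the official benchmark additionally maps caption words to MSCOCO object categories, which is skipped here.

```python
from typing import Dict, List, Set


def chair_scores(
    mentioned: List[Set[str]], ground_truth: List[Set[str]]
) -> Dict[str, float]:
    """Toy CHAIR computation over pre-extracted object sets (percentages).

    mentioned[k]    : objects named in the k-th generated caption
    ground_truth[k] : objects actually present in the k-th image
    """
    if len(mentioned) != len(ground_truth):
        raise ValueError("mentioned and ground_truth must have the same length")

    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for objs, truth in zip(mentioned, ground_truth):
        hallucinated = objs - truth  # named in the caption but absent from the image
        total_mentions += len(objs)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    # Guard against empty inputs so the ratios are always defined.
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned) if mentioned else 0.0
    return {"CHAIR_i": 100.0 * chair_i, "CHAIR_s": 100.0 * chair_s}


# Hypothetical example: three captions, one of which names an absent object.
mentioned = [{"dog", "frisbee"}, {"cat", "sofa", "remote"}, {"person", "bicycle"}]
ground_truth = [{"dog", "frisbee", "person"}, {"cat", "sofa"}, {"person", "bicycle", "car"}]

scores = chair_scores(mentioned, ground_truth)
print(f"CHAIR_i = {scores['CHAIR_i']:.1f}, CHAIR_s = {scores['CHAIR_s']:.1f}")
# prints "CHAIR_i = 14.3, CHAIR_s = 33.3"
```

In this toy case one of seven mentioned objects is hallucinated, and one of three captions contains a hallucination. Lower is better for both scores.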
On discriminative tasks, it stays close to the base VLM on AMBER-D and POPE, with a lower score on MME-Hall:
Benchmark | Metric | LLaVA-v1.5-7B | REVERSE (τ=0.5) |
---|---|---|---|
AMBER-D | F1 Score (↑) | 74.7 | 74.2 |
POPE | F1 Score (↑) | 85.9 | 85.9 |
MME-Hall | Score (↑) | 648.3 | 601.6 |
Usage
Please refer to the installation guide on GitHub to get started:
Installation Guide
Additional Resources
- Project Page: https://reverse-vlm.github.io/
- Dataset: REVERSE Visual Instruct 1.3M
- Ask Questions: GitHub Issues
Intended Use
Primary Use Cases:
- Reducing hallucination in image captioning and VQA tasks
- Benchmarking hallucination-aware generation
- Research on grounded vision-language generation and self-correction
Target Users:
Researchers, developers, and students working in computer vision, NLP, and multimodal AI.
Model tree for tsunghanwu/reverse_llava_v15
Base model
lmsys/vicuna-7b-v1.5