Overview

Volcano employs a single LMM to generate initial responses, feedback, and revisions, and to decide whether to accept each revision. It follows an iterative critique-revision-decide loop.
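Below is a minimal sketch of this loop, assuming a hypothetical `generate(prompt, image)` helper that queries the single LMM. The prompt templates, the acceptance check, and the iteration cap are illustrative placeholders, not the authors' exact implementation.

```python
MAX_ITERS = 3  # illustrative cap on revision rounds

def volcano_inference(question, image, generate):
    # 1. Draft an initial response with the LMM.
    response = generate(f"Question: {question}\nAnswer:", image)

    for _ in range(MAX_ITERS):
        # 2. Critique: the same model gives feedback on its own response.
        feedback = generate(
            f"Question: {question}\nResponse: {response}\n"
            "Provide feedback on the response given the image.",
            image,
        )
        # 3. Revise: generate an improved response from the feedback.
        revision = generate(
            f"Question: {question}\nResponse: {response}\n"
            f"Feedback: {feedback}\nRevise the response.",
            image,
        )
        # 4. Decide: the model chooses between the old and revised answers.
        choice = generate(
            f"Question: {question}\nA: {response}\nB: {revision}\n"
            "Which response is better? Answer A or B.",
            image,
        )
        if choice.strip().startswith("B"):
            response = revision  # accept the revision and iterate again
        else:
            break  # revision rejected: keep the current response and stop
    return response
```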

Model details

Model type: Volcano-7b is a multimodal self-feedback guided revision model. It was built by fine-tuning vicuna-7b-v1.5 on a mixture of the visual instruction tuning data used in LLaVA-v1.5 and multimodal feedback and revision data collected with gpt-3.5-turbo.

Model date: Volcano-7b was trained in October 2023.

Training dataset

  • 274K multimodal feedback and revision data
  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP
  • 158K GPT-generated multimodal instruction-following data
  • 450K academic-task-oriented VQA data mixture
  • 40K ShareGPT data

The dataset used to train Volcano, which combines all of the datasets listed above, is publicly available.
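As a rough illustration, the mixture above could be assembled with the Hugging Face `datasets` library. The file names below are hypothetical placeholders, assuming each component is stored as a JSON file in LLaVA's conversation format; this is a sketch, not the authors' training pipeline.

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical file names for the five components of the mixture.
files = [
    "feedback_revision_274k.json",   # multimodal feedback & revision data
    "blip_laion_cc_sbu_558k.json",   # filtered image-text pairs
    "llava_instruct_158k.json",      # GPT-generated instruction data
    "vqa_mixture_450k.json",         # academic-task-oriented VQA
    "sharegpt_40k.json",             # text-only ShareGPT conversations
]

# Load each component and concatenate into a single shuffled mixture.
parts = [load_dataset("json", data_files=f)["train"] for f in files]
mixture = concatenate_datasets(parts).shuffle(seed=42)
```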

Evaluation dataset

A collection of three multimodal hallucination benchmarks (MMHal-Bench, POPE, GAVIE) and two multimodal understanding benchmarks (MM-Vet, MMBench).
