Edit model card

LLaVA-Gemma Model Card

This model card corresponds to the 2B version of the model with the CLIP-based vision encoder.

Preprint: arxiv.org/abs/2404.01331

Overview

llava-gemma-2b is a large multimodal model (LMM) trained using the LLaVA-v1.5 framework with the 2-billion parameter google/gemma-2b-it model as language backbone.

Uses

The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.

Bias, Risks, and Limitations

This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.

How to Get Started with the Model

Currently using llava-gemma requires a modified preprocessor.

We are currently working on modifying the LlavaProcessor class to streamline usage (see PR #30030), expect updates soon.

For current usage, see usage.py or the following code block:

import requests
from PIL import Image
from transformers import (
  LlavaForConditionalGeneration,
  AutoTokenizer,
  CLIPImageProcessor
)
from processing_llavagemma import LlavaGemmaProcessor # This is in this repo

checkpoint = "Intel/llava-gemma-2b"

# Load model
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint)
)

# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

Training Details

The llava-gemma-2b model was trained on 8 Gaudi 2 accelerators.

Training Data

The model was trained using the LLaVA-v1.5 data mixture.

This is listed as follows:

  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158K GPT-generated multimodal instruction-following data.
  • 450K academic-task-oriented VQA data mixture.
  • 40K ShareGPT data.

Evaluation

LM Backbone Vision Model Pretrained Connector GQA MME cognition MME perception MM-Vet POPE accuracy POPE F1 VQAv2 TextVQA ScienceQA Image MMVP
gemma-2b-it CLIP Yes 0.531 236.071 1130.492 17.706 0.850 0.839 70.65 28.06 0.564 0.287
gemma-2b-it CLIP No 0.481 247.857 934.611 13.119 0.784 0.762 61.74 0.549 0.180
gemma-7b-it CLIP Yes 0.472 253.571 894.910 18.165 0.848 0.829 68.7 0.625 0.327
gemma-7b-it CLIP No 0.472 278.214 857.274 19.083 0.782 0.734 65.09 0.636 0.240
gemma-2b-it DinoV2 Yes 0.587 307.143 1132.970 19.128 0.853 0.838 71.37 12.53 0.555 0.227
gemma-2b-it DinoV2 No 0.501 308.929 959.351 14.541 0.793 0.772 61.65 11.1 0.568 0.180
Downloads last month
5,284
Safetensors
Model size
2.82B params
Tensor type
F32
·
Unable to determine this model’s pipeline type. Check the docs .