Edit model card

Model Details: LLaVA-Gemma-2b

llava-gemma-2b is a large multimodal model (LMM) trained using the LLaVA-v1.5 framework with the 2-billion parameter google/gemma-2b-it model as language backbone and the CLIP-based vision encoder.

Model Details Description
Authors Intel: Musashi Hinck, Matthew Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal
Date March 2024
Version 1
Type Large multimodal model (LMM)
Paper or Other Resources LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
License Gemma
Questions or Comments Community Tab and Intel DevHub Discord

This model card was created by Benjamin Consolvo and the authors listed above.

Intended Use

Intended Use Description
Primary intended uses The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.
Primary intended users Anyone using or evaluating multimodal models.
Out-of-scope uses This model is not intended for uses that require high levels of factuality, high stakes situations, mental health or medical applications, generating misinformation or disinformation, impersonating others, facilitating or inciting harassment or violence, any use that could lead to the violation of a human right under the UN Declaration of Human Rights.

How to use

Currently, using llava-gemma requires a modified preprocessor. We are currently working on modifying the LlavaProcessor class to streamline usage (see PR #30030). Expect updates soon.

For current usage, see usage.py or the following code block:

import requests
from PIL import Image
from transformers import (
  LlavaForConditionalGeneration,
  AutoTokenizer,
  CLIPImageProcessor
)
from processing_llavagemma import LlavaGemmaProcessor # This is in this repo

checkpoint = "Intel/llava-gemma-2b"

# Load model
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint)
)

# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

For straightforward use as a chatbot (without images), you can modify the last portion of code to the following:

# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "Summarize the following paragraph? In this paper, we introduced LLaVA-Gemma, a compact vision-language model leveraging the Gemma Large Language Model in two variants, Gemma-2B and Gemma-7B. Our work provides a unique opportunity for researchers to explore the trade-offs between computational efficiency and multimodal understanding in small-scale models. The availability of both variants allows for a comparative analysis that sheds light on how model size impacts performance in various tasks. Our evaluations demonstrate the versatility and effectiveness of LLaVA-Gemma across a range of datasets, highlighting its potential as a benchmark for future research in small-scale vision-language models. With these models, future practitioners can optimize the performance of small-scale multimodal models more directly."}],
    tokenize=False,
    add_generation_prompt=True
)
# url = "https://www.ilankelman.org/stopsigns/australia.jpg"
# image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=None, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=300)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

Factors

Factors Description
Groups -
Instrumentation -
Environment Trained for 4 hours on 8 Intel Gaudi 2 AI accelerators.
Card Prompts Model training and deployment on alternate hardware and software will change model performance

Metrics

Metrics Description
Model performance measures We evaluate the LlaVA-Gemma models on a similar collection of benchmarks to other LMM works: GQA; MME; MM-Vet; POPE (accuracy and F1); VQAv2; MMVP; the image subset of ScienceQA. Our experiments provide insights into the efficacy of various design choices within the LLaVA framework.
Decision thresholds -
Approaches to uncertainty and variability -

Training Data

The model was trained using the LLaVA-v1.5 data mixture. This is listed as follows:

  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158K GPT-generated multimodal instruction-following data.
  • 450K academic-task-oriented VQA data mixture.
  • 40K ShareGPT data.

Quantitative Analyses

Performance of LLaVA-Gemma models across seven benchmarks. Highlighted box indicates strongest performance amongst LLaVA-Gemma models. Bottom two rows show self-reported performance of Llava Phi-2 and LLaVA-v1.5 respectively. The bolded gemma-2b-it is the current model used here in this model card.

LM Backbone Vision Model Pretrained Connector GQA MME cognition MME perception MM-Vet POPE accuracy POPE F1 VQAv2 ScienceQA Image MMVP
gemma-2b-it CLIP Yes 0.531 236 1130 17.7 0.850 0.839 70.65 0.564 0.287
gemma-2b-it CLIP No 0.481 248 935 13.1 0.784 0.762 61.74 0.549 0.180
gemma-2b-it DinoV2 Yes 0.587 307 1133 19.1 0.853 0.838 71.37 0.555 0.227
gemma-2b-it DinoV2 No 0.501 309 959 14.5 0.793 0.772 61.65 0.568 0.180
gemma-7b-it CLIP Yes 0.472 253 895 18.2 0.848 0.829 68.7 0.625 0.327
gemma-7b-it CLIP No 0.472 278 857 19.1 0.782 0.734 65.1 0.636 0.240
gemma-7b-it DinoV2 Yes 0.519 257 1021 14.3 0.794 0.762 65.2 0.628 0.327
gemma-7b-it DinoV2 No 0.459 226 771 12.2 0.693 0.567 57.4 0.598 0.267
Phi-2b CLIP Yes - - 1335 28.9 - 0.850 71.4 0.684 -
Llama-2-7b CLIP Yes 0.620 348 1511 30.6 0.850 0.859 78.5 0.704 46.1

Ethical Considerations

Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See Intel’s Global Human Rights Principles. Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.

Ethical Considerations Description
Data The model was trained using the LLaVA-v1.5 data mixture as described above.
Human life The model is not intended to inform decisions central to human life or flourishing.
Mitigations No additional risk mitigation strategies were considered during model development.
Risks and harms This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
Use cases -

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Citation details

@misc{hinck2024llavagemma,
      title={LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model}, 
      author={Musashi Hinck and Matthew L. Olson and David Cobbley and Shao-Yen Tseng and Vasudev Lal},
      year={2024},
      eprint={2404.01331},
      url={https://arxiv.org/abs/2404.01331},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Downloads last month
4,562
Safetensors
Model size
2.82B params
Tensor type
F32
·
Inference API (serverless) does not yet support transformers models for this pipeline type.

Finetuned from

Evaluation results