
LLaVA-Llama3 Model Card

This model card corresponds to the instruction-tuned 8B version of the model with a CLIP-based vision encoder.

Overview

llava-llama-3-8b is a large multimodal model (LMM) trained using the LLaVA-v1.5 framework, with the 8-billion-parameter meta-llama/Meta-Llama-3-8B-Instruct model as its language backbone and a CLIP-based vision encoder.
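
As a quick sanity check, the two backbones can be read off the published configuration. This is a minimal sketch, assuming the checkpoint ships a standard transformers LlavaConfig; the field names below come from the library, not from this model card:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Intel/llava-llama-3-8b")
# The vision tower is a CLIP-style encoder and the language model is Llama-based.
print(config.vision_config.model_type)  # expected: "clip_vision_model"
print(config.text_config.model_type)    # expected: "llama"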

Uses

The model has been fine-tuned for multimodal benchmark evaluations, but it can also be used as a multimodal chatbot.

Bias, Risks, and Limitations

This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.

Training Details

The llava-llama-3-8b model was trained on a 4-node cluster with a total of 32 Intel Gaudi 2 accelerators.

Training Data

The model was trained on the LLaVA-v1.5 data mixture, which consists of:

  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158K GPT-generated multimodal instruction-following data.
  • 450K academic-task-oriented VQA data mixture.
  • 40K ShareGPT data.

Evaluation

Benchmark                    Score
ScienceQA                    72.9797
MMVet                        31.9725
LLaVA-Bench (In-the-Wild)    56.9 / 61.9 / 73.6 / 65.7
POPE                         Acc 87.33, F1 86.5
GQA                          60.6138
MMVP                         36

License

The model weights are released under the Intel Research Use License Agreement (see the LICENSE file).
All usage code is licensed under Apache 2.0.

Usage

Please note that we provide only the trained weight difference (delta) and do not provide a copy of the base meta-llama/Meta-Llama-3-8B-Instruct model. Using these weights requires a separate download of the base model; the script below adds the base weights back in.

# Copyright 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForPreTraining
import transformers

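# Pad a PIL image to a square canvas filled with `background_color`,
# keeping the original image centered (LLaVA-style square padding).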
def expand2square(pil_img, background_color):
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result

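# Add model_a's weights onto model_b's weights in place. This repo stores only
# the delta from the base Llama-3 model, so the base weights must be added back
# before the model can be used.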
def add_model_a_to_b(model_a, model_b):
    state_dict_a = model_a.state_dict()
    state_dict_b = model_b.state_dict()

    # Ensure the two state dicts have matching keys before adding
    if set(state_dict_a.keys()) != set(state_dict_b.keys()):
        raise ValueError("Model state dicts do not have the same keys.")

    for key in state_dict_a:
        if state_dict_a[key].shape != state_dict_b[key].shape:
            raise ValueError(f"Shape mismatch for key '{key}': {state_dict_a[key].shape} vs {state_dict_b[key].shape}")
        # Add model_a's weights to model_b's weights for the matching key
        state_dict_b[key] = state_dict_b[key] + state_dict_a[key]

    # Update model_b with the new weights
    model_b.load_state_dict(state_dict_b)

output_checkpoint = "" # set if you don't want to merge every time
hf_checkpoint = "Intel/llava-llama-3-8b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(hf_checkpoint)
model = AutoModelForPreTraining.from_pretrained(hf_checkpoint)
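# If the last embedding row is all zeros, the checkpoint still holds only the
# delta weights, so load the base Llama-3 model and add its weights in.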
if model.language_model.model.embed_tokens.weight[-1].sum() == 0:
    print("adding llama3 weights")
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="cpu",
    )
    llama3 = pipeline.model
    add_model_a_to_b(llama3, model.language_model)
    if output_checkpoint:
        print("saving weights, so no adding is needed again")
        model.save_pretrained(output_checkpoint)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

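# Build a Llama-3 chat prompt; the <image> placeholder marks where the
# image features are inserted by the processor.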
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The original LLaVA implementation pads images with the per-channel image mean,
# while the HF LLaVA processor pads with zeros, so pad to a square with the mean color here.
image = expand2square(image, tuple(int(x * 255) for x in processor.image_processor.image_mean))
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
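
Once output_checkpoint is set and the merged weights have been saved, subsequent runs can load that checkpoint directly with from_pretrained and skip the merging step.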