metadata

license: mit
datasets:
  - liuhaotian/LLaVA-Pretrain
  - liuhaotian/LLaVA-Instruct-150K
language:
  - en
  - zh
library_name: transformers

WORK IN PROGRESS

Model type

TinyLLaVA is a small (1.4B-parameter) model trained with the exact training recipe of LLaVA-1.5. We use TinyLlama as the LLM backbone and clip-vit-large-patch14-336 as the vision backbone.

Model Performance

We have evaluated TinyLLaVA on GQA, VizWiz, VQAv2, TextVQA and SQA.

Model	VQAv2	GQA	SQA	TextVQA	VizWiz
TinyLLaVA-v1-1.4B	73.41	57.54	59.40	46.37	49.56
BLIP-2	41.00	41.00	61.00	42.50	19.60
LLaVA-v1.5-7B	78.50	62.00	66.80	61.3	50
LLaVA-v1.5-13B	80.00	63.30	71.60	61.3	53.6
Qwen-VL-7B	78.80	59.30	67.10	63.8	35.2
Qwen-VL-13B	78.20	57.50	68.20	61.5	38.9

More evaluations are ongoing.
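To read the table as relative performance, the scores can be divided by those of a larger baseline. The snippet below is a minimal sketch that copies two rows from the table; `relative_scores` is a hypothetical helper introduced here for illustration, and the choice of LLaVA-v1.5-7B as the baseline is an assumption.

```python
from typing import Dict

# Scores copied from the table above, in column order.
BENCHMARKS = ["VQAv2", "GQA", "SQA", "TextVQA", "VizWiz"]
SCORES = {
    "TinyLLaVA-v1-1.4B": [73.41, 57.54, 59.40, 46.37, 49.56],
    "LLaVA-v1.5-7B": [78.50, 62.00, 66.80, 61.3, 50],
}


def relative_scores(model: str, baseline: str) -> Dict[str, float]:
    """Return model/baseline score ratios per benchmark (1.0 means parity)."""
    return {
        name: m / b
        for name, m, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    }


# Hypothetical usage: compare the 1.4B model against the 7B baseline.
for name, ratio in relative_scores("TinyLLaVA-v1-1.4B", "LLaVA-v1.5-7B").items():
    print(f"{name}: {ratio:.1%} of baseline")
```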

Model use

The weights have been converted to the Hugging Face Transformers (hf) format.

How to use the model

First, make sure you have transformers >= 4.35.3 installed. The model supports multi-image and multi-prompt generation, meaning that you can pass multiple images in a single prompt. Also follow the correct prompt template (USER: xxx\nASSISTANT:) and add the token <image> at each location where you want the model to query an image.
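The prompt template can also be assembled programmatically. The following is a minimal sketch: `build_prompt` is a hypothetical helper (it is not part of transformers), and placing one `<image>` token per line before the question is an assumption made for illustration.

```python
def build_prompt(question: str, num_images: int = 1) -> str:
    """Format a single-turn prompt using the USER/ASSISTANT template.

    Hypothetical helper: emits one <image> placeholder per image, each on its
    own line before the question (an assumption; the token may go elsewhere).
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Each placeholder is followed by a newline, so zero images adds nothing.
    image_tokens = "<image>\n" * num_images
    return f"USER: {image_tokens}{question}\nASSISTANT:"


# repr() shows the newline characters explicitly.
print(repr(build_prompt("What are these?")))
```

With the default of one image, the helper returns the same string used for the `prompt` variable in the examples on this page.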

Using `pipeline`:

The example below uses the "bczhou/tiny-llava-v1-hf" checkpoint.

from transformers import pipeline
from PIL import Image
import requests
model_id = "bczhou/tiny-llava-v1-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0])
# Example output:
# {'generated_text': 'USER:  \nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT: The label 15 represents lava, which is a type of volcanic rock.'}
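As the output shows, the generated text echoes the prompt before the answer. If only the answer is needed, it can be split off at the final `ASSISTANT:` marker. This is a minimal sketch; `extract_answer` is a hypothetical helper, not a transformers API, and the sample dictionary reuses the output shown above.

```python
def extract_answer(generated_text: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last occurrence of `marker`, stripped.

    Hypothetical helper; if the marker is absent, the whole text is returned.
    """
    # rpartition splits on the final marker, so multi-turn prompts still work.
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


# Sample mirroring the pipeline output shown above.
sample = {
    "generated_text": "USER:  \nWhat does the label 15 represent? "
    "(1) lava (2) core (3) tunnel (4) ash cloud\n"
    "ASSISTANT: The label 15 represents lava, which is a type of volcanic rock."
}
print(extract_answer(sample["generated_text"]))
# prints: The label 15 represents lava, which is a type of volcanic rock.
```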

Using pure `transformers`:

Below is an example script to run generation in float16 precision on a GPU device:

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "bczhou/tiny-llava-v1-hf"
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)
raw_image = Image.open(requests.get(image_file, stream=True).raw)
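# Pass text and images by keyword so the call does not depend on positional argument order.
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)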
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))