metadata
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
This repository is a pre-release checkpoint for Llama 3.2 11B Vision.
It contains two versions of the model: one for use with transformers and one for the original llama3 codebase (under the original directory).
Inference with transformers
Please install the in-progress development wheel from https://huggingface.co/nltpt/transformers/tree/main.
This is an example inference snippet (API subject to change):
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
# Load the weights in bfloat16; device_map="auto" places them on the available device(s).
model_id = "nltpt/Llama-3.2-11B-Vision"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)
# The image placeholder token comes first, then <|begin_of_text|> and the text to complete.
prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
# Download the example image.
url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)
# Preprocess text and image together, then move the tensors to the model's device.
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(model.device)
# Greedy decoding, capped at 25 new tokens.
output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
print(processor.decode(output[0], skip_special_tokens=True))
Output:
If I had to write a haiku for this one, it would be:.\nA dock on a lake.\nA mountain in the distance.\nA long exposure.\
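The prompt in the snippet above places the image placeholder before the begin-of-text token. A minimal sketch of that layout as a reusable function follows; `build_prompt` and its `with_image` parameter are hypothetical names introduced here, not part of transformers, and the token order is simply copied from the example prompt. Whether other layouts are accepted by this checkpoint is not something this card states.

```python
# Special tokens exactly as they appear in the example prompt above.
IMAGE_TOKEN = "<|image|>"
BEGIN_OF_TEXT = "<|begin_of_text|>"


def build_prompt(text: str, with_image: bool = True) -> str:
    """Hypothetical helper: assemble a completion prompt in the layout used above.

    The image placeholder (if any) comes first, then the begin-of-text
    token, then the text the model should continue.
    """
    prefix = IMAGE_TOKEN if with_image else ""
    return prefix + BEGIN_OF_TEXT + text


# Reproduces the prompt string from the inference snippet.
prompt = build_prompt("If I had to write a haiku for this one")
print(prompt)
```

The resulting string can be passed as `text=prompt` to the processor call shown earlier.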
Running the original checkpoints
Installing the llama-models package provides three binaries:
- example_chat_completion
- example_text_completion
- multimodal_example_chat_completion
You can invoke them via torchrun, for example:
CHECKPOINT_DIR=~/.llama/checkpoints/Llama-3.2-11B-Vision/
torchrun `which multimodal_example_chat_completion` "$CHECKPOINT_DIR"
To study the code for these scripts, locate the installed package directory:
PACKAGE_DIR=$(pip show -f llama-models | grep Location | awk '{ print $2 }')
echo "Scripts are in the directory: $PACKAGE_DIR/llama_models/scripts/"
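The same lookup can be sketched in Python without parsing pip output. This is an illustration under stated assumptions: `package_scripts_dir` is a hypothetical helper, and it assumes the llama-models distribution is importable as `llama_models` (Python import names cannot contain hyphens) with its scripts in a `scripts` subdirectory.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Optional


def package_scripts_dir(import_name: str, subdir: str = "scripts") -> Optional[Path]:
    """Return <package dir>/<subdir> for an importable package, else None.

    The path is built without checking that <subdir> actually exists.
    """
    spec = find_spec(import_name)
    # Missing packages yield None; plain modules have no search locations.
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = Path(next(iter(spec.submodule_search_locations)))
    return package_dir / subdir


# Assumption: the llama-models distribution installs the `llama_models` package.
scripts_dir = package_scripts_dir("llama_models")
if scripts_dir is None:
    print("llama_models is not installed in this environment")
else:
    print(f"Scripts are in the directory: {scripts_dir}")
```

Using `find_spec` locates the package without importing it, so the check stays cheap even when the package pulls in heavy dependencies.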