¹S-Lab, Nanyang Technological University
²Microsoft Research, Redmond
## 🦦 Simple Code For Otter-9B
Here is an example of multi-modal in-context learning (ICL) with 🦦 Otter. We provide two demo images with their corresponding instructions and answers, then ask the model to generate an answer for a new query image given our instruction. You can change the instruction and see how the model responds.
``` python
import requests
import torch
import transformers
from PIL import Image
from otter.modeling_otter import OtterForConditionalGeneration
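# Load the Otter-9B checkpoint; device_map="auto" lets Accelerate place the
# weights on the available GPU(s)/CPU automatically.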
model = OtterForConditionalGeneration.from_pretrained(
"luodian/otter-9b-hf", device_map="auto"
)
tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()
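# Fetch two in-context demo images and one query image from the COCO dataset.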
demo_image_one = Image.open(
requests.get(
"http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw
)
demo_image_two = Image.open(
requests.get(
"http://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
).raw
)
query_image = Image.open(
requests.get(
"http://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
).raw
)
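# Preprocess the three images and add frame and batch dimensions, giving vision_x
# the shape (batch=1, num_media=3, num_frames=1, channels, height, width).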
vision_x = (
image_processor.preprocess(
[demo_image_one, demo_image_two, query_image], return_tensors="pt"
)["pixel_values"]
.unsqueeze(1)
.unsqueeze(0)
)
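# Left-pad so generation continues directly from the end of the prompt.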
model.text_tokenizer.padding_side = "left"
# Build the interleaved prompt: each <image> token marks where an image is attended to,
# and two in-context (instruction, answer) pairs precede the query instruction.
lang_x = model.text_tokenizer(
    [
        "<image> User: what does the image describe? GPT: <answer> two cats sleeping. <|endofchunk|> <image> User: what does the image describe? GPT: <answer> a bathroom sink. <|endofchunk|> <image> User: what does the image describe? GPT: <answer>"
    ],
return_tensors="pt",
)
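# Generate up to 256 new tokens with greedy decoding (num_beams=1).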
generated_text = model.generate(
vision_x=vision_x.to(model.device),
lang_x=lang_x["input_ids"].to(model.device),
attention_mask=lang_x["attention_mask"].to(model.device),
max_new_tokens=256,
num_beams=1,
no_repeat_ngram_size=3,
)
print("Generated text: ", model.text_tokenizer.decode(generated_text[0]))
```
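Because the prompt is plain text, you can wrap the steps above in a small helper and swap in your own instruction for the query image while keeping the two in-context demonstrations. The sketch below is a minimal example that reuses the `model` and `vision_x` objects created above; the helper name `get_response` and its prompt template are illustrative, not part of the Otter API.

``` python
def get_response(instruction: str) -> str:
    # Hypothetical helper: keep the two in-context (instruction, answer) pairs
    # and append a new instruction for the third (query) image.
    prompt = (
        "<image> User: what does the image describe? GPT: <answer> two cats sleeping. <|endofchunk|> "
        "<image> User: what does the image describe? GPT: <answer> a bathroom sink. <|endofchunk|> "
        f"<image> User: {instruction} GPT: <answer>"
    )
    lang_x = model.text_tokenizer([prompt], return_tensors="pt")
    generated = model.generate(
        vision_x=vision_x.to(model.device),
        lang_x=lang_x["input_ids"].to(model.device),
        attention_mask=lang_x["attention_mask"].to(model.device),
        max_new_tokens=256,
        num_beams=1,
        no_repeat_ngram_size=3,
    )
    return model.text_tokenizer.decode(generated[0])


print(get_response("what color is the wall?"))
```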