¹S-Lab, Nanyang Technological University
²Microsoft Research, Redmond
## 🦦 Simple Code For Otter-9B
Here is an example of multi-modal in-context learning (ICL) with 🦦 Otter. We provide two demo images with their corresponding instructions and answers, then ask the model to generate an answer for a new query image given our instruction. You can change the instruction and see how the model responds.
``` python
import requests
import torch
import transformers
from PIL import Image
from otter.modeling_otter import OtterForConditionalGeneration
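# Load the Otter-9B checkpoint; device_map="auto" lets Accelerate place the
# weights on the available GPU(s)/CPU automatically.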
model = OtterForConditionalGeneration.from_pretrained(
"luodian/otter-9b-hf", device_map="auto"
)
tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()
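# Fetch two in-context demo images and one query image from the COCO dataset.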
demo_image_one = Image.open(
requests.get(
"http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw
)
demo_image_two = Image.open(
requests.get(
"http://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
).raw
)
query_image = Image.open(
requests.get(
"http://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
).raw
)
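# Preprocess the three images and add frame and batch dimensions, giving vision_x
# the shape (batch=1, num_media=3, num_frames=1, channels, height, width).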
vision_x = (
image_processor.preprocess(
[demo_image_one, demo_image_two, query_image], return_tensors="pt"
)["pixel_values"]
.unsqueeze(1)
.unsqueeze(0)
)
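# Left-pad so generation continues directly from the end of the prompt.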
model.text_tokenizer.padding_side = "left"
# Build the interleaved prompt: each <image> token marks where an image is attended to,
# and two in-context (instruction, answer) pairs precede the query instruction.
lang_x = model.text_tokenizer(
    [
        "<image> User: what does the image describe? GPT: <answer> two cats sleeping. <|endofchunk|> <image> User: what does the image describe? GPT: <answer> a bathroom sink. <|endofchunk|> <image> User: what does the image describe? GPT: <answer>"
    ],
return_tensors="pt",
)
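# Generate up to 256 new tokens with greedy decoding (num_beams=1).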
generated_text = model.generate(
vision_x=vision_x.to(model.device),
lang_x=lang_x["input_ids"].to(model.device),
attention_mask=lang_x["attention_mask"].to(model.device),
max_new_tokens=256,
num_beams=1,
no_repeat_ngram_size=3,
)
print("Generated text: ", model.text_tokenizer.decode(generated_text[0]))
```
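Because the prompt is plain text, you can wrap the steps above in a small helper and swap in your own instruction for the query image while keeping the two in-context demonstrations. The sketch below is a minimal example that reuses the `model` and `vision_x` objects created above; the helper name `get_response` and its prompt template are illustrative, not part of the Otter API.

``` python
def get_response(instruction: str) -> str:
    # Hypothetical helper: keep the two in-context (instruction, answer) pairs
    # and append a new instruction for the third (query) image.
    prompt = (
        "<image> User: what does the image describe? GPT: <answer> two cats sleeping. <|endofchunk|> "
        "<image> User: what does the image describe? GPT: <answer> a bathroom sink. <|endofchunk|> "
        f"<image> User: {instruction} GPT: <answer>"
    )
    lang_x = model.text_tokenizer([prompt], return_tensors="pt")
    generated = model.generate(
        vision_x=vision_x.to(model.device),
        lang_x=lang_x["input_ids"].to(model.device),
        attention_mask=lang_x["attention_mask"].to(model.device),
        max_new_tokens=256,
        num_beams=1,
        no_repeat_ngram_size=3,
    )
    return model.text_tokenizer.decode(generated[0])


print(get_response("what color is the wall?"))
```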