---
language:
- en
---

# Emu2-Chat

[Paper](https://arxiv.org/abs/2312.13286) | [🤗HF Demo](https://huggingface.co/spaces/BAAI/Emu2) | [Demo](https://emu.ssi.plus) | [Project Page](https://baaivision.github.io/emu2/) | [GitHub](https://github.com/baaivision/Emu)

## Model Weights

| Model name    | Weight                                              |
| ------------- | --------------------------------------------------- |
| **Emu2**      | [🤗 HF link](https://huggingface.co/BAAI/Emu2)      |
| **Emu2-Chat** | [🤗 HF link](https://huggingface.co/BAAI/Emu2-Chat) |
| **Emu2-Gen**  | [🤗 HF link](https://huggingface.co/BAAI/Emu2-Gen)  |

## Inference (Hugging Face Version)

#### Single GPU

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2-Chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).to('cuda').eval()

# `[]` is the image placeholder which will be replaced by image embeddings.
# The number of `[]` must equal the number of input images.
query = '[]Describe the image in details:'
image = Image.open(
    requests.get(
        'https://github.com/baaivision/Emu/blob/main/Emu2/examples/blue_black_1_top_left.jpg?raw=true',
        stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Interleaved image and text:

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2-Chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).to('cuda').eval()

# `[]` is the image placeholder which will be replaced by image embeddings.
# The number of `[]` must equal the number of input images.
query = "[][red, white, 3, bottom left].[][yellow, white, 2, top left].[][green, black, 4, bottom right][]"

urls = [
    'https://github.com/baaivision/Emu/blob/main/Emu2/examples/red_white_3_bottom_left.jpg?raw=true',
    'https://github.com/baaivision/Emu/blob/main/Emu2/examples/yellow_white_2_top_right.jpg?raw=true',
    'https://github.com/baaivision/Emu/blob/main/Emu2/examples/green_black_4_bottom_right.jpg?raw=true',
    'https://github.com/baaivision/Emu/blob/main/Emu2/examples/blue_black_1_top_left.jpg?raw=true',
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert('RGB')
    for url in urls
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
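The `[]` placeholder convention is easy to get wrong once several images are interleaved. As a minimal sanity check (a hypothetical helper, not part of the Emu2 API), you can verify the placeholder count before building input ids:

```python
# Hypothetical helper (not part of the Emu2 API): fail fast on mismatches.
def check_placeholders(query: str, images: list) -> None:
    """Raise early if the number of `[]` placeholders does not match the images."""
    n = query.count('[]')
    if n != len(images):
        raise ValueError(
            f"query contains {n} `[]` placeholders, "
            f"but {len(images)} images were provided")

check_placeholders(query, images)  # call this before model.build_input_ids(...)
```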
#### Multi GPU

The snippets below dispatch a local copy of the checkpoint across two GPUs. Replace `local/path/to/hf/version/Emu2-Chat/model` with the path to your local copy of the 🤗 weights (obtainable with, for example, `huggingface_hub.snapshot_download`).

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2-Chat",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

device_map = infer_auto_device_map(
    model,
    max_memory={0: '38GiB', 1: '38GiB'},
    no_split_module_classes=['Block', 'LlamaDecoderLayer'])
# Input and output logits should be on the same device.
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/Emu2-Chat/model',
    device_map=device_map).eval()

# `[]` is the image placeholder which will be replaced by image embeddings.
# The number of `[]` must equal the number of input images.
query = '[]Describe the image in details:'
image = Image.open(
    requests.get(
        'https://github.com/baaivision/Emu/blob/main/Emu2/examples/blue_black_1_top_left.jpg?raw=true',
        stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Interleaved image and text:

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2-Chat",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

device_map = infer_auto_device_map(
    model,
    max_memory={0: '38GiB', 1: '38GiB'},
    no_split_module_classes=['Block', 'LlamaDecoderLayer'])
# Input and output logits should be on the same device.
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/Emu2-Chat/model',
    device_map=device_map).eval()

# `[]` is the image placeholder which will be replaced by image embeddings.
# The number of `[]` must equal the number of input images.
query = "[][red, white, 3, bottom left].[][yellow, white, 2, top left].[][green, black, 4, bottom right][]"

urls = [
    'https://github.com/baaivision/Emu/blob/main/Emu2/examples/red_white_3_bottom_left.jpg?raw=true',
    'https://github.com/baaivision/Emu/blob/main/Emu2/examples/yellow_white_2_top_right.jpg?raw=true',
    'https://github.com/baaivision/Emu/blob/main/Emu2/examples/green_black_4_bottom_right.jpg?raw=true',
    'https://github.com/baaivision/Emu/blob/main/Emu2/examples/blue_black_1_top_left.jpg?raw=true',
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert('RGB')
    for url in urls
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

#### Quantization

See the quantization guidance in the [transformers documentation](https://huggingface.co/docs/transformers/v4.28.0/main_classes/quantization).

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2-Chat",
    load_in_4bit=True,
    trust_remote_code=True,
    bnb_4bit_compute_dtype=torch.float16).eval()

query = '[]Describe the image in details:'
image = Image.open(
    requests.get(
        'https://github.com/baaivision/Emu/blob/main/Emu2/examples/blue_black_1_top_left.jpg?raw=true',
        stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.float16),  # should be torch.float16 for the 4-bit model
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
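Note that recent transformers releases deprecate passing `load_in_4bit` and `bnb_4bit_compute_dtype` directly to `from_pretrained` in favor of an explicit `BitsAndBytesConfig`. A rough sketch of the equivalent loading call with the same settings (the rest of the snippet above is unchanged):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: the same 4-bit settings as above, expressed as an explicit
# quantization config rather than bare keyword arguments.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16)

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2-Chat",
    quantization_config=quantization_config,
    trust_remote_code=True).eval()
```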
## Citation

If you find Emu2 useful for your research and applications, please consider starring this repository and citing:

```
@article{Emu2,
  title={Generative Multimodal Models are In-Context Learners},
  author={Quan Sun and Yufeng Cui and Xiaosong Zhang and Fan Zhang and Qiying Yu and Zhengxiong Luo and Yueze Wang and Yongming Rao and Jingjing Liu and Tiejun Huang and Xinlong Wang},
  journal={arXiv preprint arXiv:2312.13286},
  year={2023}
}
```