	---
	license: mit
	language:
	- en
	library_name: transformers
	---

	# Model Card for MMICL

	# News 🚀
	1. [09-19] We have moved the MMICL demo to a permanent link: [Demo for MMICL](http://www.testmmicl.work). The Vicuna version of MMICL and Chat Mode are still under development; they may require careful tuning of generation parameters and may not work correctly.
	2. [09-15] Our [paper](https://arxiv.org/abs/2309.07915) has been uploaded to arXiv.
	3. [09-01] The [MIC](https://huggingface.co/datasets/BleachNick/MIC_full) data has been released on the Hugging Face Hub.
	4. [08-23] Reached 1st place on [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) and 1st place on [MMBench](https://opencompass.org.cn/leaderboard-multimodal).
	5. [08-21] The [MMICL-FLANT5XXL](https://huggingface.co/BleachNick/MMICL-Instructblip-T5-xxl) and [MMICL-Tiny](https://huggingface.co/BleachNick/MMICL-Instructblip-T5-xl) models have been released on the Hugging Face Hub.

	## Temporary Demo for MMICL
	[Playground for MMICL-FLANT5XXL](http://www.testmmicl.work/)
	supports multi-image input as well as video input.

	## Model Details
	MMICL (Multi-Modal In-Context Learning) is a multimodal vision-language model that incorporates BLIP-2/InstructBLIP.
	It can analyze and understand multiple images and follow instructions.


	### Model Description
	MMICL outperforms vision-language models of the same size and performs exceptionally well on complex visual reasoning datasets.
	As of August 21, 2023, it achieves state-of-the-art performance on both multimodal task leaderboards and a wide range of vision-language tasks.
	Furthermore, it showcases new capabilities in video understanding and multimodal in-context learning (M-ICL).
	+ Can refer to and reason over multiple images

	+ Manually constructed in-context instruction tuning dataset

	+ As of August 21, 2023: 1st on [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), 1st on [MMBench](https://opencompass.org.cn/leaderboard-multimodal)

	+ Visual Encoder: ViT-L from CLIP / ViT-G/14 from EVA-CLIP

	+ Pre-trained LLM: FlanT5-XL / FlanT5-XXL / Vicuna-7B / Vicuna-13B



	- Developed by: [More Information Needed]
	- License: MIT
	- Finetuned from model: [instructblip-flan-t5-xxl](https://huggingface.co/Salesforce/instructblip-flan-t5-xxl)


	- Repository: [MMICL](https://github.com/HaozheZhao/MIC)


	## How to Get Started with the Model
	The images used in the example below are available in our GitHub repo [MMICL](https://github.com/HaozheZhao/MIC).
	```
	# For T5-based models
	# `model.instructblip` is assumed to be the local `model/` package from the MMICL GitHub repo.
	from model.instructblip import InstructBlipConfig, InstructBlipForConditionalGeneration, InstructBlipProcessor
	from PIL import Image
	import torch

	model_type = "instructblip"
	model_ckpt = "BleachNick/MMICL-Instructblip-T5-xxl"
	processor_ckpt = "Salesforce/instructblip-flan-t5-xxl"
	config = InstructBlipConfig.from_pretrained(model_ckpt)

	if 'instructblip' in model_type:
	    model = InstructBlipForConditionalGeneration.from_pretrained(
	        model_ckpt,
	        config=config).to('cuda:0', dtype=torch.bfloat16)

	# Register the image placeholder and the <imageN> tags as special tokens.
	image_placeholder = "图"
	sp = [image_placeholder] + [f"<image{i}>" for i in range(20)]
	processor = InstructBlipProcessor.from_pretrained(processor_ckpt)
	sp = sp + processor.tokenizer.additional_special_tokens[len(sp):]
	processor.tokenizer.add_special_tokens({'additional_special_tokens': sp})
	# Resize the Q-Former embeddings if their size does not match the Q-Former tokenizer.
	if model.qformer.embeddings.word_embeddings.weight.shape[0] != len(processor.qformer_tokenizer):
	    model.qformer.resize_token_embeddings(len(processor.qformer_tokenizer))
	# Repeat the placeholder 32 times for each image slot in the prompt.
	replace_token = "".join(32 * [image_placeholder])

	image = Image.open("images/cal_num1.png")
	image1 = Image.open("images/cal_num2.png")
	image2 = Image.open("images/cal_num3.png")
	images = [image, image1, image2]

	prompt = [f'Use the image 0: <image0>{replace_token},image 1: <image1>{replace_token} and image 2: <image2>{replace_token} as a visual aid to help you calculate the equation accurately. image 0 is 2+1=3.\nimage 1 is 5+6=11.\nimage 2 is"']
	prompt = " ".join(prompt)

	inputs = processor(images=images, text=prompt, return_tensors="pt")

	inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
	# One mask entry per image (all ones here).
	inputs['img_mask'] = torch.tensor([[1 for _ in range(len(images))]])
	# Add a leading batch dimension to the stacked images.
	inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)

	inputs = inputs.to('cuda:0')
	outputs = model.generate(
	    pixel_values=inputs['pixel_values'],
	    input_ids=inputs['input_ids'],
	    attention_mask=inputs['attention_mask'],
	    img_mask=inputs['img_mask'],
	    do_sample=False,
	    max_length=50,
	    min_length=1,
	    set_min_padding_size=False,
	)
	generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
	print(generated_text)
	# output: 3x6=18"


	```
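
The prompt layout in the example above (an `<imageN>` tag followed by 32 copies of the placeholder token for each image) can be factored into a small helper so that prompts for a different number of images are less error-prone to write. The sketch below is plain Python and only reproduces that string layout. The names `image_slot`, `build_prompt`, and `build_img_mask`, and the `{img0}`-style template fields, are hypothetical and not part of the MMICL code base.

```python
from typing import List

# Values taken from the example above; the helper names are illustrative only.
IMAGE_PLACEHOLDER = "图"   # placeholder token registered as a special token
TOKENS_PER_IMAGE = 32      # the example repeats the placeholder 32 times


def image_slot(index: int) -> str:
    """Return the text slot for one image: its <imageN> tag plus placeholders."""
    return f"<image{index}>" + IMAGE_PLACEHOLDER * TOKENS_PER_IMAGE


def build_prompt(template: str, num_images: int) -> str:
    """Fill the {img0}, {img1}, ... fields of `template` with image slots."""
    slots = {f"img{i}": image_slot(i) for i in range(num_images)}
    return template.format(**slots)


def build_img_mask(num_images: int) -> List[List[int]]:
    """Nested list for a single sample with one entry (1) per image."""
    return [[1] * num_images]


template = "image 0: {img0}, image 1: {img1}. What differs between them?"
prompt = build_prompt(template, num_images=2)
img_mask = build_img_mask(2)
```

In the example above, `prompt` would be passed to the processor as `text=prompt`, and `img_mask` would be wrapped in `torch.tensor(...)` before calling `generate`.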

	#### Training Hyperparameters

	- Training regime: [fp32, bf16 mixed precision, bf16 non-mixed precision] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->