mPLUG
/

mPLUG-Owl3-1B-241014

Visual Question Answering

Model card Files Files and versions Community

mPLUG-Owl3-1B-241014 / README.md

Mizukiluke's picture

Upload 16 files

b51625d verified 4 months ago

|

history blame contribute delete

3.73 kB

	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: visual-question-answering
	tags:
	- chat
	---

	# mPLUG-Owl3

	## Introduction
	mPLUG-Owl3 is a state-of-the-art multi-modal large language model designed to tackle the challenges of long image sequence understanding. We propose Hyper Attention, which boosts the speed of long visual sequence understanding in multimodal large language models by sixfold, allowing for processing of visual sequences that are eight times longer. Meanwhile, we maintain excellent performance on single-image, multi-image, and video tasks.

	Github: [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl)

	## Quickstart

	Load the mPLUG-Owl3. We now only support attn_implementation in ```['sdpa', 'flash_attention_2']```.
	```Python


	import torch
	from transformers import AutoConfig, AutoModel
	model_path = 'mPLUG/mPLUG-Owl3-1B-241014'
	config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
	print(config)
	# model = mPLUGOwl3Model(config).cuda().half()
	model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa', torch_dtype=torch.half, trust_remote_code=True)
	model.eval().cuda()
	```
	Chat with images.
	```Python
	from PIL import Image

	from transformers import AutoTokenizer, AutoProcessor
	from decord import VideoReader, cpu
	model_path = 'mPLUG/mPLUG-Owl3-1B-241014'
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	processor = model.init_processor(tokenizer)

	image = Image.new('RGB', (500, 500), color='red')

	messages = [
	{"role": "user", "content": """<\|image\|>
	Describe this image."""},
	{"role": "assistant", "content": ""}
	]

	inputs = processor(messages, images=[image], videos=None)

	inputs.to('cuda')
	inputs.update({
	'tokenizer': tokenizer,
	'max_new_tokens':100,
	'decode_text':True,
	})


	g = model.generate(**inputs)
	print(g)
	```

	Chat with a video.
	```Python
	from PIL import Image

	from transformers import AutoTokenizer, AutoProcessor
	from decord import VideoReader, cpu # pip install decord
	model_path = 'mPLUG/mPLUG-Owl3-1B-241014'
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	processor = model.init_processor(tokenizer)


	messages = [
	{"role": "user", "content": """<\|video\|>
	Describe this video."""},
	{"role": "assistant", "content": ""}
	]

	videos = ['/nas-mmu-data/examples/car_room.mp4']

	MAX_NUM_FRAMES=16

	def encode_video(video_path):
	def uniform_sample(l, n):
	gap = len(l) / n
	idxs = [int(i * gap + gap / 2) for i in range(n)]
	return [l[i] for i in idxs]

	vr = VideoReader(video_path, ctx=cpu(0))
	sample_fps = round(vr.get_avg_fps() / 1) # FPS
	frame_idx = [i for i in range(0, len(vr), sample_fps)]
	if len(frame_idx) > MAX_NUM_FRAMES:
	frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
	frames = vr.get_batch(frame_idx).asnumpy()
	frames = [Image.fromarray(v.astype('uint8')) for v in frames]
	print('num frames:', len(frames))
	return frames
	video_frames = [encode_video(_) for _ in videos]
	inputs = processor(messages, images=None, videos=video_frames)

	inputs.to('cuda')
	inputs.update({
	'tokenizer': tokenizer,
	'max_new_tokens':100,
	'decode_text':True,
	})


	g = model.generate(**inputs)
	print(g)
	```


	## Citation

	If you find our work helpful, feel free to give us a cite.

	```
	@misc{ye2024mplugowl3longimagesequenceunderstanding,
	title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
	author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
	year={2024},
	eprint={2408.04840},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2408.04840},
	}
	```