---
language:
- en
license: apache-2.0
tags:
- llava
- vlm
---
The English Baichuan2-7B-Chat VLM trained via LoRA for [See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding](https://arxiv.org/abs/2406.11665).

Vision Encoder: [CLIP-L](https://huggingface.co/openai/clip-vit-large-patch14-336)

Base LLM: [Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)
Training Corpus:
- alignment: the corpus used by [LLaVA](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html)
- visual instruction tuning: the corpus used by [LLaVA](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html)
Alignment Script: https://github.com/amith-ananthram/mLLaVA/blob/main/scripts/v1_5/pretrain.sh

Visual Instruction Tuning Script: https://github.com/amith-ananthram/mLLaVA/blob/main/scripts/v1_5/finetune_lora.sh
Usage Example:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForVisualQuestionAnswering

# from constants.py, utils.py, included as files in this HF release
from constants import IMAGE_TOKEN_INDEX
from utils import tokenizer_image_token, process_images

device = torch.device('cuda')

# load the model and its CLIP vision tower
model = AutoModelForVisualQuestionAnswering.from_pretrained(
    'amitha/mllava.baichuan2-en', trust_remote_code=True
)
model.model.vision_tower.load_model()
model = model.eval().to(device)
image_processor = model.get_vision_tower().image_processor

# the tokenizer comes from the base LLM
tokenizer = AutoTokenizer.from_pretrained(
    'baichuan-inc/Baichuan2-7B-Chat', trust_remote_code=True
)

# <reserved_106>/<reserved_107> are Baichuan2-7B-Chat's user/assistant role tokens;
# the <image> placeholder is replaced by IMAGE_TOKEN_INDEX during tokenization
prompt = '<reserved_106><image>\nPlease describe this image.<reserved_107>'
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt'
)

# preprocess the image and record its original size
with Image.open('path/to/image.png') as img:
    images = process_images(
        [img.convert('RGB')], image_processor, model.config
    ).to(dtype=torch.float16)
    image_sizes = [img.size]

with torch.no_grad():
    output = model.generate(
        inputs=input_ids.unsqueeze(dim=0).to(device),
        attention_mask=torch.ones(input_ids.shape[0]).unsqueeze(dim=0).to(device),
        images=images.to(device),
        image_sizes=image_sizes
    )
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
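To ask a different question, only the text between the role tokens needs to change. A minimal sketch (the `build_prompt` helper is our own illustration, not part of this release), assuming the same single-image prompt layout as above:

```python
# Hypothetical helper: wrap a question in Baichuan2-7B-Chat's chat-role tokens.
# <reserved_106> opens the user turn, <reserved_107> opens the assistant turn;
# tokenizer_image_token swaps <image> for IMAGE_TOKEN_INDEX.
def build_prompt(question: str) -> str:
    return f'<reserved_106><image>\n{question}<reserved_107>'

prompt = build_prompt('What objects are in the foreground?')
```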