---
language:
- en
license: apache-2.0
tags:
- llava
- vlm
---
The English Baichuan2-7B-Chat VLM trained via LoRA for [See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding](https://arxiv.org/abs/2406.11665).

Vision Encoder: [CLIP-L](https://huggingface.co/openai/clip-vit-large-patch14-336)

Base LLM: [Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)
Training Corpus:
- alignment: the corpus used by [LLaVA](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html)
- visual instruction tuning: the corpus used by [LLaVA](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html)
Alignment Script: https://github.com/amith-ananthram/mLLaVA/blob/main/scripts/v1_5/pretrain.sh

Visual Instruction Tuning Script: https://github.com/amith-ananthram/mLLaVA/blob/main/scripts/v1_5/finetune_lora.sh
Usage Example:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForVisualQuestionAnswering

# from constants.py, utils.py, included as files in this HF release
from constants import IMAGE_TOKEN_INDEX
from utils import tokenizer_image_token, process_images

device = torch.device('cuda')

# load the model and its CLIP vision tower
model = AutoModelForVisualQuestionAnswering.from_pretrained(
    'amitha/mllava.baichuan2-en', trust_remote_code=True
)
model.model.vision_tower.load_model()
model = model.eval().to(device)
image_processor = model.get_vision_tower().image_processor

# the tokenizer comes from the base LLM
tokenizer = AutoTokenizer.from_pretrained(
    'baichuan-inc/Baichuan2-7B-Chat', trust_remote_code=True
)

# <reserved_106>/<reserved_107> are Baichuan2-7B-Chat's user/assistant role tokens;
# the <image> placeholder is replaced by IMAGE_TOKEN_INDEX during tokenization
prompt = '<reserved_106><image>\nPlease describe this image.<reserved_107>'
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt'
)

# preprocess the image and record its original size
with Image.open('path/to/image.png') as img:
    images = process_images(
        [img.convert('RGB')], image_processor, model.config
    ).to(dtype=torch.float16)
    image_sizes = [img.size]

with torch.no_grad():
    output = model.generate(
        inputs=input_ids.unsqueeze(dim=0).to(device),
        attention_mask=torch.ones(input_ids.shape[0]).unsqueeze(dim=0).to(device),
        images=images.to(device),
        image_sizes=image_sizes
    )
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
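To ask a different question, only the text between the role tokens needs to change. A minimal sketch (the `build_prompt` helper is our own illustration, not part of this release), assuming the same single-image prompt layout as above:

```python
# Hypothetical helper: wrap a question in Baichuan2-7B-Chat's chat-role tokens.
# <reserved_106> opens the user turn, <reserved_107> opens the assistant turn;
# tokenizer_image_token swaps <image> for IMAGE_TOKEN_INDEX.
def build_prompt(question: str) -> str:
    return f'<reserved_106><image>\n{question}<reserved_107>'

prompt = build_prompt('What objects are in the foreground?')
```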