|
--- |
|
language: |
|
- en |
|
license: apache-2.0 |
|
tags: |
|
- llava |
|
- vlm |
|
--- |
|
|
|
The English Baichuan2-7B-Chat VLM trained via LORA for [See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding](https://arxiv.org/abs/2406.11665). |
|
|
|
Vision Encoder: [CLIP-L](https://huggingface.co/openai/clip-vit-large-patch14-336) |
|
|
|
Base LLM: [Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) |
|
|
|
Training Corpus: |
|
- alignment: the corpus used by [LLAVA](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html) |
|
- visual instruction tuning: the corpus used by [LLAVA](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html) |
|
|
|
Alignment Script: https://github.com/amith-ananthram/mLLaVA/blob/main/scripts/v1_5/pretrain.sh |
|
|
|
Visual Instruction Tuning Script: https://github.com/amith-ananthram/mLLaVA/blob/main/scripts/v1_5/finetune_lora.sh |
|
|
|
Usage Example: |
|
|
|
import torch |
|
from PIL import Image |
|
from transformers import AutoTokenizer, AutoModelForVisualQuestionAnswering |
|
|
|
# from constants.py, utils.py, included as files in this HF release |
|
from constants import IMAGE_TOKEN_INDEX |
|
from utils import tokenizer_image_token, process_images |
|
|
|
device = torch.device('cuda') |
|
|
|
# load model and vision tower |
|
model = AutoModelForVisualQuestionAnswering.from_pretrained('amitha/mllava.baichuan2-en', trust_remote_code=True) |
|
model.model.vision_tower.load_model() |
|
model = model.eval().to(device) |
|
|
|
image_processor = model.get_vision_tower().image_processor |
|
tokenizer = AutoTokenizer.from_pretrained('baichuan-inc/Baichuan2-7B-Chat', trust_remote_code=True) |
|
|
|
prompt = '<reserved_106><image>\nPlease describe this image.<reserved_107>' |
|
|
|
input_ids = tokenizer_image_token( |
|
prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt" |
|
) |
|
with Image.open("path/to/image.png") as img: |
|
images = process_images( |
|
[img.convert('RGB')], image_processor, model.config |
|
).to(dtype=torch.float16) |
|
image_sizes = [img.size] |
|
|
|
with torch.no_grad(): |
|
output = model.generate( |
|
inputs=input_ids.unsqueeze(dim=0).to(device), |
|
attention_mask=torch.ones(input_ids.shape[0]).unsqueeze(dim=0).to(device), |
|
images=images.to(device), |
|
image_sizes=image_sizes |
|
) |
|
|
|
print(tokenizer.batch_decode(output, skip_special_tokens=True)) |
|
|