---
language:
- en
license: apache-2.0
tags:
- llava
- vlm
---

The English Baichuan2-7B-Chat VLM trained via LoRA for [See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding](https://arxiv.org/abs/2406.11665).

Vision Encoder: [CLIP-L](https://huggingface.co/openai/clip-vit-large-patch14-336)

Base LLM: [Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)

Training Corpus:
- alignment: the corpus used by [LLaVA](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html)
- visual instruction tuning: the corpus used by [LLaVA](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html)

Alignment Script: https://github.com/amith-ananthram/mLLaVA/blob/main/scripts/v1_5/pretrain.sh

Visual Instruction Tuning Script: https://github.com/amith-ananthram/mLLaVA/blob/main/scripts/v1_5/finetune_lora.sh

Usage Example:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForVisualQuestionAnswering

# from constants.py and utils.py, included as files in this HF release
from constants import IMAGE_TOKEN_INDEX
from utils import tokenizer_image_token, process_images

device = torch.device('cuda')

# load the model and its vision tower
model = AutoModelForVisualQuestionAnswering.from_pretrained(
    'amitha/mllava.baichuan2-en', trust_remote_code=True
)
model.model.vision_tower.load_model()
model = model.eval().to(device)
image_processor = model.get_vision_tower().image_processor

# the tokenizer comes from the base LLM
tokenizer = AutoTokenizer.from_pretrained(
    'baichuan-inc/Baichuan2-7B-Chat', trust_remote_code=True
)

# the <image> placeholder marks where the image features are spliced in
prompt = '<image>\nPlease describe this image.'
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
)

# preprocess the image for the vision tower
with Image.open("path/to/image.png") as img:
    images = process_images(
        [img.convert('RGB')], image_processor, model.config
    ).to(dtype=torch.float16)
    image_sizes = [img.size]

with torch.no_grad():
    output = model.generate(
        inputs=input_ids.unsqueeze(dim=0).to(device),
        attention_mask=torch.ones(input_ids.shape[0]).unsqueeze(dim=0).to(device),
        images=images.to(device),
        image_sizes=image_sizes
    )

print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
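
If `constants.py` and `utils.py` are not already importable (for example, when you have not cloned this repository locally), one way to fetch them is via `huggingface_hub`. This is a minimal sketch, assuming both files sit at the root of this model repo:

```python
import os
import sys
from huggingface_hub import hf_hub_download

# download the helper modules bundled with this release
# (assumes constants.py and utils.py live at the repo root)
for filename in ("constants.py", "utils.py"):
    path = hf_hub_download(repo_id="amitha/mllava.baichuan2-en", filename=filename)

# both files land in the same cached snapshot directory; make it importable
sys.path.append(os.path.dirname(path))
```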