---
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
tags:
- llava
- phi
license: mit
library_name: transformers
widget:
- text: "What animal is it?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
- text: "Where is it?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
---

# Multi-crop LLaVA-3b

## Model details

Usually, in LLaVA models, we generate N embeddings for the image, which we then combine with text embeddings and send to the LLM. But what if, instead of creating N tokens for one image, we create K < N tokens for each of M crops of the image? This multi-crop approach lets the model pick up visual details from small parts of the image without inflating the total number of image tokens. A toy sketch of the idea is included in the appendix at the end of this card.

## Prompt format

The model uses the ChatML prompt format (a small helper for building such prompts is also sketched in the appendix):

```
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

## How to use

```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
import torch

model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", torch_dtype=torch.float16, trust_remote_code=True).to("cuda")
processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)

# Build a ChatML prompt (see "Prompt format" above) and load an example image.
prompt = "<|im_start|>user\nWhat animal is it?<|im_end|>\n<|im_start|>assistant\n"
url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

with torch.inference_mode():
    inputs = processor(prompt, [raw_image], model, max_crops=100, num_tokens=728)
    output = model.generate(**inputs, max_new_tokens=200, use_cache=True, do_sample=False,
                            eos_token_id=processor.tokenizer.eos_token_id,
                            pad_token_id=processor.tokenizer.eos_token_id)

result = processor.tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", "")
print(result)
```

## Benchmarks

- TextVQA - 50.9%
- GQA - 59.5%
- VQAv2 - 76.72%
- VizWiz - 32.68%
- V*-bench: OCR - 56.66%, GPT4V-hard - 52.94%, direct attributes - 40.86%, relative position - 56.57%

## Examples

Example queries and model outputs are available in the accompanying Colab notebook.

## License

The model is licensed under the MIT license. However, since the data used for training is largely synthetic, you should also follow the OpenAI and Google Gemini terms of service, which means you should not use the model to create models that compete with OpenAI or Google.

## Acknowledgments

Thanks to [Lambda](https://lambdalabs.com/) for providing a machine to train the model.

Thanks to [ML Collective](https://mlcollective.org/) for continuous support and for providing compute resources for testing the model.
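
## Appendix: illustrative code sketches

Below is a toy sketch of the multi-crop idea from the "Model details" section. It is illustrative only: the grid-cropping helper, the random stand-in encoder, and the values of `K` and `D` are assumptions for demonstration, not the model's actual implementation.

```python
# Toy illustration of multi-crop encoding: M crops with K tokens each,
# instead of N tokens for the whole image. All names and sizes here are
# illustrative assumptions, not MC-LLaVA's real code.
import torch
from PIL import Image

def make_crops(image, grid=3, crop_size=336):
    """Cut a grid x grid set of equally spaced square crops from the image."""
    w, h = image.size
    xs = [round(i * (w - crop_size) / max(grid - 1, 1)) for i in range(grid)]
    ys = [round(j * (h - crop_size) / max(grid - 1, 1)) for j in range(grid)]
    return [image.crop((x, y, x + crop_size, y + crop_size)) for y in ys for x in xs]

K, D = 8, 1152  # tokens per crop and embedding width (assumed values)

def encode_crop(crop):
    # Stand-in for a real vision tower: returns K embeddings per crop.
    return torch.randn(K, D)

image = Image.new("RGB", (1024, 1024))  # placeholder image
crops = make_crops(image)               # M = grid * grid crops
image_tokens = torch.cat([encode_crop(c) for c in crops], dim=0)
print(image_tokens.shape)  # (M * K, D); these are combined with text embeddings
```

In the real model, the `max_crops` and `num_tokens` arguments of the processor call in "How to use" appear to control the analogous quantities.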
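
And here is a small helper for producing prompts in the format shown in "Prompt format". The function name and the optional system-message branch are assumptions for illustration; the card itself only specifies the user and assistant turns.

```python
def build_chatml_prompt(question, system=None):
    """Format a question (and optional system message) as a ChatML prompt."""
    parts = []
    if system is not None:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>")
    parts.append(f"<|im_start|>user\n{question}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(build_chatml_prompt("What animal is it?"))
```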