--- license: mit pipeline_tag: image-to-text --- # Mocha Checkpoint for BLIP-Base Model The official checkpoint of BLIP-Base model, finetuned on MS-COCO with the MOCHa RL framework, introduced in [Mitigating Open-Vocabulary Caption Hallucinations](https://arxiv.org/abs/2312.03631) [Project Page](https://assafbk.github.io/mocha/) ## Usage You can use this model for conditional and un-conditional image captioning ### Using the Pytorch model #### Running the model on CPU
Click to expand ```python import requests from PIL import Image from transformers import BlipProcessor, BlipForConditionalGeneration processor = BlipProcessor.from_pretrained(""moranyanuka/blip-image-captioning-base-mocha"") model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha") img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB') # conditional image captioning text = "a photography of" inputs = processor(raw_image, text, return_tensors="pt") out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) # unconditional image captioning inputs = processor(raw_image, return_tensors="pt") out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) ```
#### Running the model on GPU ##### In full precision
Click to expand ```python import requests from PIL import Image from transformers import BlipProcessor, BlipForConditionalGeneration processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha") model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha").to("cuda") img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB') # conditional image captioning text = "a photography of" inputs = processor(raw_image, text, return_tensors="pt").to("cuda") out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) # unconditional image captioning inputs = processor(raw_image, return_tensors="pt").to("cuda") out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) ```
##### In half precision (`float16`)
Click to expand ```python import torch import requests from PIL import Image from transformers import BlipProcessor, BlipForConditionalGeneration processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha") model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha", torch_dtype=torch.float16).to("cuda") img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB') # conditional image captioning text = "a photography of" inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16) out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) # >>> a photography of a woman and her dog on the beach # unconditional image captioning inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16) out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) >>> a woman sitting on the beach with a dog ```
bibtex: ``` @misc{benkish2024mitigating, title={Mitigating Open-Vocabulary Caption Hallucinations}, author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor}, year={2024}, eprint={2312.03631}, archivePrefix={arXiv}, primaryClass={cs.CV} } ```