# MOCHa Checkpoint for BLIP-Base Model
The official checkpoint of the BLIP-base model, fine-tuned on MS-COCO with the MOCHa RL framework, introduced in [Mitigating Open-Vocabulary Caption Hallucinations](https://arxiv.org/abs/2312.03631).
## Usage
You can use this model for conditional and unconditional image captioning.
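If you just want captions without handling the processor and model objects yourself, the generic `image-to-text` pipeline from `transformers` should also work with this checkpoint. This is a minimal sketch, not part of the original usage examples:

```python
from transformers import pipeline

# Load the MOCHa-finetuned BLIP-base checkpoint through the image-to-text pipeline
captioner = pipeline("image-to-text", model="moranyanuka/blip-image-captioning-base-mocha")

# The pipeline accepts an image URL, a local path, or a PIL image
print(captioner("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"))
# e.g. [{'generated_text': '...'}]
```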
### Using the PyTorch model
#### Running the model on CPU
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
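`model.generate` accepts the standard `transformers` generation arguments, so you can trade decoding speed for caption quality. The values below are illustrative, not the settings used in the MOCHa paper:

```python
# Continuing from the CPU example above: beam search with a cap on caption length.
out = model.generate(**inputs, num_beams=3, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```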
#### Running the model on GPU
##### In full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha").to("cuda")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
##### In half precision (`float16`)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha", torch_dtype=torch.float16).to("cuda")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog on the beach
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with a dog
```
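The processor also accepts a list of images, so several captions can be produced in a single forward pass. A minimal sketch continuing the half-precision setup above; the demo image is duplicated purely for illustration:

```python
# Continuing from the half-precision example above: batched captioning.
images = [raw_image, raw_image]  # replace with your own PIL images
inputs = processor(images=images, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.batch_decode(out, skip_special_tokens=True))
```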
BibTeX:

```bibtex
@misc{benkish2024mitigating,
      title={Mitigating Open-Vocabulary Caption Hallucinations},
      author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
      year={2024},
      eprint={2312.03631},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```