File size: 3,888 Bytes
8d93b90 018272b 8543daa 8d93b90 e322cc5 be5b285 e322cc5 be79c08 e322cc5 be79c08 e322cc5 be79c08 e322cc5 fc6cc98 e322cc5 38ead43 e322cc5 be5b285 e322cc5 be5b285 e322cc5 be5b285 d38a495 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 |
---
license: mit
pipeline_tag: image-to-text
---
# Mocha Checkpoint for BLIP-Base Model
The official checkpoint of BLIP-Base model, finetuned on MS-COCO with the MOCHa RL framework, introduced in [Mitigating Open-Vocabulary Caption Hallucinations](https://arxiv.org/abs/2312.03631)
[Project Page](https://assafbk.github.io/mocha/)
## Usage
You can use this model for conditional and un-conditional image captioning
### Using the Pytorch model
#### Running the model on CPU
<details>
<summary> Click to expand </summary>
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained(""moranyanuka/blip-image-captioning-base-mocha"")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>
#### Running the model on GPU
##### In full precision
<details>
<summary> Click to expand </summary>
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha").to("cuda")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>
##### In half precision (`float16`)
<details>
<summary> Click to expand </summary>
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha", torch_dtype=torch.float16).to("cuda")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog on the beach
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on the beach with a dog
```
</details>
bibtex:
```
@misc{benkish2024mitigating,
title={Mitigating Open-Vocabulary Caption Hallucinations},
author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
year={2024},
eprint={2312.03631},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
``` |