---
license: mit
inference: false
pipeline_tag: image-to-text
tags:
- image-captioning
---
# FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

FuseCap is a framework for generating semantically rich image captions.

## Resources

- 💻 **Project Page**: For more details, visit the official [project page](https://rotsteinnoam.github.io/FuseCap/).

- 📝 **Read the Paper**: You can find the paper [here](https://arxiv.org/abs/2305.17718).

- 🚀 **Demo**: Try out our BLIP-based model [demo](https://huggingface.co/spaces/noamrot/FuseCap) trained using FuseCap.

- 📂 **Code Repository**: The code for FuseCap can be found in the [GitHub repository](https://github.com/RotsteinNoam/FuseCap).

- 🗃️ **Datasets**: The fused captions datasets can be accessed from [here](https://github.com/RotsteinNoam/FuseCap#datasets).
  
## Running the model

Our BLIP-based model can be run using the following code:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = BlipProcessor.from_pretrained("noamrot/FuseCap")
model = BlipForConditionalGeneration.from_pretrained("noamrot/FuseCap").to(device)

# Download the example image and convert it to RGB.
img_url = 'https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# The text prompt is the prefix that the generated caption continues.
text = "a picture of "
inputs = processor(raw_image, text, return_tensors="pt").to(device)

# Generate with beam search (3 beams) and decode the best sequence.
out = model.generate(**inputs, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))
```
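To caption several images at once, the same calls can be wrapped in a small helper. The following is only a sketch: `caption_images` is a hypothetical name that does not come from the FuseCap codebase, and it assumes `processor` and `model` are the objects loaded in the snippet above and that every image is an RGB `PIL.Image`. It reuses the same prompt and `num_beams=3` setting.

```python
def caption_images(images, processor, model, prompt="a picture of ", num_beams=3):
    """Caption a list of PIL images with one batched generate() call.

    Hypothetical helper (not part of the FuseCap repository). `processor`
    and `model` are expected to be the BlipProcessor and
    BlipForConditionalGeneration objects loaded as shown above.
    """
    if not images:
        return []

    # Repeat the same text prompt for every image so that all prompt
    # token sequences have the same length and batch without padding.
    inputs = processor(
        images=images,
        text=[prompt] * len(images),
        return_tensors="pt",
    ).to(model.device)

    # Beam search over the whole batch: one output sequence per image.
    out = model.generate(**inputs, num_beams=num_beams)

    # Decode each sequence and trim surrounding whitespace.
    captions = processor.batch_decode(out, skip_special_tokens=True)
    return [caption.strip() for caption in captions]
```

With the objects from the snippet above, `caption_images([raw_image], processor, model)` returns a one-element list holding the caption for the example image. Batch size is limited by available memory, so large image collections may need to be split into chunks.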

## Upcoming Updates

The official codebase, datasets and trained models for this project will be released soon.

## BibTeX

```bibtex
@inproceedings{rotstein2024fusecap,
  title={Fusecap: Leveraging large language models for enriched fused image captions},
  author={Rotstein, Noam and Bensa{\"\i}d, David and Brody, Shaked and Ganz, Roy and Kimmel, Ron},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={5689--5700},
  year={2024}
}
```