---
license: apache-2.0
tags:
- image-captioning
languages:
- en
pipeline_tag: image-to-text
datasets:
- michelecafagna26/hl
language:
- en
metrics:
- sacrebleu
- rouge
library_name: transformers
---

## GIT-base fine-tuned for Image Captioning on High-Level Descriptions of Actions

[GIT](https://arxiv.org/abs/2205.14100) base model fine-tuned on the [HL dataset](https://huggingface.co/datasets/michelecafagna26/hl) to generate **high-level action descriptions of images**.

## Model fine-tuning 🏋️

- Trained for 10 epochs
- Learning rate: 5e-5
- Optimizer: Adam
- Precision: half (fp16)

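For readers who want to set up a comparable run, the hyperparameters listed above could be gathered into a plain dictionary. This is a minimal sketch and not the author's training script: the key names mirror keyword arguments of `transformers.TrainingArguments`, and mapping "Adam" to `optim="adamw_torch"` is an assumption, since the card does not say which Adam variant was used, nor does it give a batch size, weight decay, or scheduler.

```python
# Hyperparameters reported in this card, gathered in one place.
# Key names mirror transformers.TrainingArguments keyword arguments.
hparams = {
    "num_train_epochs": 10,   # "Trained for 10 epochs"
    "learning_rate": 5e-5,    # "lr: 5e-5"
    "fp16": True,             # half-precision training (needs a CUDA GPU)
    "optim": "adamw_torch",   # ASSUMPTION: the card only says "Adam"
}

# Print the configuration so it can be logged alongside a run.
for name, value in hparams.items():
    print(f"{name} = {value}")
```

With `transformers` installed and a GPU available, the dictionary could be unpacked as `TrainingArguments(output_dir="out", **hparams)`, where `"out"` is a hypothetical output directory.
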
## Test set metrics 🧾

| CIDEr  | SacreBLEU | ROUGE-L |
|--------|-----------|---------|
| 110.63 | 15.21     | 30.45   |

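To give a feel for what the ROUGE-L column measures, here is a simplified sketch of the sentence-level ROUGE-L F1 score, based on the longest common subsequence (LCS) between a generated caption and a reference. It is an illustration only: it assumes lowercased whitespace tokenization and a single reference, it is not the scorer used to produce the numbers in the table, and the reference caption in the example is hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between two whitespace-tokenized strings."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # share of candidate tokens in the LCS
    recall = lcs / len(ref)       # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference caption, for illustration only.
score = rouge_l_f1("she is holding an umbrella",
                   "she is holding a red umbrella")
print(f"{100 * score:.2f}")  # prints 72.73
```

Here the LCS is "she is holding umbrella" (4 tokens), giving precision 4/5, recall 4/6, and an F1 of 8/11.
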
## Model in Action

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A bare name like this resolves to a local directory; when loading from
# the Hub, prefix it with the owner's namespace.
checkpoint = "git-base-captioning-ft-hl-actions"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl/--/default/train/0/image/image.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

# Sample one caption with top-k / nucleus sampling.
# (early_stopping is omitted: it only affects beam search.)
generated_ids = model.generate(pixel_values=pixel_values, max_length=50,
                               do_sample=True,
                               top_k=120,
                               top_p=0.9,
                               num_return_sequences=1)

# batch_decode returns a list with one string per generated sequence.
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
# Example output (sampling is stochastic, so captions vary between runs):
# she is holding an umbrella.
```

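The snippet above draws a caption by sampling, so repeated calls can return different text. When reproducible captions are preferred, beam search is a common alternative. The helper below is a sketch under stated assumptions: `caption_with_beam_search` is a hypothetical name, the beam width of 3 is an illustrative choice rather than a setting from this card, and `model` and `processor` are expected to be the objects loaded in the example above.

```python
def caption_with_beam_search(model, processor, image, device="cpu",
                             num_beams=3, max_length=50):
    """Return one caption for `image` using deterministic beam search."""
    # Preprocess the image and move the tensors to the model's device.
    inputs = processor(images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        pixel_values=inputs.pixel_values,
        num_beams=num_beams,      # ASSUMPTION: illustrative beam width
        max_length=max_length,
        do_sample=False,          # no sampling, so repeated calls agree
        early_stopping=True,      # stop once enough finished beams exist
    )
    # batch_decode returns one string per sequence; keep the first.
    captions = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return captions[0].strip()
```

It could be called as `caption_with_beam_search(model, processor, raw_image, device=device)`.
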
## BibTeX and citation info

```BibTeX
```