michelecafagna26 committed on
Commit
ff093cf
1 Parent(s): f34fd3d

Upload 2 files

Files changed (2)
  1. README.md +101 -0
  2. pytorch_model.pt +3 -0
README.md CHANGED
---
license: apache-2.0
tags:
- image-captioning
datasets:
- michelecafagna26/hl
language:
- en
metrics:
- sacrebleu
- rouge
library_name: transformers
---
## ClipCap fine-tuned for Scenes Image Captioning

[ClipCap](https://arxiv.org/abs/2111.09734) base trained on the [HL Dataset](https://huggingface.co/datasets/michelecafagna26/hl) for **high-level scene description generation**.

## Model fine-tuning 🏋️‍

We fine-tune the LM + Mapping Network starting from the model pretrained on COCO, with the following settings (a rough optimizer sketch follows the list):

- Trained for 9 epochs
- lr: 5e-5
- Adam optimizer
- half-precision (fp16)

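For reference, a minimal PyTorch sketch of how these settings could be wired together; `model`, `train_loader`, and `compute_loss` are hypothetical placeholders, not the actual training script used for this checkpoint.

```python
import torch

# Sketch only: `model` (ClipCap LM + mapping network), `train_loader`, and
# `compute_loss` are placeholders and not part of this repository.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler()        # enables fp16 mixed-precision training

for epoch in range(9):                      # trained for 9 epochs
    for prefix, tokens, mask in train_loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():     # half-precision forward pass
            outputs = model(tokens, prefix, mask)
            loss = compute_loss(outputs, tokens)
        scaler.scale(loss).backward()       # scaled backward to avoid fp16 underflow
        scaler.step(optimizer)
        scaler.update()
```
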
## Test set metrics 🧾

| CIDEr  | SacreBLEU | ROUGE-L |
|--------|-----------|---------|
| 145.93 | 36.73     | 42.83   |

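If you want to score your own generations, one way to compute the SacreBLEU and ROUGE-L numbers is the 🤗 `evaluate` package (assumed to be installed); the captions below are placeholders. CIDEr is typically computed separately, e.g. with `pycocoevalcap`.

```python
import evaluate

# Placeholder data: one generated caption and its reference caption(s) per image
predictions = ["the picture is taken in a park"]
references = [["people are relaxing in a public park"]]

sacrebleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

bleu_score = sacrebleu.compute(predictions=predictions, references=references)["score"]
rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]

print(f"SacreBLEU: {bleu_score:.2f}, ROUGE-L: {rouge_l * 100:.2f}")
```
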
## Demo

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1xcaJOxaAp8TRd8a6x1XnAptVjHQRv3Zj?usp=sharing)

## Installation

```bash
pip install git+https://github.com/michelecafagna26/CLIPCap.git
```

## Download the model

```bash
git lfs install # if not installed
git clone https://huggingface.co/michelecafagna26/clipcap-base-captioning-ft-hl-scenes
```

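Alternatively, the checkpoint can be fetched programmatically with `huggingface_hub` (assuming the package is installed), which avoids cloning the whole repository:

```python
from huggingface_hub import hf_hub_download

# Downloads (and caches) only the checkpoint file from this repository
model_path = hf_hub_download(
    repo_id="michelecafagna26/clipcap-base-captioning-ft-hl-scenes",
    filename="pytorch_model.pt",
)
print(model_path)
```
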
## Model in Action 🚀

```python
import clip
import requests
import torch
from clipcap import ClipCaptionModel
from PIL import Image
from transformers import GPT2Tokenizer

model_path = "clipcap-base-captioning-ft-hl-scenes/pytorch_model.pt"  # change accordingly

# load CLIP and the GPT-2 tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
prefix_length = 10

# load ClipCap
model = ClipCaptionModel(prefix_length, tokenizer=tokenizer)
model.from_pretrained(model_path)
model = model.eval()
model = model.to(device)

# load the image
img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl/--/default/train/0/image/image.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# extract the CLIP prefix and project it into the LM embedding space
image = preprocess(raw_image).unsqueeze(0).to(device)
with torch.no_grad():
    prefix = clip_model.encode_image(image).to(device, dtype=torch.float32)
    prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)

# generate the caption with beam search
model.generate_beam(embed=prefix_embed)[0]

# >> ""
```

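To caption several images, the prefix-extraction and generation steps above can be wrapped in a small helper. This is only a convenience sketch that reuses the objects already created in the snippet (`preprocess`, `clip_model`, `model`, `device`, `prefix_length`).

```python
def caption_image(raw_image):
    """Generate a caption for a PIL image, reusing the models loaded above."""
    image = preprocess(raw_image).unsqueeze(0).to(device)
    with torch.no_grad():
        prefix = clip_model.encode_image(image).to(device, dtype=torch.float32)
        prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)
    return model.generate_beam(embed=prefix_embed)[0]

urls = [img_url]  # extend with more image URLs as needed
for url in urls:
    img = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    print(caption_image(img))
```
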
## BibTeX and citation info

```BibTeX
```
pytorch_model.pt ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:b25da1e3aca001bc0a922e1f63f2069d3620198fff3030d3656ca974cbe9b2cd
size 636274141