4rtemi5 committed
Commit: 3f830ea
Parents (2): a9e905c, 944631b

Merge branch 'main' of https://huggingface.co/spaces/clip-italian/clip-italian-demo


# Conflicts:
# examples.py
# home.py
# image2text.py
# introduction.md
# text2image.py

Files changed (3)
  1. examples.py +13 -14
  2. introduction.md +12 -9
  3. static/img/table_captions.png +0 -0
examples.py CHANGED
@@ -20,37 +20,36 @@ def app():
     st.markdown("### 1. Actors in Scenes")
     st.markdown("These examples were taken from the CC dataset")
 
-    st.subheader("una coppia")
-    st.markdown("*a couple*")
+    st.subheader("Una coppia")
+    st.markdown("*A couple*")
     st.image("static/img/examples/couple_0.jpeg")
 
     col1, col2 = st.beta_columns(2)
-    col1.subheader("una coppia con il tramonto sullo sfondo")
-    col1.markdown("*a couple with the sunset in the background*")
+    col1.subheader("Una coppia con il tramonto sullo sfondo")
+    col1.markdown("*A couple with the sunset in the background*")
     col1.image("static/img/examples/couple_1.jpeg")
 
-    col2.subheader("una coppia che passeggia sulla spiaggia")
-    col2.markdown("*a couple walking on the beach*")
+    col2.subheader("Una coppia che passeggia sulla spiaggia")
+    col2.markdown("*A couple walking on the beach*")
     col2.image("static/img/examples/couple_2.jpeg")
 
-    st.subheader("una coppia che passeggia sulla spiaggia al tramonto")
-    st.markdown("*a couple walking on the beach at sunset*")
+    st.subheader("Una coppia che passeggia sulla spiaggia al tramonto")
+    st.markdown("*A couple walking on the beach at sunset*")
     st.image("static/img/examples/couple_3.jpeg")
 
     st.markdown("### 2. Dresses")
     st.markdown("These examples were taken from the Unsplash dataset")
 
     col1, col2 = st.beta_columns(2)
-    col1.subheader("un vestito primavrile")
-    col1.markdown("*a dress for the spring*")
+    col1.subheader("Un vestito primaverile")
+    col1.markdown("*A dress for the spring*")
     col1.image("static/img/examples/vestito1.png")
 
-    col2.subheader("un vestito autunnale")
-    col2.markdown("*a dress for the autumn*")
+    col2.subheader("Un vestito autunnale")
+    col2.markdown("*A dress for the autumn*")
     col2.image("static/img/examples/vestito_autunnale.png")
 
-    #st.markdown("## Image Classification")
-    st.markdown("<h2 style='text-align: center; color: #008C45; font-weight:bold;'> Zero Shot Image Classification </h2>", unsafe_allow_html=True)
+    st.markdown("## Image Classification")
     st.markdown("We report this cool example provided by the "
                 "[DALLE-mini team](https://github.com/borisdayma/dalle-mini). "
                 "Is the DALLE-mini logo an *avocado* or an armchair (*poltrona*)?")
introduction.md CHANGED
@@ -36,6 +36,7 @@ different applications that can start from here.
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian.
 We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
 To get competitive results, we followed three strategies:
+
 1. more and better data;
 2. better augmentations;
 3. better training strategies.
@@ -82,7 +83,7 @@ Each photo comes along with an Italian caption.
 
 Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** of the data was the main
 reason of this choice. With the few images (wrt OpenAI) that we have, we cannot risk polluting our own data. CC is a great resource,
-but the captions have to be handled accordingly. We translated 700K captions and we evaluated their quality:
+but the captions have to be handled accordingly. We translated 700K captions and we evaluated their quality.
 
 Three of us looked at a sample of 100 of the translations and rated them with scores from 1 to 4.
 The meaning of the value is as follows: 1, the sentence has lost is meaning, or it's not possible to understand it; 2, it is possible to get the idea
@@ -97,6 +98,8 @@ weighting - of 0.858 (great agreement!).
 | person walking down the aisle | persona che cammina lungo la navata |
 | popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |
 
+_If the table above doesn't show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_captions.png)._
+
 We know that we annotated our own data; in the spirit of fairness we also share the annotations and the captions so
 that those interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
 
@@ -192,7 +195,7 @@ described by the original caption. As evaluation metrics we use the MRR@K.
 | MRR@5 | **0.5039** | 0.3957|
 | MRR@10 | **0.5204** | 0.4129|
 
-_If the table above does not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._
+_If the table above doesn't show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._
 
 It is true that we used the training set of MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
 on 400million images (and some of them might have been from MSCOCO).
@@ -210,7 +213,7 @@ We evaluate the models computing the accuracy at different levels.
 | Accuracy@10 | **52.55** | 42.91 |
 | Accuracy@100 | **81.08** | 67.11 |
 
-_If the table above doesn not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_IR.png)._
+_If the table above doesn't show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_IR.png)._
 
 ### Discussion
 
@@ -233,24 +236,24 @@ Look at the following - slightly cherry picked - examples:
 
 ### Colors
 Here's "a yellow flower"
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="500"/>
 
 And here's "a blue flower"
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="500"/>
 
 ### Counting
 What about "one cat"?
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto.png" alt="drawing" width="500"/>
 
 And what about "two cats"?
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_gatti.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_gatti.png" alt="drawing" width="500"/>
 
 ### Complex Queries
 Have you ever seen "two brown horses"?
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_cavalli_marroni.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_cavalli_marroni.png" alt="drawing" width="500"/>
 
 And finally, here's a very nice "cat on a chair"
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto_su_sedia.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto_su_sedia.png" alt="drawing" width="500"/>
 
 
 # Broader Outlook
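The two retrieval hunks above report MRR@1, MRR@5, and MRR@10. For reference, here is a minimal, self-contained sketch of how MRR@K is typically computed from ranked retrieval results; the function and variable names are illustrative and not taken from the repository.

```python
# Illustrative MRR@K computation (names are not from the clip-italian codebase).
# ranked_ids[q] is the list of image ids returned for query q, best match first;
# relevant_id[q] is the single ground-truth image for that query.
def mrr_at_k(ranked_ids, relevant_id, k):
    total = 0.0
    for q, ranking in enumerate(ranked_ids):
        top_k = ranking[:k]
        if relevant_id[q] in top_k:
            rank = top_k.index(relevant_id[q]) + 1  # 1-based rank of the true image
            total += 1.0 / rank
        # queries whose true image falls outside the top k contribute 0
    return total / len(ranked_ids)

# Toy example: two queries whose true images are "a" and "b"
print(mrr_at_k([["a", "c", "d"], ["c", "b", "d"]], ["a", "b"], k=5))  # (1/1 + 1/2) / 2 = 0.75
```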
static/img/table_captions.png ADDED
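The newly added static/img/table_captions.png is the fallback image for the caption-quality table patched in above. That part of introduction.md describes three annotators scoring 100 translated captions from 1 to 4 and reports a weighted agreement of 0.858. The diff does not show which statistic was used; a quadratic-weighted Cohen's kappa averaged over annotator pairs is one common choice for this kind of ordinal annotation, sketched below with made-up ratings.

```python
# Hypothetical agreement computation for ordinal ratings (1-4). The exact statistic
# used in introduction.md is not shown in this diff; a quadratic-weighted Cohen's
# kappa averaged over annotator pairs is one common option for ordinal scores.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Made-up ratings: three annotators, one score per translated caption.
ratings = {
    "annotator_1": [4, 3, 4, 2, 4],
    "annotator_2": [4, 3, 3, 2, 4],
    "annotator_3": [4, 4, 4, 2, 3],
}

# Average the pairwise quadratic-weighted kappas over all annotator pairs.
pairs = list(combinations(ratings.values(), 2))
kappas = [cohen_kappa_score(a, b, weights="quadratic") for a, b in pairs]
print(sum(kappas) / len(kappas))
```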