vinid committed on
Commit
f1abd41
1 Parent(s): cf1218c

adding README.md and changing name to readme.md

Files changed (3)
  1. README.md +34 -0
  2. app.py +1 -1
  3. readme.md → introduction.md +17 -9
README.md ADDED
@@ -0,0 +1,34 @@
+
+ ---
+ title: Clip Italian Demo
+ emoji: ⚡
+ colorFrom: gray
+ colorTo: pink
+ sdk: streamlit
+ app_file: app.py
+ pinned: false
+ ---
+
+ # Configuration
+
+ `title`: _string_
+ Display title for the Space
+
+ `emoji`: _string_
+ Space emoji (emoji-only character allowed)
+
+ `colorFrom`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+ `colorTo`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+ `sdk`: _string_
+ Can be either `gradio` or `streamlit`
+
+ `app_file`: _string_
+ Path to your main application file (which contains either `gradio` or `streamlit` Python code).
+ Path is relative to the root of the repository.
+
+ `pinned`: _boolean_
+ Whether the Space stays on top of your list.
app.py CHANGED
@@ -108,5 +108,5 @@ if query:
 
     st.image(image_paths)
 
-    intro_markdown = read_markdown_file("readme.md")
+    intro_markdown = read_markdown_file("introduction.md")
     st.markdown(intro_markdown, unsafe_allow_html=True)
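The `read_markdown_file` helper is defined elsewhere in `app.py` and not shown in this diff; a minimal implementation consistent with the call sites above might look like this (hypothetical sketch, not the project's actual code):

```python
from pathlib import Path

def read_markdown_file(markdown_file: str) -> str:
    """Return the contents of a markdown file, resolved relative to the repo root."""
    return Path(markdown_file).read_text(encoding="utf-8")
```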
readme.md → introduction.md RENAMED
@@ -1,6 +1,8 @@
 # Italian CLIP
 
- With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples.
 
 In building this project we kept in mind the following principles:
 
@@ -32,12 +34,12 @@ We considered three main sources of data:
 [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper as they are
 the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
 However, this kind of text, without more information, is not useful to learn a good mapping between images and captions.
- On the other hand, this text is written in Italian and it is of good quality.
- To prevent polluting the data with captions that are not meaningful, we used *POS tagging*
- on the text and removed all the captions that were composed for the 80% or more by PROPN. This is a simple solution that allowed us to retain much
 of the dataset, without introducing noise.
 
- Example: ....
 
 + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
 MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
@@ -60,11 +62,16 @@ training pipeline: the optimizer and the training with frozen components.
 
 ### Optimizer
 
- The standard AdamW didn't seem enough to train the model...
-
 
 ### Backbone Freezing
 
 <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="200"/>
 
 # Scientific Validity
@@ -107,7 +114,8 @@ on 400 million images (and some of them were probably from MSCOCO).
 
 ### Zero-shot image classification
 
- This experiment replicates the original one run by OpenAI on zero-shot image classification.
 
 
 | Accuracy | CLIP-Italian | mCLIP |
@@ -121,7 +129,7 @@ This experiment replicates the original one run by OpenAI on zero-shot image cla
 
 Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
 we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
- paper (see, [Radford et al., 2021](https://arxiv.org/abs/2103.00020)), considering that our results are in line with those obtained by mCLIP we think that
 the translated image labels might have had an impact on the final scores.
 
 ## Qualitative Evaluation
 
 # Italian CLIP
 
+ With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model
+ is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI
+ [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).
 
 In building this project we kept in mind the following principles:
 
 
 [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper as they are
 the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
 However, this kind of text, without more information, is not useful to learn a good mapping between images and captions.
+ On the other hand, this text is written in Italian and it is of good quality. We cannot simply remove short captions, as some of them
+ are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging*
+ on the text and removed all the captions composed of 80% or more proper nouns (PROPN), around 10% of the data. This is a simple solution that allowed us to retain much
 of the dataset, without introducing noise.
 
+ Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.
 
 + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
 MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
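As an illustration of the PROPN filtering rule described above, here is a minimal sketch with the POS tagger stubbed out; `propn_fraction` and `keep_caption` are hypothetical names, not the project's actual pipeline code:

```python
def propn_fraction(pos_tags):
    """Fraction of tokens tagged as proper nouns (PROPN)."""
    if not pos_tags:
        return 0.0
    return sum(tag == "PROPN" for tag in pos_tags) / len(pos_tags)

def keep_caption(pos_tags, threshold=0.8):
    """Drop a caption when 80% or more of its tokens are proper nouns."""
    return propn_fraction(pos_tags) < threshold

# "Anna Maria Mozzoni" tags as all PROPN, so it is dropped:
assert not keep_caption(["PROPN", "PROPN", "PROPN"])
# a short but meaningful caption such as "cane che corre" ("running dog") is kept:
assert keep_caption(["NOUN", "SCONJ", "VERB"])
```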
 
 
 ### Optimizer
 
+ The standard AdamW didn't seem enough to train the model, so we opted for a different optimization strategy: we eventually used AdaBelief with Adaptive Gradient Clipping (AGC) and Cosine Annealing.
+ Our implementation is available online [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/run_hybrid_clip.py#L667).
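The linked script holds the actual optimizer setup; as a rough, hypothetical sketch of two of the ideas mentioned (the cosine-annealing schedule and the AGC scaling factor; AdaBelief itself is omitted for brevity):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: decay from lr_max at step 0 to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def agc_scale(param_norm, grad_norm, clipping=0.01, eps=1e-3):
    """Adaptive Gradient Clipping: factor that caps the gradient norm
    at `clipping` times the (floored) parameter norm."""
    if grad_norm <= 0.0:
        return 1.0
    max_norm = clipping * max(param_norm, eps)
    return min(1.0, max_norm / grad_norm)
```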
 
 ### Backbone Freezing
 
+ The ViT used by OpenAI was already trained on 400 million images, and it is probably the component of our architecture that required the least training.
+ The same is true for the BERT model we use. Thus, we did a first training run with the backbone of our architecture completely frozen, to allow
+ the deeper layers to adapt to the new setting. Eventually, we ran a new training run, fine-tuning all the components. This technique allowed us to
+ reach a much better validation loss.
+
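The two-phase schedule described above can be sketched as a simple trainability predicate over the parameter tree (the parameter names below are illustrative, not the actual Flax parameter paths):

```python
# Phase 1 trains only the projection heads on top of the frozen backbones;
# phase 2 unfreezes everything.
BACKBONE_PREFIXES = ("text_model.", "vision_model.")

def is_trainable(param_name, phase):
    """Return True if this parameter should receive updates in the given phase."""
    if phase >= 2:  # second run: fine-tune all the components
        return True
    return not param_name.startswith(BACKBONE_PREFIXES)

params = [
    "text_model.encoder.layer_0.kernel",
    "vision_model.encoder.layer_0.kernel",
    "text_projection.kernel",
    "visual_projection.kernel",
]
assert [p for p in params if is_trainable(p, phase=1)] == [
    "text_projection.kernel",
    "visual_projection.kernel",
]
assert all(is_trainable(p, phase=2) for p in params)
```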
 <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="200"/>
 
 # Scientific Validity
 
 
 ### Zero-shot image classification
 
+ This experiment replicates the original one run by OpenAI: zero-shot image classification on ImageNet. To do this, we used DeepL to
+ translate the ImageNet image labels into Italian. We evaluate the models by computing accuracy.
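Schematically, zero-shot classification reduces to a nearest-label search in embedding space: embed each translated label as text, embed the image, and pick the label with the highest cosine similarity. In this sketch the CLIP encoders are replaced by toy vectors (all names hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_predict(image_emb, label_embs):
    """Pick the (translated) label whose text embedding is closest to the image."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))

# toy embeddings standing in for the CLIP text/image encoder outputs
labels = {"gatto": [1.0, 0.0], "cane": [0.0, 1.0]}
assert zero_shot_predict([0.9, 0.1], labels) == "gatto"
```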
 
 
 | Accuracy | CLIP-Italian | mCLIP |
 
 
 Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
 we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
+ paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). However, considering that our results are in line with those obtained by mCLIP, we think that
 the translated image labels might have had an impact on the final scores.
 
 ## Qualitative Evaluation