adding README.md and changing name to readme.md
Files changed:
- README.md (+34, -0)
- app.py (+1, -1)
- readme.md → introduction.md (+17, -9)
README.md
ADDED
@@ -0,0 +1,34 @@
+
+---
+title: Clip Italian Demo
+emoji: ⚡
+colorFrom: gray
+colorTo: pink
+sdk: streamlit
+app_file: app.py
+pinned: false
+---
+
+# Configuration
+
+`title`: _string_
+Display title for the Space
+
+`emoji`: _string_
+Space emoji (emoji-only character allowed)
+
+`colorFrom`: _string_
+Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+`colorTo`: _string_
+Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+`sdk`: _string_
+Can be either `gradio` or `streamlit`
+
+`app_file`: _string_
+Path to your main application file (which contains either `gradio` or `streamlit` Python code).
+Path is relative to the root of the repository.
+
+`pinned`: _boolean_
+Whether the Space stays on top of your list.
app.py
CHANGED
@@ -108,5 +108,5 @@ if query:

    st.image(image_paths)

-    intro_markdown = read_markdown_file("readme.md")
+    intro_markdown = read_markdown_file("introduction.md")
    st.markdown(intro_markdown, unsafe_allow_html=True)
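For context, `app.py` calls a `read_markdown_file` helper whose definition is not part of this diff. A minimal sketch of how such a helper and the surrounding Streamlit call might look (the repository's actual implementation may differ):

```python
# Hypothetical sketch of the read_markdown_file helper used above;
# the actual implementation in app.py is not shown in this diff.
from pathlib import Path

import streamlit as st


def read_markdown_file(markdown_file: str) -> str:
    """Return the raw text of a Markdown file stored next to app.py."""
    return Path(markdown_file).read_text(encoding="utf-8")


intro_markdown = read_markdown_file("introduction.md")
st.markdown(intro_markdown, unsafe_allow_html=True)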
readme.md → introduction.md
RENAMED
@@ -1,6 +1,8 @@
# Italian CLIP

-With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples.
+With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model
+is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI
+[vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).

In building this project we kept in mind the following principles:

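The added lines above name the two pretrained backbones the model is built from. As a rough illustration only (not the authors' training code, which lives in the linked clip-italian repository), the two checkpoints can be loaded with `transformers` and then joined, CLIP-style, by learned projection heads trained with a contrastive objective:

```python
# Illustrative sketch: load the two backbones named in the text.
# The actual CLIP-Italian training code (hybrid_clip/run_hybrid_clip.py)
# combines them with learned projection heads and a contrastive loss.
from transformers import AutoModel, AutoTokenizer, CLIPVisionModel

# Italian text encoder
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
text_encoder = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

# OpenAI vision transformer (image encoder)
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

print(text_encoder.config.hidden_size, vision_encoder.config.hidden_size)
```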
@@ -32,12 +34,12 @@ We considered three main sources of data:
[Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper as they are
the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
However, this kind of text, without more information, is not useful to learn a good mapping between images and captions.
-On the other hand, this text is written in Italian and it is of good quality.
-
-on the text and removed all the captions that were composed for the 80% or more by PROPN. This is a simple solution that allowed us to retain much
+On the other hand, this text is written in Italian and it is of good quality. We cannot just remove short captions, as some of those
+are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging*
+on the text and removed all the captions composed of 80% or more PROPN tokens (around 10% of the data). This is a simple solution that allowed us to retain much
of the dataset, without introducing noise.

-
+Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.

+ [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
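The filtering code itself is not part of this diff. A minimal sketch of the rule described in the added lines (drop captions in which 80% or more of the tokens are tagged PROPN), assuming spaCy's Italian pipeline, could look like this:

```python
# Sketch of the PROPN-based caption filter described above, assuming spaCy's
# Italian pipeline (it_core_news_sm); the authors' actual tooling may differ.
import spacy

nlp = spacy.load("it_core_news_sm")


def keep_caption(caption: str, threshold: float = 0.8) -> bool:
    """Keep a caption unless >= `threshold` of its tokens are proper nouns."""
    doc = nlp(caption)
    tokens = [t for t in doc if not t.is_space]
    if not tokens:
        return False
    propn_ratio = sum(t.pos_ == "PROPN" for t in tokens) / len(tokens)
    return propn_ratio < threshold


captions = ["Dora Riparia", "Un cane corre sulla spiaggia"]
filtered = [c for c in captions if keep_caption(c)]
print(filtered)  # expected: the proper-noun-only caption is dropped
```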
@@ -60,11 +62,16 @@ training pipeline: the optimizer and the training with frozen components.

### Optimizer

-The standard AdamW didn't seem enough to train the model
-
+The standard AdamW didn't seem enough to train the model, and thus we opted for a different optimization strategy. We eventually used AdaBelief with AGC and Cosine Annealing.
+Our implementation is available online [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/run_hybrid_clip.py#L667).

### Backbone Freezing

+The ViT used by OpenAI was already trained on 400 million images, and it is the element in our architecture that probably required the least training.
+The same is true for the BERT model we use. Thus, we decided to do a first training with the backbones of our architecture completely frozen, to allow
+the deeper layers to adapt to the new setting. Eventually, we ran a new training, fine-tuning all the components. This technique allowed us to
+reach a much better validation loss.
+
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="200"/>

# Scientific Validity
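The combination named in the Optimizer section (AdaBelief, adaptive gradient clipping, cosine annealing) maps directly onto `optax` transformations. A simplified sketch follows; the linked `run_hybrid_clip.py` is the authoritative version, and the hyperparameter values here are placeholders:

```python
# Simplified optax sketch of "AdaBelief with AGC and Cosine Annealing";
# the values below are placeholders, not the ones used in run_hybrid_clip.py.
import optax

total_steps = 10_000
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-4,
    warmup_steps=500,
    decay_steps=total_steps,
)

optimizer = optax.chain(
    optax.adaptive_grad_clip(0.01),        # AGC: clip relative to parameter norm
    optax.adabelief(learning_rate=schedule),  # AdaBelief with cosine-annealed LR
)
```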
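For the two-stage schedule described under Backbone Freezing (first train with both backbones frozen, then fine-tune everything), one way to express the frozen phase in `optax` is to route backbone parameters to a zero update. A sketch, assuming the parameter pytree has hypothetical top-level keys `text_model` and `vision_model` for the two backbones:

```python
# Sketch of phase-1 training with frozen backbones; "text_model" and
# "vision_model" are hypothetical top-level keys of the parameter pytree.
import jax
import optax

FROZEN = ("text_model", "vision_model")


def label_params(params):
    # Label every leaf under the backbone subtrees as "frozen".
    return {
        name: jax.tree_util.tree_map(
            lambda _: "frozen" if name in FROZEN else "trainable", subtree
        )
        for name, subtree in params.items()
    }


# Phase 1: backbones receive zero updates, only the remaining parameters train.
tx_frozen = optax.multi_transform(
    {"trainable": optax.adabelief(1e-4), "frozen": optax.set_to_zero()},
    label_params,
)

# Phase 2: restart from the phase-1 checkpoint and fine-tune all components.
tx_full = optax.adabelief(1e-5)
```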
@@ -107,7 +114,8 @@ on 400 million images (and some of them probably were from MSCOCO).

### Zero-shot image classification

-This experiment replicates the original one run by OpenAI on zero-shot image classification.
+This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet. To do this, we used DeepL to
+translate the image labels in ImageNet. We evaluate the models by computing accuracy.


| Accuracy | CLIP-Italian | mCLIP |
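The zero-shot protocol described above reduces to: embed each translated class label as text, embed the image, predict the label with the highest cosine similarity, and report the fraction of correct predictions. A small, model-agnostic sketch with placeholder embeddings:

```python
# Model-agnostic sketch of zero-shot classification accuracy: image_emb and
# label_emb stand in for embeddings produced by the image and text encoders;
# a real evaluation would compute them with the CLIP model itself.
import numpy as np


def zero_shot_accuracy(image_emb, label_emb, true_labels):
    """Top-1 accuracy of nearest-label prediction under cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    label_emb = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    predictions = (image_emb @ label_emb.T).argmax(axis=1)
    return float((predictions == np.asarray(true_labels)).mean())


rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 512))      # 8 images, 512-dim embeddings
label_emb = rng.normal(size=(1000, 512))   # 1000 class-label embeddings
print(zero_shot_accuracy(image_emb, label_emb, true_labels=[0] * 8))
```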
@@ -121,7 +129,7 @@ This experiment replicates the original one run by OpenAI on zero-shot image classification

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
-paper (see, [Radford et al., 2021](https://arxiv.org/abs/2103.00020)), considering that our results are in line with those obtained by mCLIP we think that
+paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). However, considering that our results are in line with those obtained by mCLIP, we think that
the translated image labels might have had an impact on the final scores.

## Qualitative Evaluation