The original CLIP model was trained on 400 million text-image pairs; this amount of data is currently not available for Italian.

## More Data

We eventually had to deal with the fact that we did not have the same amount of data that OpenAI used to train CLIP. We therefore opted for data of medium-to-high quality.

We considered three main sources of data:

+ WIT. Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994). However, this kind of text, without additional context, is not very useful for learning a good mapping between images and captions. On the other hand, the text is written in Italian and is of good quality. To prevent polluting the data with captions that are not meaningful, we ran POS tagging on the data and removed all captions composed of 80% or more proper nouns (PROPN); a sketch of such a filter follows this list.
+ MSCOCO-IT
+ CC
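
As a rough sketch of the PROPN filter described in the WIT bullet above (the spaCy Italian pipeline and the exact threshold handling are assumptions, not necessarily the tools we used), caption filtering can look like this:

```python
# Illustrative caption filter: drop captions whose tokens are 80% or more
# proper nouns (PROPN). spaCy's Italian model is an assumption here.
import spacy

nlp = spacy.load("it_core_news_sm")

def keep_caption(caption: str, max_propn_ratio: float = 0.8) -> bool:
    doc = nlp(caption)
    tokens = [tok for tok in doc if not tok.is_punct and not tok.is_space]
    if not tokens:
        return False
    propn_ratio = sum(tok.pos_ == "PROPN" for tok in tokens) / len(tokens)
    return propn_ratio < max_propn_ratio

captions = ["Roberto Baggio nel 1994", "Un gatto che dorme su un divano rosso"]
kept = [c for c in captions if keep_caption(c)]
```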
## Better Augmentations
## Better Training
### Optimizer
### Backbone Freezing
![Backbone Freezing](static/img/clip-italian.png)
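
The figure above summarises the idea. As a generic illustration (module names like `vision_model` and `text_model` are hypothetical, and this is a PyTorch-style sketch rather than our training code), freezing the backbones means excluding their parameters from optimisation:

```python
# Generic sketch of backbone freezing: the pretrained image and text encoders
# are frozen, and only the remaining layers (e.g. the projection heads) are
# updated. Attribute names are assumptions for illustration.
import torch

def freeze_backbones(model: torch.nn.Module) -> None:
    for encoder in (model.vision_model, model.text_model):
        for param in encoder.parameters():
            param.requires_grad = False

def trainable_parameters(model: torch.nn.Module):
    # Only the parameters left unfrozen reach the optimizer.
    return (p for p in model.parameters() if p.requires_grad)
```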
# Scientific Validity

Those images are definitely cool and interesting, but a model is nothing without validation. To better understand how well our clip-italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model in Italian, we used the multilingual CLIP model (mCLIP) as a comparison baseline.
## mCLIP
## Tasks
We selected two different tasks:
+ image-retrieval
+ zero-shot classification
## Image Retrieval

| MRR    | CLIP-Italian | mCLIP |
| ------ | ------------ | ----- |
| MRR@1  |              |       |
| MRR@5  |              |       |
| MRR@10 |              |       |
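
As a reference for the metric in the table, here is a minimal sketch of how MRR@K can be computed for caption-to-image retrieval (an illustration under simplifying assumptions, not our exact evaluation script; caption i is assumed to match image i):

```python
# Illustrative MRR@K for retrieval. `similarity` is a (num_captions, num_images)
# score matrix; the correct image for caption i is assumed to be image i.
import numpy as np

def mrr_at_k(similarity: np.ndarray, k: int) -> float:
    ranking = np.argsort(-similarity, axis=1)        # best-scoring images first
    reciprocal_ranks = []
    for i, ranked_images in enumerate(ranking):
        match = np.where(ranked_images[:k] == i)[0]  # position of the correct image in the top-k
        reciprocal_ranks.append(1.0 / (match[0] + 1) if match.size else 0.0)
    return float(np.mean(reciprocal_ranks))
```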
## Zero-shot classification

| Accuracy     | CLIP-Italian | mCLIP |
| ------------ | ------------ | ----- |
| Accuracy@1   |              |       |
| Accuracy@5   |              |       |
| Accuracy@10  |              |       |
| Accuracy@100 | 81.08        | 67.11 |
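
To make the task concrete, here is a hedged sketch of zero-shot classification with a CLIP-style model (the Italian prompt template and the `encode_text` / `encode_image` helpers are assumptions, not part of this repository): class names become text prompts, and the image is assigned to the classes with the most similar text embeddings. Accuracy@K then checks whether the true label appears among the top K predictions.

```python
# Illustrative zero-shot classification with a CLIP-style model. `encode_text`
# and `encode_image` are assumed callables returning L2-normalised embeddings;
# the prompt template is also an assumption.
import numpy as np

def zero_shot_topk(image, class_names, encode_text, encode_image, k=5):
    prompts = [f"una foto di {name}" for name in class_names]
    text_emb = encode_text(prompts)      # shape: (num_classes, dim)
    image_emb = encode_image(image)      # shape: (dim,)
    scores = text_emb @ image_emb        # cosine similarity, since embeddings are normalised
    top = np.argsort(-scores)[:k]
    return [class_names[i] for i in top]
```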
# Broader Outlook
This README has been designed using resources from Flaticon.com.