# Italian CLIP

With a few tricks, we have been able to fine-tune a competitive CLIP-italian model with only 1 million training samples.

In building this project we kept in mind the following things:

+ **Novel Contributions**: we tried to bring something new to the table;
+ **Scientific Validity**: models can look very cool, but external validation is important to assess the real impact;
+ **Broader Outlook**: we always considered the possible uses of this model.

We put our **hearts** and **souls** into this project during this week! Not only did we work on a cool project, but we were
also able to meet new people and make new friends who worked together toward a common goal!
Thank you for this amazing opportunity; we hope you will like our project :heart:.

# Novel Contributions

The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian, and the only captioning datasets available for Italian in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT. To get competitive results, we followed three directions: 1) more data, 2) better augmentation, and 3) better training.

## More Data

We eventually had to accept that we do not have the same data that OpenAI had when training CLIP.
Thus, we opted for data of medium-high quality.

We considered three main sources of data:

+ WIT. Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
However, this kind of text, without more information, is not useful for learning a good mapping between images and captions. On the other hand,
this text is written in Italian and is of good quality. To avoid polluting the data with captions that are not meaningful, we ran POS tagging
on the data and removed all captions in which 80% or more of the tokens were tagged as PROPN (proper nouns).

+ MSCOCO-IT.

+ Conceptual Captions.
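
The proper-noun filter described for WIT can be sketched as follows. This is only an illustration under stated assumptions: the tagger used in the project is not named here, so the sketch takes captions that have already been tagged with Universal POS labels, and the function names (`propn_ratio`, `keep_caption`, `filter_captions`) and the example tags are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def propn_ratio(pos_tags: Iterable[str]) -> float:
    """Return the fraction of tokens tagged as PROPN (proper noun)."""
    tags = list(pos_tags)
    if not tags:
        # Assumption: an empty caption has no proper nouns.
        return 0.0
    return sum(1 for tag in tags if tag == "PROPN") / len(tags)


def keep_caption(pos_tags: Iterable[str], threshold: float = 0.8) -> bool:
    """Keep a caption only if fewer than `threshold` of its tokens are PROPN."""
    return propn_ratio(pos_tags) < threshold


def filter_captions(
    tagged_captions: Sequence[Tuple[str, Sequence[str]]],
    threshold: float = 0.8,
) -> List[str]:
    """Drop captions composed of `threshold` or more proper nouns."""
    return [
        caption
        for caption, tags in tagged_captions
        if keep_caption(tags, threshold)
    ]


# Hand-written tags for illustration; a real pipeline would obtain them
# from a POS tagger for Italian.
tagged = [
    ("Roberto Baggio", ["PROPN", "PROPN"]),
    ("Un gatto dorme sul divano", ["DET", "NOUN", "VERB", "ADP", "NOUN"]),
]
kept = filter_captions(tagged)
print(kept)
```

With these example tags, the first caption consists entirely of proper nouns and is dropped, while the descriptive caption is kept.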


## Better Augmentations

## Better Training

After several trials, we realized that the usual way of training this model was
not enough to get good results. We thus modified two different parts of the
training pipeline: the optimizer and the training with frozen components.

### Optimizer

The standard AdamW didn't seem enough to train the model...


### Backbone Freezing

<img src="static/img/clip-italian.png" alt="drawing" width="200"/>

# Scientific Validity

## Quantitative Evaluation
Those images are definitely cool and interesting, but a model is nothing without validation.
To better understand how well our CLIP-Italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model in Italian, we used the multilingual CLIP model (mCLIP) as a comparison baseline.

### mCLIP

### Experiments Replication
We provide two Colab notebooks to replicate both experiments.

### Tasks

We selected two different tasks: 
+ image retrieval
+ zero-shot classification

            
### Image Retrieval

| MRR             | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|
| MRR@1           |              |       |   
| MRR@5           |              |       |
| MRR@10          |              |       |
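
For readers unfamiliar with the metric, MRR@k (mean reciprocal rank, truncated at rank k) can be sketched as below. The exact evaluation protocol is not spelled out here, so this is a minimal illustration that assumes one correct image per caption query; the function name `mrr_at_k` and the toy rankings are hypothetical.

```python
from typing import Hashable, Sequence


def mrr_at_k(
    ranked_ids_per_query: Sequence[Sequence[Hashable]],
    gold_ids: Sequence[Hashable],
    k: int,
) -> float:
    """Mean reciprocal rank of the gold item, counting only the top-k results."""
    if len(ranked_ids_per_query) != len(gold_ids) or not gold_ids:
        raise ValueError("need one non-empty ranking per gold id")
    total = 0.0
    for ranked, gold in zip(ranked_ids_per_query, gold_ids):
        for rank, item in enumerate(ranked[:k], start=1):
            if item == gold:
                total += 1.0 / rank  # reciprocal of the 1-based rank
                break
        # A gold item outside the top k contributes 0 to the sum.
    return total / len(gold_ids)


# Toy example: three caption queries, each with its images ranked by similarity.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]
print(mrr_at_k(rankings, gold, k=1))
print(mrr_at_k(rankings, gold, k=5))
```

In a CLIP-style setup, the rankings would typically come from sorting images by the similarity between their embeddings and the caption embedding.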


### Zero-shot classification

| Accuracy          | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|
| Accuracy@1      |              |       |   
| Accuracy@5      |              |       |
| Accuracy@10     |              |       |
| Accuracy@100    |  81.08       | 67.11 |
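
Accuracy@k counts a prediction as correct when the true label appears among the k highest-scoring labels. The sketch below is an illustration rather than the evaluation code behind the table: it assumes each image comes with one similarity score per candidate label, and the name `top_k_accuracy` and the toy scores are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    scores_per_image: Sequence[Sequence[float]],
    true_labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of images whose true label index is among the k best scores."""
    if len(scores_per_image) != len(true_labels) or not true_labels:
        raise ValueError("need one non-empty score list per true label")
    hits = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Label indices sorted from highest to lowest score.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in order[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy example: three images, three candidate labels each.
scores = [
    [0.90, 0.05, 0.05],
    [0.20, 0.30, 0.50],
    [0.10, 0.60, 0.30],
]
labels = [0, 1, 2]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

In zero-shot classification with a CLIP-style model, the scores would typically be similarities between the image embedding and the text embedding of each label prompt.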

## Qualitative Evaluation

# Broader Outlook



# Other Notes
This README has been designed using resources from Flaticon.com.