# Italian CLIP
With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples.
In building this project we kept in mind the following principles:
+ **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence.
+ **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on several tasks and made the validation reproducible for everybody.
+ **Broader Outlook**: We always kept in mind the possible uses of this model.
We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were
able to make new friends and learn a lot from each other while working towards a common goal!
Thank you for this amazing opportunity, we hope you will like the results. :heart:
# Novel Contributions
The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian.
We therefore worked in a **low-resource setting**. The only captioning datasets with Italian text in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
To get competitive results we followed three strategies:
1. more data;
2. better augmentations;
3. better training.
## More Data
We had to accept that we do not have the same amount of data that OpenAI had when training CLIP.
Thus, we tried to add as much data as possible while keeping its quality as high as possible.
We considered three main sources of data:
+ WIT. Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
This kind of text, without more information, is not useful for learning a good mapping between images and captions.
On the other hand, the text is written in Italian and is of good quality.
To avoid polluting the data with captions that are not meaningful, we ran POS tagging
on the captions and removed every caption in which 80% or more of the tokens were tagged as PROPN (proper nouns).
Example: ....
+ MSCOCO-IT.
+ Conceptual Captions.
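
The PROPN filter described for WIT above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `propn_ratio` and `keep_caption` are hypothetical, and the POS tags are written by hand here, whereas in practice they would come from an Italian POS tagger of your choice.

```python
from typing import Sequence


def propn_ratio(pos_tags: Sequence[str]) -> float:
    """Return the fraction of tokens tagged PROPN (0.0 for an empty caption)."""
    if not pos_tags:
        return 0.0
    return sum(tag == "PROPN" for tag in pos_tags) / len(pos_tags)


def keep_caption(pos_tags: Sequence[str], threshold: float = 0.8) -> bool:
    """Keep a caption only if fewer than `threshold` of its tokens are PROPN."""
    return propn_ratio(pos_tags) < threshold


# Hand-written tag sequences, for illustration only.
descriptive = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
mostly_names = ["PROPN", "PROPN", "PROPN", "PROPN", "NUM"]

print(keep_caption(descriptive))   # True: no proper nouns, caption is kept
print(keep_caption(mostly_names))  # False: 4 of 5 tokens are PROPN, caption is removed
```

Whether empty captions should be kept or dropped is a separate decision; the sketch simply treats them as having a ratio of 0.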
## Better Augmentations
## Better Training
After several trials, we realized that the usual way of training this model did not
give good enough results. We therefore modified two parts of the
training pipeline: the optimizer and the training with frozen components.
### Optimizer
The standard AdamW didn't seem enough to train the model...
### Backbone Freezing
<img src="static/img/clip-italian.png" alt="drawing" width="200"/>
# Scientific Validity
## Quantitative Evaluation
Those images are definitely cool and interesting, but a model is nothing without validation.
To better understand how well our clip-italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model in Italian, we used the multilingual CLIP model as a comparison baseline.
### mCLIP
### Experiments Replication
We provide two Colab notebooks to replicate both experiments.
### Tasks
We selected two different tasks:
+ image-retrieval
+ zero-shot classification
### Image Retrieval
| MRR | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|
| MRR@1 | | |
| MRR@5 | | |
| MRR@10 | | |
[Colab: Image Retrieval Evaluation](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
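
For readers unfamiliar with the metric, here is a minimal sketch of how MRR@k can be computed from caption-to-image similarity scores. It is an illustration under stated assumptions, not the notebook's code: the function name `mrr_at_k`, the nested-list input format, and the toy scores are all hypothetical.

```python
from typing import Sequence


def mrr_at_k(similarities: Sequence[Sequence[float]],
             targets: Sequence[int],
             k: int) -> float:
    """Mean reciprocal rank, counting only targets ranked within the top k.

    similarities[i][j] is the score of image j for caption i;
    targets[i] is the index of the correct image for caption i.
    """
    total = 0.0
    for scores, target in zip(similarities, targets):
        # Image indices sorted by decreasing similarity, truncated to k.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(similarities)


# Toy example: 3 captions, 3 candidate images.
sims = [
    [0.9, 0.1, 0.0],  # correct image (0) ranked 1st
    [0.2, 0.3, 0.5],  # correct image (1) ranked 2nd
    [0.1, 0.8, 0.4],  # correct image (2) ranked 2nd
]
targets = [0, 1, 2]

print(mrr_at_k(sims, targets, k=1))  # 1/3: only the first caption scores
print(mrr_at_k(sims, targets, k=5))  # (1 + 0.5 + 0.5) / 3 = 2/3
```

A caption whose correct image falls outside the top k contributes 0, which is why MRR@k can only grow as k increases.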
### Zero-shot classification
| Accuracy | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|
| Accuracy@1 | **22.11** | 20.15 |
| Accuracy@5 | **43.69** | 36.57 |
| Accuracy@10 | **52.55** | 42.91 |
| Accuracy@100 | **81.08** | 67.11 |
[Colab: ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
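
The Accuracy@k values in the table count an image as correct when its true class is among the k highest-scoring class prompts. A minimal sketch of that computation, assuming per-image class scores are already available (the function name `accuracy_at_k` and the toy scores are hypothetical, not taken from the notebook):

```python
from typing import Sequence


def accuracy_at_k(scores: Sequence[Sequence[float]],
                  labels: Sequence[int],
                  k: int) -> float:
    """Percentage of images whose true label is among the k best-scoring classes.

    scores[i][c] is the image-text similarity between image i and the prompt
    for class c; labels[i] is the index of the true class of image i.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by decreasing score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example: 3 images, 3 classes.
scores = [
    [0.7, 0.2, 0.1],  # true class 0 is ranked 1st
    [0.1, 0.3, 0.6],  # true class 1 is ranked 2nd
    [0.4, 0.5, 0.1],  # true class 0 is ranked 2nd
]
labels = [0, 1, 0]

print(accuracy_at_k(scores, labels, k=1))  # one hit out of three
print(accuracy_at_k(scores, labels, k=2))  # 100.0
```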
## Qualitative Evaluation
# Broader Outlook
# Other Notes
This readme has been designed using resources from Flaticon.com