Update README.md
tags:
- vision
---
# Italian CLIP

CLIP Italian is a CLIP-like model for Italian. The CLIP model (Contrastive Language–Image Pre-training) was developed by researchers at OpenAI and is able to efficiently learn visual concepts from natural language supervision.

With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).

Do you want to test our model right away? We got you covered! You just need to head to our [demo application](https://huggingface.co/spaces/clip-italian/clip-italian-demo).
The demo also contains all the details of the project, from training tricks to our most impressive results, and much more!
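If you prefer to work with the raw model instead of the demo, here is a minimal usage sketch, assuming the checkpoint in the [Model Repository](https://huggingface.co/clip-italian/clip-italian-final/) can be loaded as a `VisionTextDualEncoderModel` (a text encoder paired with a CLIP vision encoder) and ships a tokenizer; the model repository and the demo contain the exact, supported loading code.

```python
# Hedged sketch, not the official snippet: assumes the released checkpoint
# loads as a VisionTextDualEncoderModel and includes a tokenizer.
from PIL import Image
import requests
import torch
from transformers import (
    AutoTokenizer,
    CLIPImageProcessor,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

model_id = "clip-italian/clip-italian-final"  # repository linked in this README

# If the repository only ships Flax weights, add from_flax=True here.
model = VisionTextDualEncoderModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

# Score two Italian captions against one image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(
    text=["una foto di un gatto", "una foto di un cane"],
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption.
print(outputs.logits_per_image.softmax(dim=-1))
```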
# Training data

We considered four main sources of data:

+ [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)).
+ [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf).
+ [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf).
+ [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/). This image-caption dataset is collected from [Il Post](https://www.ilpost.it/), a prominent Italian online newspaper.

We used better data augmentation, strategic training choices (we have far less data than the original CLIP paper), and backbone-freezing pre-training. For all the details, please refer to our [demo](https://huggingface.co/spaces/clip-italian/clip-italian-demo).
# Scientific Validity

## Quantitative Evaluation

To better understand how well our clip-italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model for Italian, we used the multilingual CLIP model as a comparison baseline.

### mCLIP

The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his [sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
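For context, mCLIP can be used directly through sentence-transformers. The sketch below is our hedged reconstruction of such a baseline; the checkpoint names `clip-ViT-B-32` (image encoder) and `clip-ViT-B-32-multilingual-v1` (multilingual text encoder) are taken from the sentence-transformers documentation, not released by this project.

```python
# Hedged sketch of an mCLIP-style baseline (not this project's code):
# encode an image with the CLIP vision model and Italian text with the
# multilingual text encoder, then compare them with cosine similarity.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

image_model = SentenceTransformer("clip-ViT-B-32")
text_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")

image_embedding = image_model.encode(Image.open("foto.jpg"))
text_embeddings = text_model.encode(["una foto di un gatto", "una foto di un cane"])

print(util.cos_sim(image_embedding, text_embeddings))
```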
### Tasks

We selected two different tasks:

+ image retrieval
+ zero-shot classification

### Reproducibility

Both experiments should be very easy to replicate; we share the two Colab notebooks we used to compute the results:

+ [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
+ [ImageNet Zero-Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
### Image Retrieval

This experiment is run against the MSCOCO-IT validation set (which we did not use in training). Given a caption as input, we search for the most similar image in the MSCOCO-IT validation set. As the evaluation metric, we use MRR@K.
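To make the metric concrete, here is a small, illustrative implementation of MRR@K for caption-to-image retrieval; this is our own sketch, not the code from the notebook linked above.

```python
import numpy as np

def mrr_at_k(similarities: np.ndarray, k: int) -> float:
    """Mean Reciprocal Rank at K for caption-to-image retrieval.

    similarities[i, j] is the score between caption i and image j;
    the correct image for caption i is assumed to be image i.
    """
    reciprocal_ranks = []
    for i in range(similarities.shape[0]):
        # Indices of the top-k images for caption i, best first.
        top_k = np.argsort(-similarities[i])[:k]
        matches = np.where(top_k == i)[0]
        # Reciprocal rank is 0 if the correct image is not in the top k.
        reciprocal_ranks.append(1.0 / (matches[0] + 1) if matches.size else 0.0)
    return float(np.mean(reciprocal_ranks))

# Toy example with 3 captions and 3 images.
sims = np.array([[0.9, 0.1, 0.3],
                 [0.2, 0.8, 0.4],
                 [0.7, 0.5, 0.6]])
print(mrr_at_k(sims, k=1), mrr_at_k(sims, k=5))
```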
| MRR    | CLIP-Italian | mCLIP  |
| ------ | ------------ | ------ |
| MRR@1  | **0.3797**   | 0.2874 |
| MRR@5  | **0.5039**   | 0.3957 |
| MRR@10 | **0.5204**   | 0.4129 |

It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained on 400 million images (and some of them were probably from MSCOCO).
### Zero-shot image classification

This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet. To do this, we used DeepL to translate the ImageNet labels into Italian. We evaluate the models by computing accuracy at different values of K.
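For illustration, here is a hedged sketch of the zero-shot procedure, reusing `model`, `tokenizer`, `processor`, and `image` from the loading sketch in the introduction; the labels and prompt template below are placeholders, not the DeepL-translated ImageNet classes used in the notebook.

```python
import torch

# Placeholder Italian labels; the real evaluation uses the full set of
# DeepL-translated ImageNet classes.
italian_labels = ["gatto", "cane", "aereo"]
prompts = [f"una foto di {label}" for label in italian_labels]

text_inputs = tokenizer(prompts, return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
    )
    image_emb = model.get_image_features(pixel_values=image_inputs["pixel_values"])

# Normalize and rank labels by cosine similarity with the image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = image_emb @ text_emb.T

# Accuracy@K counts the prediction as correct when the true class is in the top K.
top_k = scores.topk(k=3, dim=-1).indices[0]
print([italian_labels[i] for i in top_k])
```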
| Accuracy (%) | CLIP-Italian | mCLIP |
| ------------ | ------------ | ----- |
| Accuracy@1   | **22.11**    | 20.15 |
| Accuracy@5   | **43.69**    | 36.57 |
| Accuracy@10  | **52.55**    | 42.91 |
| Accuracy@100 | **81.08**    | 67.11 |

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two tasks we have been testing. Note, however, that our results are lower than those shown in the original OpenAI paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). However, considering that our results are in line with those obtained by mCLIP, we think that the translated image labels might have had an impact on the final scores.
# Team members

- Federico Bianchi ([vinid](https://huggingface.co/vinid))
- Raphael Pisoni ([4rtemi5](https://huggingface.co/4rtemi5))
- Giuseppe Attanasio ([g8a9](https://huggingface.co/g8a9))
- Silvia Terragni ([silviatti](https://huggingface.co/silviatti))
- Dario Balestri ([D3Reo](https://huggingface.co/D3Reo))
- Gabriele Sarti ([gsarti](https://huggingface.co/gsarti))
- Sri Lakshmi ([srisweet](https://huggingface.co/srisweet))
## Useful links

- [CLIP Blog post](https://openai.com/blog/clip/)
- [CLIP paper](https://arxiv.org/abs/2103.00020)
- [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md)
- [Community Week channel](https://discord.com/channels/858019234139602994/859711887520038933)
- [Hybrid CLIP example scripts](https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip)
- [Model Repository](https://huggingface.co/clip-italian/clip-italian-final/)