g8a9 committed on
Commit c0920ff
1 Parent(s): fc579f8

Update README.md

Files changed (1)
  1. README.md +76 -23
README.md CHANGED
@@ -12,42 +12,95 @@ tags:
 - vision
 ---
 
- # CLIP-Italian
- CLIP Italian is a CLIP-like Model for Italian. The CLIP model (Contrastive Language–Image Pre-training) was developed by researchers at OpenAI and is able to efficiently learn visual concepts from natural language supervision.
-
- We fine-tuned a competitive Italian CLIP model with only ~1.4 million Italian image-text pairs. This model is part of the [Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.
-
- ## Training Data
- We considered three main sources of data:
- - [WIT](https://github.com/google-research-datasets/wit)
- - [MSCOCO-IT](https://github.com/crux82/mscoco-it)
- - [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/)
-
- ## Training Procedure
- Preprocessing, hardware used, hyperparameters...
-
- ## Evaluation Performance
-
- ## Limitations
-
- ## Usage
-
- ## Team members
 - Federico Bianchi ([vinid](https://huggingface.co/vinid))
 - Raphael Pisoni ([4rtemi5](https://huggingface.co/4rtemi5))
 - Giuseppe Attanasio ([g8a9](https://huggingface.co/g8a9))
 - Silvia Terragni ([silviatti](https://huggingface.co/silviatti))
 - Dario Balestri ([D3Reo](https://huggingface.co/D3Reo))
 - Gabriele Sarti ([gsarti](https://huggingface.co/gsarti))
- - Sri Lakshmi ([srisweet](https://huggingface.co/srisweet))
-
- ## Useful links
- - [CLIP Blog post](https://openai.com/blog/clip/)
- - [CLIP paper](https://arxiv.org/abs/2103.00020)
- - [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md)
- - [Community Week channel](https://discord.com/channels/858019234139602994/859711887520038933)
- - [Hybrid CLIP example scripts](https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip)
- - [Model Repository](https://huggingface.co/clip-italian/clip-italian-final/)
 - vision
 ---
 
+ # Italian CLIP
+
+ With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).
+
+ Do you want to test our model right away? We got you covered! You just need to head to our [demo application](https://huggingface.co/spaces/clip-italian/clip-italian-demo).
+ The demo also contains all the details of the project, from training tricks to our most impressive results, and much more!
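+
+ To make the architecture concrete, here is a minimal sketch (ours, not the official loading code for the released checkpoint) of how these two backbones can be assembled into a CLIP-style dual encoder; the `VisionTextDualEncoderModel` class and the file name below are assumptions for illustration only.
+
+ ```python
+ # Illustrative only: assemble a CLIP-style dual encoder from the two
+ # backbones named above. The released clip-italian checkpoint was trained
+ # separately; see the demo / model repository for the actual weights.
+ import torch
+ from PIL import Image
+ from transformers import (
+     AutoTokenizer,
+     AutoFeatureExtractor,
+     VisionTextDualEncoderModel,
+     VisionTextDualEncoderProcessor,
+ )
+
+ model = VisionTextDualEncoderModel.from_vision_text_pretrained(
+     "openai/clip-vit-base-patch32",           # vision backbone (ViT)
+     "dbmdz/bert-base-italian-xxl-cased",      # Italian BERT text backbone
+ )
+ tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+ feature_extractor = AutoFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")
+ processor = VisionTextDualEncoderProcessor(feature_extractor, tokenizer)
+
+ # Score an image against Italian captions. Note that the projection layers
+ # are randomly initialized at this point: learning them is exactly what the
+ # contrastive fine-tuning does.
+ image = Image.open("foto.jpg")                # hypothetical local image
+ texts = ["una foto di un gatto", "una foto di un cane"]
+ inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
+ with torch.no_grad():
+     logits_per_image = model(**inputs).logits_per_image
+ ```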
 
 
 
+ # Training data
+
+ We considered four main sources of data:
+
+ + [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)).
+
+ + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf).
+
+ + [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf).
+
+ + [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/). This image-caption dataset is collected from [Il Post](https://www.ilpost.it/), a prominent Italian online newspaper.
+
+ We used stronger data augmentation, strategic training choices (we have far less data than the original CLIP paper), and a backbone-freezing pre-training step, sketched below. For all the details, please refer to our [demo](https://huggingface.co/spaces/clip-italian/clip-italian-demo).
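+
+ As a rough illustration of the backbone-freezing idea (a PyTorch sketch of ours; the actual training was done in Flax/JAX as part of the community week, and the class/checkpoint names are the same assumptions as in the sketch above):
+
+ ```python
+ # Illustrative sketch: freeze both backbones so that only the projection
+ # heads (and the logit scale) are updated during the warm-up phase.
+ from transformers import VisionTextDualEncoderModel
+
+ model = VisionTextDualEncoderModel.from_vision_text_pretrained(
+     "openai/clip-vit-base-patch32", "dbmdz/bert-base-italian-xxl-cased"
+ )
+ for p in model.vision_model.parameters():
+     p.requires_grad = False
+ for p in model.text_model.parameters():
+     p.requires_grad = False
+
+ trainable = [p for p in model.parameters() if p.requires_grad]
+ # -> visual_projection, text_projection, logit_scale
+ ```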
+
+ # Scientific Validity
+
+ ## Quantitative Evaluation
+
+ To better understand how well our clip-italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model for Italian, we used the multilingual CLIP model as a comparison baseline.
+
+ ### mCLIP
+
+ The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his [sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
+
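+ For reference, the mCLIP baseline can be queried with a few lines of sentence-transformers; this is a minimal sketch, and the two checkpoint names are our assumption of the standard setup from the sentence-transformers model list.
+
+ ```python
+ # Minimal sketch of the mCLIP baseline: multilingual text encoder plus CLIP
+ # image encoder, both loaded through the sentence-transformers library.
+ from PIL import Image
+ from sentence_transformers import SentenceTransformer, util
+
+ text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
+ image_model = SentenceTransformer("clip-ViT-B-32")
+
+ txt_emb = text_model.encode(["una foto di un gatto", "una foto di un cane"])
+ img_emb = image_model.encode(Image.open("foto.jpg"))  # hypothetical local image
+
+ scores = util.cos_sim(txt_emb, img_emb)  # one similarity score per caption
+ ```
+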
+ ### Tasks
+
+ We selected two different tasks:
+ + image retrieval
+ + zero-shot classification
+
+ ### Reproducibility
+
+ Both experiments should be easy to replicate: we share the two Colab notebooks we used to compute the results.
+
+ + [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
+ + [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
+
+ ### Image Retrieval
+
+ This experiment is run on the MSCOCO-IT validation set (which we did not use during training). Given a caption as input, we search for the most similar image in the MSCOCO-IT validation set. As evaluation metric we use MRR@K.
+
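+ To make the metric concrete, here is a minimal sketch of MRR@K over a caption-image similarity matrix (our illustration; the actual evaluation lives in the Colab notebook linked above).
+
+ ```python
+ # MRR@K: for each caption, rank all candidate images by similarity; the score
+ # is the reciprocal rank of the matching image if it is in the top K, else 0.
+ import numpy as np
+
+ def mrr_at_k(similarities: np.ndarray, k: int) -> float:
+     """similarities[i, j] = score between caption i and image j;
+     the matching image for caption i is image i."""
+     reciprocal_ranks = []
+     for i, row in enumerate(similarities):
+         order = np.argsort(-row)                     # best-scoring image first
+         rank = int(np.where(order == i)[0][0]) + 1   # 1-based rank of the match
+         reciprocal_ranks.append(1.0 / rank if rank <= k else 0.0)
+     return float(np.mean(reciprocal_ranks))
+ ```
+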
+ | MRR    | CLIP-Italian | mCLIP  |
+ | ------ | ------------ | ------ |
+ | MRR@1  | **0.3797**   | 0.2874 |
+ | MRR@5  | **0.5039**   | 0.3957 |
+ | MRR@10 | **0.5204**   | 0.4129 |
+
+ It is true that we used the MSCOCO-IT training set during training, and this might give us an advantage. However, the original CLIP model was trained on 400 million images (and some of them were probably from MSCOCO).
+
+ ### Zero-shot image classification
+
+ This experiment replicates the original zero-shot ImageNet classification experiment run by OpenAI. To do this, we used DeepL to translate the ImageNet class labels into Italian. We evaluate the models by computing accuracy at different values of K.
+
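+ Analogously, here is a minimal sketch of Accuracy@K over precomputed, L2-normalized embeddings (our illustration; the array names are ours).
+
+ ```python
+ # Accuracy@K: an image counts as correct if its true class is among the K
+ # label prompts whose embeddings are most similar to the image embedding.
+ import numpy as np
+
+ def accuracy_at_k(image_embs: np.ndarray, label_embs: np.ndarray,
+                   true_labels: np.ndarray, k: int) -> float:
+     """image_embs: (N, d), label_embs: (C, d), both L2-normalized;
+     true_labels: (N,) integer class ids."""
+     scores = image_embs @ label_embs.T            # cosine similarities
+     topk = np.argsort(-scores, axis=1)[:, :k]     # K best labels per image
+     hits = (topk == true_labels[:, None]).any(axis=1)
+     return float(hits.mean())
+ ```
+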
+ | Accuracy (%) | CLIP-Italian | mCLIP |
+ | ------------ | ------------ | ----- |
+ | Accuracy@1   | **22.11**    | 20.15 |
+ | Accuracy@5   | **43.69**    | 36.57 |
+ | Accuracy@10  | **52.55**    | 42.91 |
+ | Accuracy@100 | **81.08**    | 67.11 |
+
+ Our results confirm that CLIP-Italian is very competitive and beats mCLIP on both tasks we tested. Note, however, that our results are lower than those reported in the original OpenAI paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). Since our results are in line with those obtained by mCLIP, we think the translated image labels might have had an impact on the final scores.
+
+ # Team members
 - Federico Bianchi ([vinid](https://huggingface.co/vinid))
 - Raphael Pisoni ([4rtemi5](https://huggingface.co/4rtemi5))
 - Giuseppe Attanasio ([g8a9](https://huggingface.co/g8a9))
 - Silvia Terragni ([silviatti](https://huggingface.co/silviatti))
 - Dario Balestri ([D3Reo](https://huggingface.co/D3Reo))
 - Gabriele Sarti ([gsarti](https://huggingface.co/gsarti))
+ - Sri Lakshmi ([srisweet](https://huggingface.co/srisweet))