Update README.md
tags:
- vision
---
# Italian CLIP

CLIP Italian is a CLIP-like model for Italian. The CLIP model (Contrastive Language–Image Pre-training) was developed by researchers at OpenAI and is able to efficiently learn visual concepts from natural language supervision.

With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).

Do you want to test our model right away? We got you covered! You just need to head to our [demo application](https://huggingface.co/spaces/clip-italian/clip-italian-demo).
The demo also contains all the details of the project, from training tricks to our most impressive results, and much more!
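If you prefer to work with the raw model instead of the demo, here is a minimal usage sketch, assuming the checkpoint in the [Model Repository](https://huggingface.co/clip-italian/clip-italian-final/) can be loaded as a `VisionTextDualEncoderModel` (a text encoder paired with a CLIP vision encoder) and ships a tokenizer; the model repository and the demo contain the exact, supported loading code.

```python
# Hedged sketch, not the official snippet: assumes the released checkpoint
# loads as a VisionTextDualEncoderModel and includes a tokenizer.
from PIL import Image
import requests
import torch
from transformers import (
    AutoTokenizer,
    CLIPImageProcessor,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

model_id = "clip-italian/clip-italian-final"  # repository linked in this README

# If the repository only ships Flax weights, add from_flax=True here.
model = VisionTextDualEncoderModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

# Score two Italian captions against one image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(
    text=["una foto di un gatto", "una foto di un cane"],
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption.
print(outputs.logits_per_image.softmax(dim=-1))
```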
# Training data

We considered four main sources of data:

+ [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)).
+ [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf).
+ [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf).
+ [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/). This image-caption dataset is collected from [Il Post](https://www.ilpost.it/), a prominent Italian online newspaper.

We used better data augmentation, strategic training choices (we have far less data than the original CLIP paper), and backbone-freezing pre-training. For all the details, please refer to our [demo](https://huggingface.co/spaces/clip-italian/clip-italian-demo).
# Scientific Validity

## Quantitative Evaluation

To better understand how well our clip-italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model for Italian, we used the multilingual CLIP model as a comparison baseline.

### mCLIP

The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his [sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
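For context, mCLIP can be used directly through sentence-transformers. The sketch below is our hedged reconstruction of such a baseline; the checkpoint names `clip-ViT-B-32` (image encoder) and `clip-ViT-B-32-multilingual-v1` (multilingual text encoder) are taken from the sentence-transformers documentation, not released by this project.

```python
# Hedged sketch of an mCLIP-style baseline (not this project's code):
# encode an image with the CLIP vision model and Italian text with the
# multilingual text encoder, then compare them with cosine similarity.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

image_model = SentenceTransformer("clip-ViT-B-32")
text_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")

image_embedding = image_model.encode(Image.open("foto.jpg"))
text_embeddings = text_model.encode(["una foto di un gatto", "una foto di un cane"])

print(util.cos_sim(image_embedding, text_embeddings))
```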
### Tasks

We selected two different tasks:

+ image retrieval
+ zero-shot classification

### Reproducibility

Both experiments should be very easy to replicate; we share the two Colab notebooks we used to compute the results:

+ [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
+ [ImageNet Zero-Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
### Image Retrieval

This experiment is run against the MSCOCO-IT validation set (which we did not use in training). Given a caption as input, we search for the most similar image in the MSCOCO-IT validation set. As the evaluation metric, we use MRR@K.
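To make the metric concrete, here is a small, illustrative implementation of MRR@K for caption-to-image retrieval; this is our own sketch, not the code from the notebook linked above.

```python
import numpy as np

def mrr_at_k(similarities: np.ndarray, k: int) -> float:
    """Mean Reciprocal Rank at K for caption-to-image retrieval.

    similarities[i, j] is the score between caption i and image j;
    the correct image for caption i is assumed to be image i.
    """
    reciprocal_ranks = []
    for i in range(similarities.shape[0]):
        # Indices of the top-k images for caption i, best first.
        top_k = np.argsort(-similarities[i])[:k]
        matches = np.where(top_k == i)[0]
        # Reciprocal rank is 0 if the correct image is not in the top k.
        reciprocal_ranks.append(1.0 / (matches[0] + 1) if matches.size else 0.0)
    return float(np.mean(reciprocal_ranks))

# Toy example with 3 captions and 3 images.
sims = np.array([[0.9, 0.1, 0.3],
                 [0.2, 0.8, 0.4],
                 [0.7, 0.5, 0.6]])
print(mrr_at_k(sims, k=1), mrr_at_k(sims, k=5))
```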
| MRR    | CLIP-Italian | mCLIP  |
| ------ | ------------ | ------ |
| MRR@1  | **0.3797**   | 0.2874 |
| MRR@5  | **0.5039**   | 0.3957 |
| MRR@10 | **0.5204**   | 0.4129 |

It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained on 400 million images (and some of them were probably from MSCOCO).
### Zero-shot image classification

This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet. To do this, we used DeepL to translate the ImageNet labels into Italian. We evaluate the models by computing accuracy at different values of K.
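For illustration, here is a hedged sketch of the zero-shot procedure, reusing `model`, `tokenizer`, `processor`, and `image` from the loading sketch in the introduction; the labels and prompt template below are placeholders, not the DeepL-translated ImageNet classes used in the notebook.

```python
import torch

# Placeholder Italian labels; the real evaluation uses the full set of
# DeepL-translated ImageNet classes.
italian_labels = ["gatto", "cane", "aereo"]
prompts = [f"una foto di {label}" for label in italian_labels]

text_inputs = tokenizer(prompts, return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
    )
    image_emb = model.get_image_features(pixel_values=image_inputs["pixel_values"])

# Normalize and rank labels by cosine similarity with the image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = image_emb @ text_emb.T

# Accuracy@K counts the prediction as correct when the true class is in the top K.
top_k = scores.topk(k=3, dim=-1).indices[0]
print([italian_labels[i] for i in top_k])
```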
| Accuracy (%) | CLIP-Italian | mCLIP |
| ------------ | ------------ | ----- |
| Accuracy@1   | **22.11**    | 20.15 |
| Accuracy@5   | **43.69**    | 36.57 |
| Accuracy@10  | **52.55**    | 42.91 |
| Accuracy@100 | **81.08**    | 67.11 |

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two tasks we have been testing. Note, however, that our results are lower than those shown in the original OpenAI paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). However, considering that our results are in line with those obtained by mCLIP, we think that the translated image labels might have had an impact on the final scores.
# Team members

- Federico Bianchi ([vinid](https://huggingface.co/vinid))
- Raphael Pisoni ([4rtemi5](https://huggingface.co/4rtemi5))
- Giuseppe Attanasio ([g8a9](https://huggingface.co/g8a9))
- Silvia Terragni ([silviatti](https://huggingface.co/silviatti))
- Dario Balestri ([D3Reo](https://huggingface.co/D3Reo))
- Gabriele Sarti ([gsarti](https://huggingface.co/gsarti))
- Sri Lakshmi ([srisweet](https://huggingface.co/srisweet))
## Useful links

- [CLIP Blog post](https://openai.com/blog/clip/)
- [CLIP paper](https://arxiv.org/abs/2103.00020)
- [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md)
- [Community Week channel](https://discord.com/channels/858019234139602994/859711887520038933)
- [Hybrid CLIP example scripts](https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip)
- [Model Repository](https://huggingface.co/clip-italian/clip-italian-final/)