---
language: it
license: gpl-3.0
datasets:
- wit
- ctl/conceptualCaptions
- mscoco-it
tags:
- italian
- bert
- vit
- vision
---

# Italian CLIP

Paper: [Contrastive Language-Image Pre-training for the Italian Language](https://arxiv.org/abs/2108.08688)

With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).
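
For readers new to the CLIP setup, the sketch below illustrates the general dual-encoder idea: a text encoder and an image encoder are each followed by a projection into a shared embedding space, and a symmetric contrastive (InfoNCE) loss pulls matching image-caption pairs together within a batch. This is a minimal PyTorch illustration of the technique, not our actual training code; the backbone calls and projection names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Minimal CLIP-style dual encoder: two pretrained backbones plus linear
    projections into a shared embedding space (illustrative placeholders)."""

    def __init__(self, text_backbone, image_backbone, text_dim, image_dim, embed_dim=512):
        super().__init__()
        self.text_backbone = text_backbone    # e.g. an Italian BERT encoder
        self.image_backbone = image_backbone  # e.g. a ViT image encoder
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)
        # learnable temperature, initialised to log(1/0.07) as in CLIP
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, text_inputs, images):
        t = self.text_proj(self.text_backbone(**text_inputs).pooler_output)
        v = self.image_proj(self.image_backbone(images))
        return F.normalize(t, dim=-1), F.normalize(v, dim=-1)

def clip_loss(text_emb, image_emb, logit_scale):
    """Symmetric InfoNCE loss over a batch of matching (caption, image) pairs."""
    logits = logit_scale.exp() * text_emb @ image_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```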

Do you want to test our model right away? We've got you covered! Just head over to our [demo application](https://huggingface.co/spaces/clip-italian/clip-italian-demo).
The demo also contains all the details of the project, from training tricks to our most impressive results, and much more!

# Training data

We considered four main sources of data:

+ [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)).

+ [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf).

+ [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from 
the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf).

+ [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/). This image-caption dataset is collected from [Il Post](https://www.ilpost.it/), a prominent Italian online newspaper.

We relied on stronger data augmentation, careful training choices (we had far less data than the original CLIP paper), and a backbone-freezing pre-training phase, sketched below. For all the details, please refer to our [demo](https://huggingface.co/spaces/clip-italian/clip-italian-demo).
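
As a rough illustration of the backbone-freezing idea (the exact schedule is described in the demo), the snippet below trains only the projection heads during a warm-up phase while both pretrained backbones stay frozen, before unfreezing them for full fine-tuning. It reuses the hypothetical `DualEncoder` sketch from above.

```python
# Hypothetical warm-up phase: optimise only the projection heads while the
# pretrained BERT and ViT backbones stay frozen (reusing the sketch above).
model = DualEncoder(text_backbone, image_backbone, text_dim=768, image_dim=768)

for p in model.text_backbone.parameters():
    p.requires_grad = False
for p in model.image_backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# After the warm-up, the backbones would be unfrozen and fine-tuning would
# continue at a lower learning rate.
```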

# Experiments

## Quantitative Evaluation

To better understand how well our CLIP-Italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model for Italian, we used the multilingual CLIP model as a comparison baseline.

### mCLIP

Multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his
[sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder
that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
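
As background, the sketch below shows the core idea of multilingual knowledge distillation: a multilingual student encoder is trained to reproduce the embeddings of a fixed English teacher on parallel sentence pairs. The `encode` calls are placeholders for whatever embedding interface the models expose.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, english_sentences, italian_sentences):
    """Student mimics the fixed teacher's embeddings for both the English
    sentences and their Italian translations (placeholder interfaces)."""
    with torch.no_grad():
        target = teacher.encode(english_sentences)   # fixed teacher embeddings
    loss_en = F.mse_loss(student.encode(english_sentences), target)
    loss_it = F.mse_loss(student.encode(italian_sentences), target)
    return loss_en + loss_it
```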

### Tasks

We selected two different tasks: 
+ image-retrieval 
+ zero-shot classification

### Reproducibility

Both experiments should be easy to replicate; we share the two Colab notebooks we used to compute the results:

+ [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
+ [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)


### Image Retrieval

This experiment is run against the MSCOCO-IT validation set (which we did not use in training). Given a caption as input,
we search for the most similar image in the MSCOCO-IT validation set. As evaluation metric we use MRR@K.
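
For reference, this is roughly how MRR@K can be computed: for each caption we rank all candidate images by cosine similarity and take the reciprocal rank of the matching image, counting it as zero if it falls outside the top K. A minimal NumPy sketch, not the exact evaluation script (see the Colab notebook above for that):

```python
import numpy as np

def mrr_at_k(caption_embs, image_embs, k):
    """caption_embs[i] and image_embs[i] form the matching pair.
    Embeddings are assumed to be L2-normalized."""
    sims = caption_embs @ image_embs.T             # (n_captions, n_images) cosine similarities
    ranks = (-sims).argsort(axis=1)                # images sorted from most to least similar
    reciprocal_ranks = []
    for i, order in enumerate(ranks):
        pos = int(np.where(order == i)[0][0]) + 1  # 1-based rank of the correct image
        reciprocal_ranks.append(1.0 / pos if pos <= k else 0.0)
    return float(np.mean(reciprocal_ranks))
```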

| MRR             | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|
| MRR@1           | **0.3797**   | 0.2874|   
| MRR@5           | **0.5039**   | 0.3957|
| MRR@10          | **0.5204**   | 0.4129|

It is true that we used the MSCOCO-IT training set during training, and this might give us an advantage. However, the original CLIP model was trained
on 400 million images (and some of them were probably from MSCOCO).


### Zero-shot image classification

This experiment replicates the original zero-shot image classification experiment run by OpenAI on ImageNet.
To do this, we used DeepL to translate the ImageNet class labels into Italian. We evaluate the models by computing accuracy at different values of K.
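
Conceptually, zero-shot classification embeds every translated class label with the text encoder (for example inside an Italian prompt template), embeds each image, and ranks the labels by cosine similarity; Accuracy@K is the fraction of images whose true label appears among the top K. A minimal sketch with placeholder inputs:

```python
import numpy as np

def zero_shot_accuracy_at_k(image_embs, label_embs, true_labels, k):
    """image_embs: (n_images, d), label_embs: (n_classes, d), both L2-normalized.
    true_labels[i] is the index of image i's correct class."""
    sims = image_embs @ label_embs.T              # cosine similarity to every class label
    topk = (-sims).argsort(axis=1)[:, :k]         # k most similar class labels per image
    hits = [true_labels[i] in topk[i] for i in range(len(true_labels))]
    return float(np.mean(hits))

# Label embeddings would come from encoding Italian prompts such as
# "una foto di {label}" (labels translated with DeepL), then L2-normalizing.
```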


| Accuracy        | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|
| Accuracy@1      |  **22.11**   | 20.15 |   
| Accuracy@5      |  **43.69**   | 36.57 |
| Accuracy@10     |  **52.55**   | 42.91 |
| Accuracy@100    |  **81.08**   | 67.11 |

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on both tasks we tested. Note, however, that our results are lower than those reported in the original OpenAI
paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). Since our results are in line with those obtained by mCLIP, we think that
the translated image labels might have had an impact on the final scores.


# Team members
- Federico Bianchi ([vinid](https://huggingface.co/vinid))
- Raphael Pisoni ([4rtemi5](https://huggingface.co/4rtemi5))
- Giuseppe Attanasio ([g8a9](https://huggingface.co/g8a9))
- Silvia Terragni ([silviatti](https://huggingface.co/silviatti))
- Dario Balestri ([D3Reo](https://huggingface.co/D3Reo))
- Gabriele Sarti ([gsarti](https://huggingface.co/gsarti))
- Sri Lakshmi ([srisweet](https://huggingface.co/srisweet))