Silvia Terragni committed
Commit 413a2a7
1 parent: 62daef8

fixed typo

Files changed (1)
  1. introduction.md +2 -1
introduction.md CHANGED
@@ -1,7 +1,7 @@
 
 CLIP-Italian is a **multimodal** model trained on **~1.4 Million** Italian text-image pairs using **Italian Bert** model as text encoder and Vision Transformer **ViT** as image encoder using the **JAX/Flax** neural network library. The training was carried out during the **Hugging Face** Community event on **Google's TPU** machines, sponsored by **Google Cloud**.
 
-Clip-Italian (Contrastive Language-Image Pre-training in Italian language) is based on OpenAI’s CLIP ([Radford et al., 2021](https://arxiv.org/abs/2103.00020))which is an amazing model that can learn to represent images and text jointly in the same space.
+Clip-Italian (Contrastive Language-Image Pre-training in Italian language) is based on OpenAI’s CLIP ([Radford et al., 2021](https://arxiv.org/abs/2103.00020)) which is an amazing model that can learn to represent images and text jointly in the same space.
 
 In this project, we aim to propose the first CLIP model trained on Italian data, that in this context can be considered a
 low resource language. Using a few techniques, we have been able to fine-tune a SOTA Italian CLIP model with **only 1.4M** training samples. Our Italian CLIP model
@@ -37,6 +37,7 @@ different applications that can start from here.
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian.
 We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
 To get competitive results, we followed three strategies:
+
 1. more and better data;
 2. better augmentations;
 3. better training strategies.
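
For context on the file being edited: the introduction describes a CLIP-style dual encoder (Italian BERT for text, ViT for images) trained contrastively with JAX/Flax. Below is a minimal sketch of the symmetric contrastive (InfoNCE) objective that setup implies; it is not the project's actual training code, the function name and sizes are illustrative, the encoders are stubbed with random features, and `optax` is an assumed dependency choice for the cross-entropy.

```python
# Sketch of the CLIP-style symmetric contrastive loss (not CLIP-Italian's real code).
# Paired image/text embeddings are normalized into a shared space and matched batch-wise.
import jax
import jax.numpy as jnp
import optax  # assumed dependency; any softmax cross-entropy would do


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / jnp.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / jnp.linalg.norm(text_emb, axis=-1, keepdims=True)

    # [batch, batch] similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.T / temperature
    labels = jnp.arange(logits.shape[0])

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss_t2i = optax.softmax_cross_entropy_with_integer_labels(logits.T, labels)
    return (loss_i2t.mean() + loss_t2i.mean()) / 2.0


key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
batch, dim = 8, 512  # illustrative sizes, not the model's real dimensions
image_emb = jax.random.normal(k1, (batch, dim))  # stand-in for ViT features
text_emb = jax.random.normal(k2, (batch, dim))   # stand-in for Italian BERT features
print(clip_contrastive_loss(image_emb, text_emb))
```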
 