Silvia Terragni committed
Commit 413a2a7
1 parent: 62daef8

fixed typo

Files changed (1)
  1. introduction.md +2 -1
introduction.md CHANGED
@@ -1,7 +1,7 @@
 
 CLIP-Italian is a **multimodal** model trained on **~1.4 Million** Italian text-image pairs using **Italian Bert** model as text encoder and Vision Transformer **ViT** as image encoder using the **JAX/Flax** neural network library. The training was carried out during the **Hugging Face** Community event on **Google's TPU** machines, sponsored by **Google Cloud**.
 
-Clip-Italian (Contrastive Language-Image Pre-training in Italian language) is based on OpenAI’s CLIP ([Radford et al., 2021](https://arxiv.org/abs/2103.00020))which is an amazing model that can learn to represent images and text jointly in the same space.
+Clip-Italian (Contrastive Language-Image Pre-training in Italian language) is based on OpenAI’s CLIP ([Radford et al., 2021](https://arxiv.org/abs/2103.00020)) which is an amazing model that can learn to represent images and text jointly in the same space.
 
 In this project, we aim to propose the first CLIP model trained on Italian data, that in this context can be considered a
 low resource language. Using a few techniques, we have been able to fine-tune a SOTA Italian CLIP model with **only 1.4M** training samples. Our Italian CLIP model
@@ -37,6 +37,7 @@ different applications that can start from here.
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian.
 We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
 To get competitive results, we followed three strategies:
+
 1. more and better data;
 2. better augmentations;
 3. better training strategies.
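
For context on the file being edited: the introduction describes a CLIP-style dual encoder (Italian BERT for text, ViT for images) trained contrastively with JAX/Flax. Below is a minimal sketch of the symmetric contrastive (InfoNCE) objective that setup implies; it is not the project's actual training code, the function name and sizes are illustrative, the encoders are stubbed with random features, and `optax` is an assumed dependency choice for the cross-entropy.

```python
# Sketch of the CLIP-style symmetric contrastive loss (not CLIP-Italian's real code).
# Paired image/text embeddings are normalized into a shared space and matched batch-wise.
import jax
import jax.numpy as jnp
import optax  # assumed dependency; any softmax cross-entropy would do


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / jnp.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / jnp.linalg.norm(text_emb, axis=-1, keepdims=True)

    # [batch, batch] similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.T / temperature
    labels = jnp.arange(logits.shape[0])

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss_t2i = optax.softmax_cross_entropy_with_integer_labels(logits.T, labels)
    return (loss_i2t.mean() + loss_t2i.mean()) / 2.0


key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
batch, dim = 8, 512  # illustrative sizes, not the model's real dimensions
image_emb = jax.random.normal(k1, (batch, dim))  # stand-in for ViT features
text_emb = jax.random.normal(k2, (batch, dim))   # stand-in for Italian BERT features
print(clip_contrastive_loss(image_emb, text_emb))
```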
 