vinid committed
Commit 3778721
1 Parent(s): c702c34

added some details to the readme

Files changed (1): introduction.md (+14, -7)
introduction.md CHANGED
@@ -70,8 +70,8 @@ We knew that without a good augmentation strategy we could never get competitive
## Better Training

After different trials, we realized that the usual way of training this model was
- not good enough to get good results. We thus modified two different parts of the
- training pipeline: the optimizer and the training with frozen components.
+ not good enough to get good results. We thus modified three different parts of the
+ training pipeline: the optimizer, the training with frozen components and the logit_scale parameter.

### Optimizer
 
@@ -83,7 +83,14 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required less training.
The same is true for the BERT model we use. To allow the randomly initialized Re-projection Layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen. Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.

- <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="80%"/>
+ <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="90%"/>
+
+ ### Logit Scale
+
+ We tried to improve the loss function in different ways: for example, we tried something similar to a margin-based loss, but those experiments
+ did not go well. Eventually, what worked best was fixing the logit_scale value to 20. This value
+ is used after the computation of the similarity between the images and the texts in CLIP (see the code [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/modeling_hybrid_clip.py#L64)).
+ We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence embeddings.
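The two-stage schedule described in the hunk above (backbones frozen while the randomly initialized Re-projection Layers warm up, then everything unfrozen for end-to-end fine-tuning) can be sketched roughly as follows. This is only an illustration in PyTorch-style Python with hypothetical placeholder modules; the actual model is implemented in Flax/JAX in `hybrid_clip/modeling_hybrid_clip.py`.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real components (the ViT and BERT backbones
# and the randomly initialized re-projection layers); names are illustrative only.
vision_backbone = nn.Linear(768, 768)
text_backbone = nn.Linear(768, 768)
visual_projection = nn.Linear(768, 512)
text_projection = nn.Linear(768, 512)

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient computation for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad_(flag)

# Stage 1: freeze both backbones so only the re-projection layers learn.
set_trainable(vision_backbone, False)
set_trainable(text_backbone, False)
# ... run the warm-up training loop until these layers converge ...

# Stage 2: unfreeze everything and fine-tune all components together.
set_trainable(vision_backbone, True)
set_trainable(text_backbone, True)
# ... continue training end to end ...
```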
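For the fixed logit scale, here is a minimal sketch of how a constant multiplier of 20 can enter the contrastive logits, assuming the scale is applied directly to the cosine similarities between image and text embeddings (again a PyTorch-style illustration; the linked Flax code is the authoritative implementation):

```python
import torch
import torch.nn.functional as F

LOGIT_SCALE = 20.0  # fixed value, not a learned parameter

def clip_logits(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # L2-normalize so the dot product is a cosine similarity in [-1, 1].
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Similarity matrix (batch x batch); the fixed scale sharpens the logits
    # before the symmetric cross-entropy loss.
    return LOGIT_SCALE * image_embeds @ text_embeds.t()

# Usage with random 512-d embeddings (illustrative only).
logits = clip_logits(torch.randn(8, 512), torch.randn(8, 512))
labels = torch.arange(logits.size(0))
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```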
 
# Scientific Validity
 
@@ -121,12 +128,12 @@ we use the MRR.
It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
on 400 million images (and some of them probably were from MSCOCO).

- [Colab: Image Retrieval Evaluation](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
+ You can find the colab to quickly rerun the experiments here: [Colab: Image Retrieval Evaluation](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)

### Zero-shot image classification

- This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet. To do this, we used DeepL to
- translate the image labels in ImageNet with DeepL. We evaluate the models computing the accuracy.
+ This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet.
+ To do this, we used DeepL to translate the image labels in ImageNet. We evaluate the models by computing the accuracy.

| Accuracy | CLIP-Italian | mCLIP |
@@ -136,7 +143,7 @@ translate the image labels in ImageNet with DeepL. We evaluate the models comput
| Accuracy@10 | **52.55** | 42.91 |
| Accuracy@100 | **81.08** | 67.11 |

- [Colab: ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
+ You can find the colab to quickly rerun the experiments here: [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
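As a rough illustration of how Accuracy@k numbers like the ones in the table above can be computed for zero-shot classification (rank the embeddings of the translated ImageNet labels by similarity to each image embedding and check whether the gold label appears among the top k), here is a small PyTorch-style sketch with random placeholder embeddings; the linked colab contains the actual evaluation:

```python
import torch
import torch.nn.functional as F

def accuracy_at_k(image_embeds: torch.Tensor,
                  label_embeds: torch.Tensor,
                  targets: torch.Tensor,
                  ks=(1, 5, 10, 100)) -> dict:
    """Fraction of images whose gold label is among the k most similar labels."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    label_embeds = F.normalize(label_embeds, dim=-1)
    sims = image_embeds @ label_embeds.t()            # (n_images, n_labels)
    ranked = sims.argsort(dim=-1, descending=True)    # label indices by similarity
    hits = ranked == targets.unsqueeze(1)             # True where the gold label sits
    return {k: hits[:, :k].any(dim=-1).float().mean().item() for k in ks}

# Illustrative call with random data: 1000 ImageNet classes, 512-d embeddings.
scores = accuracy_at_k(torch.randn(32, 512), torch.randn(1000, 512),
                       torch.randint(0, 1000, (32,)))
print(scores)
```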
 