vinid committed
Commit 3778721
1 Parent(s): c702c34

added some details to the readme

Files changed (1): introduction.md (+14, -7)
introduction.md CHANGED
@@ -70,8 +70,8 @@ We knew that without a good augmentation strategy we could never get competitive
## Better Training

After different trials, we realized that the usual way of training this model was
- not good enough to get good results. We thus modified two different parts of the
- training pipeline: the optimizer and the training with frozen components.
+ not good enough to get good results. We thus modified three different parts of the
+ training pipeline: the optimizer, the training with frozen components and the logit_scale parameter.

### Optimizer
 
@@ -83,7 +83,14 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required less training.
The same is true for the BERT model we use. To allow the randomly initialized Re-projection Layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen. Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.

- <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="80%"/>
+ <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="90%"/>
+
+ ### Logit Scale
+
+ We tried to improve the loss function in different ways: for example, we tried something similar to a margin-based loss, but those experiments
+ did not go well. Eventually, what worked best was fixing the logit_scale value to 20. This value
+ is used after the computation of the similarity between the images and the texts in CLIP (see the code [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/modeling_hybrid_clip.py#L64)).
+ We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence embeddings.
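The two-stage schedule described in the hunk above (backbones frozen while the randomly initialized Re-projection Layers warm up, then everything unfrozen for end-to-end fine-tuning) can be sketched roughly as follows. This is only an illustration in PyTorch-style Python with hypothetical placeholder modules; the actual model is implemented in Flax/JAX in `hybrid_clip/modeling_hybrid_clip.py`.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real components (the ViT and BERT backbones
# and the randomly initialized re-projection layers); names are illustrative only.
vision_backbone = nn.Linear(768, 768)
text_backbone = nn.Linear(768, 768)
visual_projection = nn.Linear(768, 512)
text_projection = nn.Linear(768, 512)

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient computation for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad_(flag)

# Stage 1: freeze both backbones so only the re-projection layers learn.
set_trainable(vision_backbone, False)
set_trainable(text_backbone, False)
# ... run the warm-up training loop until these layers converge ...

# Stage 2: unfreeze everything and fine-tune all components together.
set_trainable(vision_backbone, True)
set_trainable(text_backbone, True)
# ... continue training end to end ...
```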
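For the fixed logit scale, here is a minimal sketch of how a constant multiplier of 20 can enter the contrastive logits, assuming the scale is applied directly to the cosine similarities between image and text embeddings (again a PyTorch-style illustration; the linked Flax code is the authoritative implementation):

```python
import torch
import torch.nn.functional as F

LOGIT_SCALE = 20.0  # fixed value, not a learned parameter

def clip_logits(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # L2-normalize so the dot product is a cosine similarity in [-1, 1].
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Similarity matrix (batch x batch); the fixed scale sharpens the logits
    # before the symmetric cross-entropy loss.
    return LOGIT_SCALE * image_embeds @ text_embeds.t()

# Usage with random 512-d embeddings (illustrative only).
logits = clip_logits(torch.randn(8, 512), torch.randn(8, 512))
labels = torch.arange(logits.size(0))
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```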
 
# Scientific Validity
 
@@ -121,12 +128,12 @@ we use the MRR.
It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
on 400 million images (and some of them probably were from MSCOCO).

- [Colab: Image Retrieval Evaluation](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
+ You can find the colab to quickly rerun the experiments here: [Colab: Image Retrieval Evaluation](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)

### Zero-shot image classification

- This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet. To do this, we used DeepL to
- translate the image labels in ImageNet with DeepL. We evaluate the models computing the accuracy.
+ This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet.
+ To do this, we used DeepL to translate the image labels in ImageNet. We evaluate the models by computing the accuracy.

| Accuracy | CLIP-Italian | mCLIP |
@@ -136,7 +143,7 @@ translate the image labels in ImageNet with DeepL. We evaluate the models comput
| Accuracy@10 | **52.55** | 42.91 |
| Accuracy@100 | **81.08** | 67.11 |

- [Colab: ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
+ You can find the colab to quickly rerun the experiments here: [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
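As a rough illustration of how Accuracy@k numbers like the ones in the table above can be computed for zero-shot classification (rank the embeddings of the translated ImageNet labels by similarity to each image embedding and check whether the gold label appears among the top k), here is a small PyTorch-style sketch with random placeholder embeddings; the linked colab contains the actual evaluation:

```python
import torch
import torch.nn.functional as F

def accuracy_at_k(image_embeds: torch.Tensor,
                  label_embeds: torch.Tensor,
                  targets: torch.Tensor,
                  ks=(1, 5, 10, 100)) -> dict:
    """Fraction of images whose gold label is among the k most similar labels."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    label_embeds = F.normalize(label_embeds, dim=-1)
    sims = image_embeds @ label_embeds.t()            # (n_images, n_labels)
    ranked = sims.argsort(dim=-1, descending=True)    # label indices by similarity
    hits = ranked == targets.unsqueeze(1)             # True where the gold label sits
    return {k: hits[:, :k].any(dim=-1).float().mean().item() for k in ks}

# Illustrative call with random data: 1000 ImageNet classes, 512-d embeddings.
scores = accuracy_at_k(torch.randn(32, 512), torch.randn(1000, 512),
                       torch.randint(0, 1000, (32,)))
print(scores)
```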
 