added some details to the readme
introduction.md  CHANGED  (+14 -7)
@@ -70,8 +70,8 @@ We knew that without a good augmentation strategy we could never get competitive
 ## Better Training
 
 After different trials, we realized that the usual way of training this model was
-not good enough to get good results. We thus modified
-training pipeline: the optimizer
+not good enough to get good results. We thus modified three different parts of the
+training pipeline: the optimizer, the training with frozen components and the logit_scale parameter.
 
 ### Optimizer
 
@@ -83,7 +83,14 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
 The ViT used by OpenAI was already trained on 400 million images, and it is probably the element of our architecture that required the least training.
 The same is true for the BERT model we use. To allow the randomly initialized Re-projection Layers to warm up without disturbing the tuned weights of the backbones, we first trained with the backbones of our architecture completely frozen. Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
 
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="90%"/>
+
+### Logit Scale
+
+We tried to improve the loss function in different ways: for example, we tried something similar to a margin-based loss, but those experiments
+didn't go well. Eventually, what worked best was fixing the logit_scale value to 20. This value
+is used after the computation of the similarity between the images and the texts in CLIP (see the code [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/modeling_hybrid_clip.py#L64)).
+We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence embeddings.
 
 # Scientific Validity
 
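The two-stage schedule described in the hunk above (warm up the randomly initialized Re-projection Layers with the ViT and BERT backbones frozen, then unfreeze and fine-tune everything) can be sketched as follows. This is only an illustrative PyTorch-style sketch: the project's actual training code lives in the linked repository (and appears to be JAX/Flax-based), and the module names and learning rates below are placeholder assumptions, not the repository's real values.

```python
import torch
from torch import nn

class ToyDualEncoder(nn.Module):
    """Stand-in for the real model: two pretrained backbones plus
    freshly initialized re-projection layers (placeholder names)."""
    def __init__(self, dim: int = 768, proj_dim: int = 512):
        super().__init__()
        self.vision_backbone = nn.Linear(dim, dim)         # placeholder for the ViT
        self.text_backbone = nn.Linear(dim, dim)           # placeholder for Italian BERT
        self.visual_projection = nn.Linear(dim, proj_dim)  # randomly initialized
        self.text_projection = nn.Linear(dim, proj_dim)    # randomly initialized

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradients for every parameter of a sub-module."""
    for p in module.parameters():
        p.requires_grad = flag

model = ToyDualEncoder()

# Stage 1: backbones frozen, only the re-projection layers receive gradients.
set_requires_grad(model.vision_backbone, False)
set_requires_grad(model.text_backbone, False)
stage1_optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
# ... train until the projection layers converge ...

# Stage 2: unfreeze everything and fine-tune all components,
# typically at a lower learning rate to avoid disturbing the backbones.
set_requires_grad(model.vision_backbone, True)
set_requires_grad(model.text_backbone, True)
stage2_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```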
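For context on the added Logit Scale paragraph: in the original CLIP the logit scale is a learned temperature that multiplies the image-text cosine similarities before the symmetric contrastive loss; the change above pins it to the constant 20 instead. The snippet below is a minimal sketch of where such a fixed scale enters the loss, not the repository's implementation (see the modeling_hybrid_clip.py link in the diff for the actual code).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          logit_scale: float = 20.0) -> torch.Tensor:
    """Symmetric CLIP-style contrastive loss with a *fixed* logit scale
    (a constant temperature instead of a learned parameter)."""
    # L2-normalize so the dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Scale the similarity matrix by the fixed temperature.
    logits_per_image = logit_scale * (image_embeds @ text_embeds.t())
    logits_per_text = logits_per_image.t()

    # The matching image/caption pairs sit on the diagonal.
    targets = torch.arange(image_embeds.size(0))
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2

# Toy batch of 8 image/caption embedding pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```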
@@ -121,12 +128,12 @@ we use the MRR.
 It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
 on 400 million images (and some of them were probably from MSCOCO).
 
-
+You can find the Colab notebook to quickly rerun the experiments here: [Colab](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
 
 ### Zero-shot image classification
 
-This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet.
-translate the image labels in ImageNet
+This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet.
+To do this, we used DeepL to translate the image labels in ImageNet. We evaluate the models by computing accuracy.
 
 
 | Accuracy | CLIP-Italian | mCLIP |
@@ -136,7 +143,7 @@ translate the image labels in ImageNet with DeepL. We evaluate the models comput
 | Accuracy@10 | **52.55** | 42.91 |
 | Accuracy@100 | **81.08** | 67.11 |
 
-
+You can find the Colab notebook to quickly rerun the experiments here: [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
 
 Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
 we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
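The zero-shot evaluation reported in the table above translates the ImageNet class names with DeepL, embeds them as texts, classifies each image by its similarity to the label embeddings, and reports accuracy@k. A self-contained sketch of that metric, with random tensors standing in for the real encoders and labels (not taken from the linked Colab), might look like this:

```python
import torch
import torch.nn.functional as F

def zero_shot_accuracy_at_k(image_embeds: torch.Tensor,
                            label_embeds: torch.Tensor,
                            targets: torch.Tensor,
                            ks=(1, 5, 10, 100)) -> dict:
    """Accuracy@k for zero-shot classification: an image counts as correct
    if its true class is among the k most similar (translated) label texts."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    label_embeds = F.normalize(label_embeds, dim=-1)
    sims = image_embeds @ label_embeds.t()          # (num_images, num_classes)
    ranked = sims.argsort(dim=-1, descending=True)  # classes sorted by similarity
    results = {}
    for k in ks:
        hits = (ranked[:, :k] == targets.unsqueeze(1)).any(dim=1)
        results[f"accuracy@{k}"] = hits.float().mean().item()
    return results

# Toy example: 16 images, 1000 ImageNet classes, random embeddings.
scores = zero_shot_accuracy_at_k(
    image_embeds=torch.randn(16, 512),
    label_embeds=torch.randn(1000, 512),
    targets=torch.randint(0, 1000, (16,)))
print(scores)
```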