4rtemi5 committed
Commit 70eefaa (1 parent: 0789e97)

Update introduction.md

Files changed (1): introduction.md (+5 -5)
introduction.md CHANGED
@@ -54,6 +54,8 @@ a dataset with 700K translated captions.

## Better Augmentations

+ We knew that without a good augmentation strategy we could never get results competitive with a model trained on 400 million images, so we implemented heavy augmentations to make the training more data-efficient. We made sure, however, to keep the hue augmentations limited, so that the model could still learn color definitions. While we would have liked to augment the captions as well, after some experimentation we settled on randomly sampling one of the five captions available per image in MSCOCO and leaving the caption text itself unmodified.
+
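
A minimal sketch of such a pipeline, assuming torchvision-style transforms and illustrative magnitudes (the exact augmentations and values used in our training code may differ):

```python
import random
from torchvision import transforms

# Heavy geometric and photometric augmentations, but with the hue jitter kept
# deliberately small so that color words remain learnable from the images.
image_augmentations = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.05),
    transforms.ToTensor(),
])

def sample_caption(captions):
    # MSCOCO provides up to five captions per image: pick one at random,
    # leaving the caption text itself unmodified.
    return random.choice(captions)
```
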
## Better Training

After different trials, we realized that the usual way of training this model was
@@ -62,17 +64,15 @@ training pipeline: the optimizer and the training with frozen components.

### Optimizer

- The standard AdamW didn't seem enough to train the model and thus we opted for a different optimization strategy. We eventually used AdaBelief with AGC and Cosine Annealing.
+ While the initial code used AdamW as the optimizer, we soon noticed that it introduced some undesirable properties into the training: the model started to overfit relatively quickly, and the weight decay made this effect worse. We eventually switched to an optimization strategy that had worked well for us in similar cases and used AdaBelief with Adaptive Gradient Clipping (AGC) and a cosine annealing schedule. Together with a slightly tuned learning rate, this helped us reduce the validation loss by 25%.
  Our implementation is available online [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/run_hybrid_clip.py#L667).
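
A minimal sketch of this optimizer setup, assuming optax and purely illustrative hyperparameters (the actual settings are in the script linked above):

```python
import optax

# Cosine annealing of the learning rate (values chosen for illustration only).
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-4,
    warmup_steps=500,
    decay_steps=50_000,
)

# AdaBelief combined with Adaptive Gradient Clipping: gradients are clipped
# relative to the parameter norm before the AdaBelief update is applied.
optimizer = optax.chain(
    optax.adaptive_grad_clip(0.01),           # AGC clipping factor (illustrative)
    optax.adabelief(learning_rate=schedule),
)
```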

### Backbone Freezing

The ViT used by OpenAI was already trained on 400 million images, and it is probably the element of our architecture that required the least training.
- The same is true for the BERT model we use. Thus, we decided to do a first training with the backbone of our architecture completely frozen, to allow
- the deeper layer to adapt to the new setting. Eventually, we run a new training, by fine-tuning al the components. This technique allowed us to
- reach a much better validation loss.
+ The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without disturbing the already tuned weights of the backbones, we first trained with the backbones of our architecture completely frozen. Only after these layers had converged did we unfreeze the rest of the model and fine-tune all the components. This technique allowed us to reach a much better validation loss.
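
A minimal sketch of this two-stage schedule, assuming optax, a flat parameter dictionary, and hypothetical parameter names:

```python
import optax

base_optimizer = optax.adabelief(learning_rate=1e-4)

def label_params(params):
    # Hypothetical naming: anything containing "projection" is a freshly
    # initialized re-projection layer; everything else is a pretrained backbone.
    return {
        name: "trainable" if "projection" in name else "frozen"
        for name in params
    }

# Stage 1: update only the projection layers; backbone parameters get a zero update.
stage1_optimizer = optax.multi_transform(
    {"trainable": base_optimizer, "frozen": optax.set_to_zero()},
    label_params,
)

# Stage 2: once the projection layers have converged, switch back to
# `base_optimizer` for all parameters and fine-tune the whole model.
```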
 
- <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="600"/>
+ <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="50%"/>
 
  # Scientific Validity