VictorSanh committed
Commit cb68b97
1 Parent(s): 1eaac25
Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -116,7 +116,7 @@ The model is trained on the following data mixture of openly accessible English
 
  For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.
 
- Following (Dehghani et al., 2023)[https://huggingface.co/papers/2302.05442], we apply a layer normalization on the projected queries and keys of both the Perceiver and cross-attention blocks, which improved training stability in our early experiments. We use the [RMSNorm](https://huggingface.co/papers/1910.07467) implementation for trainable Layer Norms.
+ Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we apply a layer normalization on the projected queries and keys of both the Perceiver and cross-attention blocks, which improved training stability in our early experiments. We use the [RMSNorm](https://huggingface.co/papers/1910.07467) implementation for trainable Layer Norms.
 
  The training objective is the standard next token prediction.
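
The paragraph touched by this commit describes RMSNorm layer normalization applied to the projected queries and keys (QK-norm) of the Perceiver pooling and cross-attention blocks. Below is a minimal PyTorch sketch of that mechanism under standard multi-head assumptions; the class and parameter names are hypothetical illustrations, not the model's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Trainable RMS layer norm (Zhang & Sennrich, 2019)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal RMS over the feature dimension, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class QKNormedCrossAttention(nn.Module):
    """Cross-attention that RMS-normalizes projected queries and keys (QK-norm)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        # Per-head QK-norm, following Dehghani et al., 2023.
        self.q_norm = RMSNorm(self.head_dim)
        self.k_norm = RMSNorm(self.head_dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        b, tq, d = queries.shape
        tk = context.shape[1]
        q = self.q_proj(queries).view(b, tq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(context).view(b, tk, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(context).view(b, tk, self.num_heads, self.head_dim).transpose(1, 2)
        # Normalize queries and keys before the attention dot product.
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.o_proj(out.transpose(1, 2).reshape(b, tq, d))


class PerceiverPooler(nn.Module):
    """Pools a variable number of vision hidden states into a fixed set of latents."""

    def __init__(self, dim: int, num_latents: int = 64):
        super().__init__()
        self.latents = nn.Parameter(0.02 * torch.randn(num_latents, dim))
        self.attn = QKNormedCrossAttention(dim)

    def forward(self, vision_states: torch.Tensor) -> torch.Tensor:
        # vision_states: (batch, num_patches, dim) from the vision encoder.
        lat = self.latents.expand(vision_states.shape[0], -1, -1)
        return lat + self.attn(lat, vision_states)  # (batch, num_latents, dim)


# Usage: pool image features, then fuse them into the text stream.
pooler = PerceiverPooler(dim=512)
fuse = QKNormedCrossAttention(dim=512)
image_feats = torch.randn(2, 257, 512)  # e.g. ViT patch embeddings
text_hidden = torch.randn(2, 128, 512)  # language-model hidden states
pooled = pooler(image_feats)            # (2, 64, 512)
fused = text_hidden + fuse(text_hidden, pooled)
```

Normalizing queries and keys bounds the magnitude of the attention logits, which is the training-stability effect the README paragraph attributes to this change.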