Marissa committed on
Commit 4b9459f
1 Parent(s): fd7f576

Update README.md

Files changed (1)
  1. README.md +5 -1
README.md CHANGED
@@ -61,7 +61,11 @@ Users (both direct and downstream) should be made aware of the risks, biases and
 
  # Training
 
- See the [associated paper](https://arxiv.org/pdf/1901.07291.pdf) for details on the training data and training procedure.
+ The model developers write:
+
+ > In all experiments, we use a Transformer architecture with 1024 hidden units, 8 heads, GELU activations (Hendrycks and Gimpel, 2016), a dropout rate of 0.1 and learned positional embeddings. We train our models with the Adam optimizer (Kingma and Ba, 2014), a linear warm-up (Vaswani et al., 2017) and learning rates varying from 10^−4 to 5.10^−4.
+
+ See the [associated paper](https://arxiv.org/pdf/1901.07291.pdf) for links, citations, and further details on the training data and training procedure.
 
  The model developers also write that:
 
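For readers who want the quoted training setup in concrete terms, here is a minimal Python sketch of the hyperparameters and a linear warm-up schedule. It is an illustration, not the authors' training code. The names `QUOTED_HYPERPARAMS` and `linear_warmup_lr` are hypothetical. The upper learning rate reads "5.10^−4" as 5 × 10^−4. The warm-up length and the choice to hold the rate constant after warm-up are assumptions, since the quoted passage specifies neither.

```python
# Hyperparameters as stated in the quoted passage; the key names are illustrative.
QUOTED_HYPERPARAMS = {
    "hidden_units": 1024,
    "attention_heads": 8,
    "activation": "gelu",
    "dropout": 0.1,
    "positional_embeddings": "learned",
    "optimizer": "adam",
    # Assumption: "5.10^-4" in the quote is read as 5 x 10^-4.
    "learning_rate_range": (1e-4, 5e-4),
}


def linear_warmup_lr(step: int, peak_lr: float = 1e-4, warmup_steps: int = 4000) -> float:
    """Return the learning rate for a 0-indexed optimizer step.

    The rate grows linearly until it reaches peak_lr at the end of warm-up.
    Assumptions (not given in the quoted text): warmup_steps is a placeholder
    value, and the rate is held constant at peak_lr once warm-up ends.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps <= 0:
        return peak_lr
    # Fraction of warm-up completed, capped at 1.0 once warm-up is over.
    progress = min(1.0, (step + 1) / warmup_steps)
    return peak_lr * progress


if __name__ == "__main__":
    low, high = QUOTED_HYPERPARAMS["learning_rate_range"]
    for s in (0, 1999, 3999, 10000):
        print(s, linear_warmup_lr(s, peak_lr=high))
```

In a real training loop, a function like this would typically be wrapped in the scheduler of whichever framework is in use. The paper linked above remains the reference for the actual procedure.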