Update README.md
README.md CHANGED
```diff
@@ -61,7 +61,11 @@ Users (both direct and downstream) should be made aware of the risks, biases and
 
 # Training
 
-
+The model developers write:
+
+> In all experiments, we use a Transformer architecture with 1024 hidden units, 8 heads, GELU activations (Hendrycks and Gimpel, 2016), a dropout rate of 0.1 and learned positional embeddings. We train our models with the Adam optimizer (Kingma and Ba, 2014), a linear warm-up (Vaswani et al., 2017) and learning rates varying from 10^−4 to 5.10^−4.
+
+See the [associated paper](https://arxiv.org/pdf/1901.07291.pdf) for links, citations, and further details on the training data and training procedure.
 
 The model developers also write that:
 
```
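For reference, the sketch below shows one way the quoted hyperparameters might be expressed with the `transformers` and `torch` libraries. Only the values named in the quote (1024 hidden units, 8 heads, GELU activations, dropout 0.1, learned positional embeddings, Adam with a linear warm-up, learning rates up to 5.10^−4) come from the model developers; the layer count, vocabulary size, and step counts are placeholder assumptions.

```python
# Illustrative sketch only: maps the quoted hyperparameters onto an
# XLM-style configuration. Values not stated in the quote (n_layers,
# vocab_size, warm-up/total steps) are placeholder assumptions.
import torch
from transformers import XLMConfig, XLMModel, get_linear_schedule_with_warmup

config = XLMConfig(
    emb_dim=1024,                 # 1024 hidden units
    n_heads=8,                    # 8 attention heads
    dropout=0.1,                  # dropout rate of 0.1
    gelu_activation=True,         # GELU activations (Hendrycks and Gimpel, 2016)
    sinusoidal_embeddings=False,  # learned (not sinusoidal) positional embeddings
    n_layers=6,                   # assumption: not specified in the quote
    vocab_size=30145,             # assumption: depends on the BPE vocabulary used
)
model = XLMModel(config)

# Adam with a warm-up; the quoted learning rates range from 10^-4 to
# 5.10^-4, so 5e-4 is used here as an example. The linear-decay shape
# after warm-up is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=4000,        # assumption: warm-up length is not quoted
    num_training_steps=100_000,   # assumption: total steps are not quoted
)
```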