Text Generation
Transformers
PyTorch
English
gpt2
causal-lm
text-generation-inference
Inference Endpoints
rskuzma commited on
Commit
81d9a2e
1 Parent(s): 08b60b4
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -107,7 +107,7 @@ Recent works find significant duplicate data present in the Pile. Eleuther’s P
107
 
108
  We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1. All models are trained with MSL of 2048.
109
 
110
- All models were trained to Chinchilla point: 20x more tokens than model parameters. Number of steps changed based on fixed batch size (2048) and sequence length (varied by model). See Training Table, below, for detail.
111
 
112
  <br>
113
 
 
107
 
108
  We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1. All models are trained with MSL of 2048.
109
 
110
+ All models were trained to Chinchilla point: 20 tokens per model parameter. Number of steps was chosen based on optimal batch size (varied by model) and fixed sequence length (2048). See Training Table, below, for detail.
111
 
112
  <br>
113