rskuzma commited on
Commit
a8cb8f2
1 Parent(s): 7ecd0af

fixed typo

Browse files
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -107,7 +107,7 @@ Recent works find significant duplicate data present in the Pile. Eleuther’s P
107
 
108
  We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1. All models are trained with MSL of 2048.
109
 
110
- All models were trained to Chinchilla point: 20x more tokens than model parameters. Number of steps changed based on fixed batch size (2048) and sequence length (varied by model). See Training Table, below, for detail.
111
 
112
  <br>
113
 
 
107
 
108
  We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1. All models are trained with MSL of 2048.
109
 
110
+ All models were trained to Chinchilla point: 20 tokens per model parameter. Number of steps was chosen based on optimal batch size (varied by model) and fixed sequence length (2048). See Training Table, below, for details.
111
 
112
  <br>
113