avi-skowron committed
Commit ac06357
1 Parent(s): 3c76473

Add evaluations

Files changed (1)
  1. README.md +39 -19
README.md CHANGED
@@ -21,13 +21,13 @@ same data, in the exact same order. All Pythia models are available
  The Pythia model suite was deliberately designed to promote scientific
  research on large language models, especially interpretability research.
  Despite not centering downstream performance as a design goal, we find the
- models match or exceed the performance of similar and same-sized models,
- such as those in the OPT and GPT-Neo suites.
+ models <a href="#evaluations">match or exceed</a> the performance of
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.
 
  Please note that all models in the *Pythia* suite were renamed in January
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
  comparing the old and new names</a> is provided in this model card, together
- with exact model parameter counts.
+ with exact parameter counts.
 
  ## Pythia-12B-deduped
 
@@ -143,8 +143,7 @@ tokenizer.decode(tokens[0])
  ```
 
  Revision/branch `step143000` corresponds exactly to the model checkpoint on
- the `main` branch of each model.
-
+ the `main` branch of each model.<br>
  For more information on how to use all Pythia models, see [documentation on
  GitHub](https://github.com/EleutherAI/pythia).
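The `revision` branches this hunk documents plug straight into `transformers`. A minimal sketch, assuming the GPT-NeoX classes the card's quickstart uses elsewhere (the prompt is illustrative):

```python
# Minimal sketch: load a specific Pythia-12B-deduped checkpoint by revision.
# Per the card, "step143000" is identical to the `main` branch; any other
# saved branch (e.g. "step1000") can be substituted.
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-12b-deduped",
    revision="step143000",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-12b-deduped",
    revision="step143000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
print(tokenizer.decode(tokens[0]))
```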
 
@@ -153,8 +152,7 @@ GitHub](https://github.com/EleutherAI/pythia).
  #### Training data
 
  Pythia-12B-deduped was trained on the Pile **after the dataset has been
- globally deduplicated**.
-
+ globally deduplicated**.<br>
  [The Pile](https://pile.eleuther.ai/) is a 825GiB general-purpose dataset in
  English. It was created by EleutherAI specifically for training large language
  models. It contains texts from 22 diverse sources, roughly broken down into
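Since EleutherAI also hosts the deduplicated Pile, a quick way to inspect what the model saw is to stream a few records. This is a sketch under stated assumptions: the Hub dataset ID and the `text` field name below are assumptions, so verify them against EleutherAI's dataset listings.

```python
# Illustrative only: stream a few documents from the globally deduplicated
# Pile instead of downloading the full ~825GiB corpus. The dataset ID and
# the "text" field are assumptions; check EleutherAI's Hub organization.
from datasets import load_dataset

pile_dedup = load_dataset(
    "EleutherAI/the_pile_deduplicated",  # assumed Hub dataset ID
    split="train",
    streaming=True,                      # avoid a full download
)

for example in pile_dedup.take(3):
    print(example["text"][:200].replace("\n", " "))
```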
@@ -170,9 +168,6 @@ mirror](https://the-eye.eu/public/AI/pile/).
 
  #### Training procedure
 
- Pythia uses the same tokenizer as [GPT-NeoX-
- 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
-
  All models were trained on the exact same data, in the exact same order. Each
  model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
  model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
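The token counts in this hunk are internally consistent and determine the checkpoint branch names. A small sanity check; the even 1000-step spacing is inferred from the card's `step1000`/`step143000` naming and should be treated as illustrative:

```python
# Check the checkpoint arithmetic stated in the card and list the evenly
# spaced revision names (pattern inferred from "step1000" ... "step143000").
TOKENS_PER_CHECKPOINT = 2_097_152_000  # one checkpoint per 1000 steps at a ~2M-token (2**21) batch
NUM_CHECKPOINTS = 143

assert TOKENS_PER_CHECKPOINT * NUM_CHECKPOINTS == 299_892_736_000  # total training tokens

revisions = [f"step{1000 * (i + 1)}" for i in range(NUM_CHECKPOINTS)]
print(revisions[0], revisions[-1])  # -> step1000 step143000
```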
@@ -186,21 +181,46 @@ checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
  consistency with all 2M batch models, so `step1000` is the first checkpoint
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
- (corresponding to 1000 “actual” steps).
-
- See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
- procedure, including [how to reproduce
- it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).
+ (corresponding to 1000 “actual” steps).<br>
+ See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
+ procedure, including [how to reproduce
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
+ Pythia uses the same tokenizer as [GPT-NeoX-
+ 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
 
  ### Evaluations
 
  All 16 *Pythia* models were evaluated using the [LM Evaluation
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
  the results by model and step at `results/json/*` in the [GitHub
- repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
-
- February 2023 note: select evaluations and comparison with OPT and BLOOM
- models will be added here at a later date.
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
+ Expand the sections below to see plots of evaluation results for all
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
+
+ <details>
+ <summary>LAMBADA – OpenAI</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>WinoGrande</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>AI2 Reasoning Challenge—Challenge Set</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>SciQ</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
+ </details>
 
  ### Naming convention and parameter count
 
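To reproduce one of the plotted numbers rather than read them off `results/json/*`, a rough sketch with the harness's Python entry point follows. The backend name, task names, and keyword arguments track recent lm-evaluation-harness releases (an assumption) and may not match the version used for these runs; consult the harness README for the exact invocation.

```python
# Rough sketch: re-run a subset of the plotted benchmarks with the
# LM Evaluation Harness. Backend and task names follow recent lm-eval
# releases and are assumptions relative to the original runs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face causal-LM backend
    model_args="pretrained=EleutherAI/pythia-12b-deduped,revision=step143000",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_challenge", "sciq"],
)
print(results["results"])
```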
 
 