avi-skowron committed on
Commit
1448af2
1 Parent(s): eb601f1

Add evaluations


Five eval plots and minor formatting changes

Files changed (1)
  1. README.md +37 -17
README.md CHANGED
@@ -21,13 +21,13 @@ same data, in the exact same order. All Pythia models are available
21
  The Pythia model suite was deliberately designed to promote scientific
22
  research on large language models, especially interpretability research.
23
  Despite not centering downstream performance as a design goal, we find the
24
- models match or exceed the performance of similar and same-sized models,
25
- such as those in the OPT and GPT-Neo suites.
26
 
27
  Please note that all models in the *Pythia* suite were renamed in January
28
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
29
  comparing the old and new names</a> is provided in this model card, together
30
- with exact model parameter counts.
31
 
32
  ## Pythia-12B
33
 
@@ -143,8 +143,7 @@ tokenizer.decode(tokens[0])
143
  ```
144
 
145
  Revision/branch `step143000` corresponds exactly to the model checkpoint on
146
- the `main` branch of each model.
147
-
148
  For more information on how to use all Pythia models, see [documentation on
149
  GitHub](https://github.com/EleutherAI/pythia).
150
 
@@ -163,15 +162,11 @@ methodology, and a discussion of ethical implications. Consult [the
163
  datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
164
  about the Pile and its component datasets. The Pile can be downloaded from
165
  the [official website](https://pile.eleuther.ai/), or from a [community
166
- mirror](https://the-eye.eu/public/AI/pile/).
167
-
168
  The Pile was **not** deduplicated before being used to train Pythia-12B.
169
 
170
  #### Training procedure
171
 
172
- Pythia uses the same tokenizer as [GPT-NeoX-
173
- 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
174
-
175
  All models were trained on the exact same data, in the exact same order. Each
176
  model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
177
  model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
@@ -185,21 +180,46 @@ checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
185
  consistency with all 2M batch models, so `step1000` is the first checkpoint
186
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
187
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
188
- (corresponding to 1000 “actual” steps).
189
-
190
  See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
191
  procedure, including [how to reproduce
192
- it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).
 
 
193
 
194
  ### Evaluations
195
 
196
  All 16 *Pythia* models were evaluated using the [LM Evaluation
197
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
198
  the results by model and step at `results/json/*` in the [GitHub
199
- repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
200
-
201
- February 2023 note: select evaluations and comparison with OPT and BLOOM
202
- models will be added here at a later date.
203
 
204
  ### Naming convention and parameter count
205
 
 
21
  The Pythia model suite was deliberately designed to promote scientific
22
  research on large language models, especially interpretability research.
23
  Despite not centering downstream performance as a design goal, we find the
24
+ models <a href="#evaluations">match or exceed</a> the performance of
25
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.
26
 
27
  Please note that all models in the *Pythia* suite were renamed in January
28
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
29
  comparing the old and new names</a> is provided in this model card, together
30
+ with exact parameter counts.
31
 
32
  ## Pythia-12B
33
 
 
143
  ```
144
 
145
  Revision/branch `step143000` corresponds exactly to the model checkpoint on
146
+ the `main` branch of each model.<br>
 
147
  For more information on how to use all Pythia models, see [documentation on
148
  GitHub](https://github.com/EleutherAI/pythia).
149
 
 
162
  datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
163
  about the Pile and its component datasets. The Pile can be downloaded from
164
  the [official website](https://pile.eleuther.ai/), or from a [community
165
+ mirror](https://the-eye.eu/public/AI/pile/).<br>
 
166
  The Pile was **not** deduplicated before being used to train Pythia-12B.
167
 
168
  #### Training procedure
169
 
 
 
 
170
  All models were trained on the exact same data, in the exact same order. Each
171
  model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
172
  model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
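The token and checkpoint figures quoted in this hunk can be cross-checked with a little arithmetic. The sketch below is not part of the commit. It assumes checkpoints are labelled every 1,000 steps (`step1000` through `step143000`), which is inferred from the revision names mentioned elsewhere in the README rather than stated here. The helper names `checkpoint_labels` and `tokens_seen` are hypothetical.

```python
# Figures quoted in the model card text above.
TOTAL_TOKENS = 299_892_736_000
TOKENS_PER_CHECKPOINT = 2_097_152_000
NUM_CHECKPOINTS = 143

# Assumption (inferred, not stated in this hunk): checkpoint labels advance
# by 1,000 steps, so the final label is step143000.
STEPS_PER_CHECKPOINT = 1_000


def checkpoint_labels():
    """Return the assumed revision labels: step1000, step2000, ..., step143000."""
    return [
        f"step{i * STEPS_PER_CHECKPOINT}" for i in range(1, NUM_CHECKPOINTS + 1)
    ]


def tokens_seen(label):
    """Return how many training tokens precede the checkpoint named `label`."""
    if not label.startswith("step"):
        raise ValueError(f"not a checkpoint label: {label!r}")
    step = int(label[len("step"):])
    last_step = NUM_CHECKPOINTS * STEPS_PER_CHECKPOINT
    if step <= 0 or step > last_step or step % STEPS_PER_CHECKPOINT != 0:
        raise ValueError(f"no checkpoint is assumed to exist at step {step}")
    return (step // STEPS_PER_CHECKPOINT) * TOKENS_PER_CHECKPOINT


if __name__ == "__main__":
    # 143 evenly spaced checkpoints should account for every training token.
    assert NUM_CHECKPOINTS * TOKENS_PER_CHECKPOINT == TOTAL_TOKENS
    labels = checkpoint_labels()
    print(labels[0], tokens_seen(labels[0]))
    print(labels[-1], tokens_seen(labels[-1]))
```

Under that labelling assumption, each step covers 2,097,152 tokens, which is consistent with the "2M batch" wording used later in the README.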
 
180
  consistency with all 2M batch models, so `step1000` is the first checkpoint
181
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
182
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
183
+ (corresponding to 1000 “actual” steps).<br>
 
184
  See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
185
  procedure, including [how to reproduce
186
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
187
+ Pythia uses the same tokenizer as [GPT-NeoX-
188
+ 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
189
 
190
  ### Evaluations
191
 
192
  All 16 *Pythia* models were evaluated using the [LM Evaluation
193
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
194
  the results by model and step at `results/json/*` in the [GitHub
195
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
196
+ Expand the sections below to see plots of evaluation results for all
197
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
198
+
199
+ <details>
200
+ <summary>LAMBADA – OpenAI</summary>
201
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
202
+ </details>
203
+
204
+ <details>
205
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
206
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
207
+ </details>
208
+
209
+ <details>
210
+ <summary>WinoGrande</summary>
211
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
212
+ </details>
213
+
214
+ <details>
215
+ <summary>AI2 Reasoning Challenge—Challenge Set</summary>
216
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
217
+ </details>
218
+
219
+ <details>
220
+ <summary>SciQ</summary>
221
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
222
+ </details>
223
 
224
  ### Naming convention and parameter count
225