avi-skowron committed
Commit 44b049e
Parent: c948b8e

Add evaluations

Files changed (1): README.md (+50, -30)
README.md CHANGED
 
@@ -16,18 +16,18 @@ interpretability research. It contains two sets of eight models of sizes
 models: one trained on the Pile, and one trained on the Pile after the dataset
 has been globally deduplicated. All 8 model sizes are trained on the exact
 same data, in the exact same order. All Pythia models are available
- [on Hugging Face](https://huggingface.co/EleutherAI).
+ [on Hugging Face](https://huggingface.co/models?other=pythia).
 
 The Pythia model suite was deliberately designed to promote scientific
 research on large language models, especially interpretability research.
 Despite not centering downstream performance as a design goal, we find the
- models match or exceed the performance of similar and same-sized models,
- such as those in the OPT and GPT-Neo suites.
+ models <a href="#evaluations">match or exceed</a> the performance of
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.
 
- Please note that all models in the *Pythia* suite were re-named in January
+ Please note that all models in the *Pythia* suite were renamed in January
 2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
 comparing the old and new names</a> is provided in this model card, together
- with exact model parameter counts.
+ with exact parameter counts.
 
 ## Pythia-160M
 
@@ -76,16 +76,17 @@ study of how language models change over the course of training, we provide
 hosted on Hugging Face as branches. Note that branch `143000` corresponds
 exactly to the model checkpoint on the `main` branch of each model.
 
- You may also further fine-tune and adapt Pythia-160M for deployment, as long as your
- use is in accordance with the Apache 2.0 license. Pythia models work with the
- Hugging Face [Transformers Library](https://huggingface.co/docs/transformers/index).
- If you decide to use pre-trained Pythia-160M as a basis for your
- fine-tuned model, please conduct your own risk and bias assessment.
+ You may also further fine-tune and adapt Pythia-160M for deployment,
+ as long as your use is in accordance with the Apache 2.0 license. Pythia
+ models work with the Hugging Face [Transformers
+ Library](https://huggingface.co/docs/transformers/index). If you decide to use
+ pre-trained Pythia-160M as a basis for your fine-tuned model, please
+ conduct your own risk and bias assessment.
 
 #### Out-of-scope use
 
 The Pythia Suite is **not** intended for deployment. It is not in itself
- a product, and cannot be used for human-facing interactions.
+ a product and cannot be used for human-facing interactions.
 
 Pythia models are English-language only, and are not suitable for translation
 or generating text in other languages.
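
To illustrate the checkpoint scheme described above (intermediate checkpoints hosted as branches, loadable through the Transformers library), here is a minimal sketch; the branch name and generation settings are assumptions for illustration, not part of the commit:

```python
# Minimal sketch: load Pythia-160M from an intermediate checkpoint branch and
# generate a short continuation. Branch names follow the "step<N>" pattern
# described in this model card; "step143000" is stated to match `main`.
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "step143000"  # any saved checkpoint branch, e.g. "step3000"
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision=revision)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(tokens[0]))
```

Any of the 143 saved checkpoints can be selected the same way; omitting `revision` loads `main`.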
 
@@ -93,10 +94,9 @@ or generating text in other languages.
 Pythia-160M has not been fine-tuned for downstream contexts in which
 language models are commonly deployed, such as writing genre prose,
 or commercial chatbots. This means Pythia-160M will **not**
- respond to a given prompt the way a product like ChatGPT does. This is because, unlike
- this model, ChatGPT was fine-tuned using methods such as Reinforcement
- Learning from Human Feedback (RLHF) to better “understand” human
- instructions.
+ respond to a given prompt the way a product like ChatGPT does. This is because,
+ unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
+ Learning from Human Feedback (RLHF) to better “understand” human instructions.
 
 #### Limitations and biases
 
@@ -143,8 +143,7 @@ tokenizer.decode(tokens[0])
 ```
 
 Revision/branch `step143000` corresponds exactly to the model checkpoint on
- the `main` branch of each model.
-
+ the `main` branch of each model.<br>
 For more information on how to use all Pythia models, see [documentation on
 GitHub](https://github.com/EleutherAI/pythia).
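
The hunk above says revision `step143000` matches the `main` branch exactly. A hedged spot-check of that claim (downloads the weights twice; illustration only):

```python
# Sketch: compare the `main` and `step143000` weights of Pythia-160M.
# If the branches really hold the same checkpoint, every tensor should match.
import torch
from transformers import GPTNeoXForCausalLM

main_model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m")
step_model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision="step143000")

identical = all(
    torch.equal(a, b)
    for a, b in zip(main_model.state_dict().values(), step_model.state_dict().values())
)
print(identical)  # expected: True
```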
@@ -163,15 +162,11 @@ methodology, and a discussion of ethical implications. Consult [the
 datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
 about the Pile and its component datasets. The Pile can be downloaded from
 the [official website](https://pile.eleuther.ai/), or from a [community
- mirror](https://the-eye.eu/public/AI/pile/).
-
+ mirror](https://the-eye.eu/public/AI/pile/).<br>
 The Pile was **not** deduplicated before being used to train Pythia-160M.
 
 #### Training procedure
 
- Pythia uses the same tokenizer as [GPT-NeoX-
- 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
-
 All models were trained on the exact same data, in the exact same order. Each
 model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
 model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
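
The token counts above are internally consistent. A quick arithmetic check, assuming the "2M" batch mentioned further down in this diff means 1024 sequences of 2,048 tokens:

```python
# Sketch: sanity-check the checkpoint arithmetic quoted above.
# Assumption: a "2M" batch is 1024 sequences x 2048 tokens = 2,097,152 tokens per step.
tokens_per_step = 1024 * 2048
tokens_per_checkpoint = 1000 * tokens_per_step      # checkpoints every 1,000 steps
print(tokens_per_checkpoint)                        # 2097152000  -> 2,097,152,000 tokens
print(143 * tokens_per_checkpoint)                  # 299892736000 -> 299,892,736,000 tokens
```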
 
@@ -185,25 +180,50 @@ checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
 consistency with all 2M batch models, so `step1000` is the first checkpoint
 for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
 `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
- (corresponding to 1000 “actual” steps).
-
+ (corresponding to 1000 “actual” steps).<br>
 See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
 procedure, including [how to reproduce
- it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
+ Pythia uses the same tokenizer as [GPT-NeoX-
+ 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
 
 ### Evaluations
 
 All 16 *Pythia* models were evaluated using the [LM Evaluation
 Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
 the results by model and step at `results/json/*` in the [GitHub
- repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
-
- February 2023 note: select evaluations and comparison with OPT and BLOOM
- models will be added here at a later date.
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
+ Expand the sections below to see plots of evaluation results for all
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
+
+ <details>
+ <summary>LAMBADA – OpenAI</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>WinoGrande</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>AI2 Reasoning Challenge – Challenge Set</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>SciQ</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
+ </details>
 
 ### Naming convention and parameter count
 
- *Pythia* models were re-named in January 2023. It is possible that the old
+ *Pythia* models were renamed in January 2023. It is possible that the old
 naming convention still persists in some documentation by accident. The
 current naming convention (70M, 160M, etc.) is based on total parameter count.
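
The training-procedure hunk above notes that Pythia reuses the GPT-NeoX-20B tokenizer. A small check that the two public tokenizers agree on a sample string (illustration only; it assumes both repositories are reachable):

```python
# Sketch: confirm that Pythia-160M and GPT-NeoX-20B ship the same tokenizer
# by comparing the token IDs they produce for the same string.
from transformers import AutoTokenizer

pythia_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

sample = "The Pythia suite shares its tokenizer with GPT-NeoX-20B."
print(pythia_tok.encode(sample) == neox_tok.encode(sample))  # expected: True
```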
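The evaluation plots added by this commit come from the LM Evaluation Harness. A sketch of reproducing a few of the plotted tasks locally; the `simple_evaluate` entry point and argument names follow the harness documentation from around the time of this commit and may differ in newer versions:

```python
# Sketch: score Pythia-160M on a subset of the tasks plotted above using the
# LM Evaluation Harness (pip install lm-eval). Argument names are assumed from
# the harness documentation at the time of this commit and may have changed.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_challenge", "sciq"],
)
print(results["results"])
```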
229