avi-skowron committed on
Commit 8aa12eb
Parent(s): 60cc0d0

Add evaluations

Files changed (1)
  1. README.md +75 -42
README.md CHANGED
@@ -16,17 +16,18 @@ interpretability research. It contains two sets of eight models of sizes
  models: one trained on the Pile, and one trained on the Pile after the dataset
  has been globally deduplicated. All 8 model sizes are trained on the exact
  same data, in the exact same order. All Pythia models are available
- [on Hugging Face](https://huggingface.co/EleutherAI).

- Some design choices were made for the sake of interpretability research and
- to ensure consistency across all models. However, the Pythia models are
- competitive with, or mildly outperform, other similar and same-sized models,
- such as OPT and the GPT-Neo suite.

- Please note that all models in the *Pythia* suite were re-named in January
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
  comparing the old and new names</a> is provided in this model card, together
- with exact model parameter counts.

  ## Pythia-70M

@@ -39,11 +40,11 @@ with exact model parameter counts.
  for training procedure, config files, and details on how to use.
  - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
  - License: Apache 2.0
- - Contact: to ask questions about this model, join the [EleutherAI
- Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
- Please read the existing *Pythia* documentation before asking about it in the
- EleutherAI Discord. For general correspondence:
- [contact@eleuther.ai](mailto:contact@eleuther.ai).

  <figure>

@@ -67,26 +68,35 @@ non-embedding parameters.</figcaption>

  #### Intended Use

- All Pythia models were developed specifically for research purposes. This
- suite is intended to provide a controlled setting for performing scientific
- experiments. To enable the study of how language models change over the course
- of training, we provide 143 evenly spaced intermediate checkpoints per model.
- These checkpoints are hosted on Hugging Face as branches. Note that branch
- `143000` corresponds exactly to the model checkpoint on the `main` branch
- of each model.

  #### Out-of-scope use

- Performance on NLP benchmarks is not a priority for *Pythia* models, although
- its evaluation results are competitive with similarly-sized language models,
- such as those from the OPT and BLOOM suites.

- Pythia-70M has not been fine-tuned for downstream tasks for which
  language models are commonly deployed, such as writing genre prose,
- or commercial chatbots. This means Pythia-70M will likely **not**
- respond to a given prompt the way e.g. ChatGPT does. This is because, unlike
- this model, ChatGPT was fine-tuned using Reinforcement Learning from Human
- Feedback (RLHF) to better “understand” human instructions.

  #### Limitations and biases

@@ -99,8 +109,8 @@ This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
  known to contain profanity and texts that are lewd or otherwise offensive.
  See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
  discussion of documented biases with regards to gender, religion, and race.
- Pythia-70M may produce socially unacceptable or undesirable text,
- *even if* the prompt itself does not include anything explicitly offensive.

  If you plan on using text generated through, for example, the Hosted Inference
  API, we recommend having a human curate the outputs of this language model
@@ -133,8 +143,7 @@ tokenizer.decode(tokens[0])
  ```

  Revision/branch `step143000` corresponds exactly to the model checkpoint on
- the `main` branch of each model.
-
  For more information on how to use all Pythia models, see [documentation on
  GitHub](https://github.com/EleutherAI/pythia).

@@ -153,8 +162,7 @@ methodology, and a discussion of ethical implications. Consult [the
  datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
  about the Pile and its component datasets. The Pile can be downloaded from
  the [official website](https://pile.eleuther.ai/), or from a [community
- mirror](https://the-eye.eu/public/AI/pile/).
-
  The Pile was **not** deduplicated before being used to train Pythia-70M.

  #### Training procedure
@@ -165,32 +173,57 @@ model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
  This corresponds to training for just under 1 epoch on the Pile for
  non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

- All Pythia models trained for the equivalent of 143000 steps at a batch size
  of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a batch
  size of 4M tokens listed were originally trained for 71500 steps instead, with
  checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
  consistency with all 2M batch models, so `step1000` is the first checkpoint
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
- (corresponding to 1000 “actual” steps).
-
  See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
  procedure, including [how to reproduce
- it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).

  ### Evaluations

  All 16 *Pythia* models were evaluated using the [LM Evaluation
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
  the results by model and step at `results/json/*` in the [GitHub
- repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
-
- February 2023 note: select evaluations and comparison with OPT and BLOOM
- models will be added here at a later date.

  ### Naming convention and parameter count

- Pythia models were re-named in January 2023. It is possible that the old
  naming convention still persists in some documentation by accident. The
  current naming convention (70M, 160M, etc.) is based on total parameter count.
 
  models: one trained on the Pile, and one trained on the Pile after the dataset
  has been globally deduplicated. All 8 model sizes are trained on the exact
  same data, in the exact same order. All Pythia models are available
+ [on Hugging Face](https://huggingface.co/models?other=pythia).

+ The Pythia model suite was deliberately designed to promote scientific
+ research on large language models, especially interpretability research.
+ Despite not centering downstream performance as a design goal, we find the
+ models <a href="#evaluations">match or exceed</a> the performance of
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.

+ Please note that all models in the *Pythia* suite were renamed in January
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
  comparing the old and new names</a> is provided in this model card, together
+ with exact parameter counts.

  ## Pythia-70M

  for training procedure, config files, and details on how to use.
  - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
  - License: Apache 2.0
+ - Contact: to ask questions about this model, join the [EleutherAI
+ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
+ Please read the existing *Pythia* documentation before asking about it in the
+ EleutherAI Discord. For general correspondence:
+ [contact@eleuther.ai](mailto:contact@eleuther.ai).

  <figure>


  #### Intended Use

+ The primary intended use of Pythia is research on the behavior, functionality,
+ and limitations of large language models. This suite is intended to provide
+ a controlled setting for performing scientific experiments. To enable the
+ study of how language models change over the course of training, we provide
+ 143 evenly spaced intermediate checkpoints per model. These checkpoints are
+ hosted on Hugging Face as branches. Note that branch `step143000` corresponds
+ exactly to the model checkpoint on the `main` branch of each model.
+
+ You may also further fine-tune and adapt Pythia-70M for deployment,
+ as long as your use is in accordance with the Apache 2.0 license. Pythia
+ models work with the Hugging Face [Transformers
+ Library](https://huggingface.co/docs/transformers/index). If you decide to use
+ pre-trained Pythia-70M as a basis for your fine-tuned model, please
+ conduct your own risk and bias assessment.
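
As a minimal sketch of the checkpoint-as-branch scheme described above, the standard `revision` argument of `from_pretrained` selects a checkpoint branch; `step3000` below is only an example branch name following the `stepN` convention.

```python
# Sketch: load Pythia-70M at an intermediate training checkpoint.
# "step3000" is an example branch; "step143000" matches the model on `main`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", revision="step3000"
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m", revision="step3000"
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
print(tokenizer.decode(tokens[0]))
```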

  #### Out-of-scope use

+ The Pythia Suite is **not** intended for deployment. It is not in itself
+ a product and cannot be used for human-facing interactions.

+ Pythia models are English-language only, and are not suitable for translation
+ or generating text in other languages.
+
+ Pythia-70M has not been fine-tuned for downstream contexts in which
  language models are commonly deployed, such as writing genre prose,
+ or commercial chatbots. This means Pythia-70M will **not**
+ respond to a given prompt the way a product like ChatGPT does. This is because,
+ unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
+ Learning from Human Feedback (RLHF) to better “understand” human instructions.

  #### Limitations and biases

 
  known to contain profanity and texts that are lewd or otherwise offensive.
  See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
  discussion of documented biases with regards to gender, religion, and race.
+ Pythia-70M may produce socially unacceptable or undesirable text, *even if*
+ the prompt itself does not include anything explicitly offensive.

  If you plan on using text generated through, for example, the Hosted Inference
  API, we recommend having a human curate the outputs of this language model
 
  ```

  Revision/branch `step143000` corresponds exactly to the model checkpoint on
+ the `main` branch of each model.<br>
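
To see which checkpoint branches a given model actually exposes, one option (a sketch, assuming the `huggingface_hub` client is installed) is to list the repository's refs:

```python
# Sketch: enumerate the checkpoint branches published for a Pythia model.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("EleutherAI/pythia-70m")
step_branches = sorted(
    (ref.name for ref in refs.branches if ref.name.startswith("step")),
    key=lambda name: int(name[len("step"):]),
)
print(len(step_branches), step_branches[:3], step_branches[-1])
```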
 
  For more information on how to use all Pythia models, see [documentation on
  GitHub](https://github.com/EleutherAI/pythia).

 
  datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
  about the Pile and its component datasets. The Pile can be downloaded from
  the [official website](https://pile.eleuther.ai/), or from a [community
+ mirror](https://the-eye.eu/public/AI/pile/).<br>
  The Pile was **not** deduplicated before being used to train Pythia-70M.

  #### Training procedure
 
  This corresponds to training for just under 1 epoch on the Pile for
  non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

+ All *Pythia* models trained for the equivalent of 143000 steps at a batch size
  of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a batch
  size of 4M tokens listed were originally trained for 71500 steps instead, with
  checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
  consistency with all 2M batch models, so `step1000` is the first checkpoint
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
+ (corresponding to 1000 “actual” steps).<br>
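
To make the renaming concrete, here is a small hypothetical helper (not part of the Pythia repository) that maps a Hugging Face `stepN` branch back to the optimizer step actually taken during training:

```python
# Hypothetical helper: branch names count steps as if every model used a
# 2M-token batch, so 4M-token-batch models map branch stepN to training step
# N/2, while 2M-token-batch models need no remapping.
def actual_training_step(hf_step: int, batch_tokens: int) -> int:
    if batch_tokens == 4 * 1024 * 1024:   # 4M-token batch (e.g. pythia-1.4b)
        return hf_step // 2
    return hf_step                        # 2M-token batch (e.g. pythia-6.9b)

print(actual_training_step(1000, 4 * 1024 * 1024))  # -> 500 "actual" steps
print(actual_training_step(1000, 2 * 1024 * 1024))  # -> 1000 "actual" steps
```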
 
  See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
  procedure, including [how to reproduce
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
+ Pythia uses the same tokenizer as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
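
As a quick sanity check of that tokenizer note (assuming both repositories are accessible), the two tokenizers should encode text identically:

```python
# Sketch: Pythia's tokenizer should match the GPT-NeoX-20B tokenizer.
from transformers import AutoTokenizer

pythia_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

text = "Pythia uses the same tokenizer as GPT-NeoX-20B."
assert pythia_tok.encode(text) == neox_tok.encode(text)
print(pythia_tok.tokenize(text))
```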

  ### Evaluations

  All 16 *Pythia* models were evaluated using the [LM Evaluation
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
  the results by model and step at `results/json/*` in the [GitHub
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
+ Expand the sections below to see plots of evaluation results for all
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
+
+ <details>
+ <summary>LAMBADA – OpenAI</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>WinoGrande</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>AI2 Reasoning Challenge—Challenge Set</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>SciQ</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
+ </details>
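
To reproduce a point on the plots above, the sketch below uses the harness's Python API; it assumes a recent release in which the Hugging Face backend is named `hf` and `simple_evaluate` is available, so exact names may differ from the version used to produce the published `results/json/*` files.

```python
# Hedged sketch: score one Pythia checkpoint on a few of the tasks plotted above.
# Assumes a recent lm-evaluation-harness ("hf" model type, simple_evaluate API).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m,revision=step143000",
    tasks=["lambada_openai", "piqa", "winogrande"],
)
print(results["results"])
```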

  ### Naming convention and parameter count

+ *Pythia* models were renamed in January 2023. It is possible that the old
  naming convention still persists in some documentation by accident. The
  current naming convention (70M, 160M, etc.) is based on total parameter count.