avi-skowron committed · Commit 8aa12eb · Parent(s): 60cc0d0

Add evaluations
README.md CHANGED
@@ -16,17 +16,18 @@ interpretability research. It contains two sets of eight models of sizes
 models: one trained on the Pile, and one trained on the Pile after the dataset
 has been globally deduplicated. All 8 model sizes are trained on the exact
 same data, in the exact same order. All Pythia models are available
-[on Hugging Face](https://huggingface.co/…
+[on Hugging Face](https://huggingface.co/models?other=pythia).

-…
-…
-…
-…
+The Pythia model suite was deliberately designed to promote scientific
+research on large language models, especially interpretability research.
+Despite not centering downstream performance as a design goal, we find the
+models <a href="#evaluations">match or exceed</a> the performance of
+similar and same-sized models, such as those in the OPT and GPT-Neo suites.

-Please note that all models in the *Pythia* suite were…
+Please note that all models in the *Pythia* suite were renamed in January
 2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
 comparing the old and new names</a> is provided in this model card, together
-with exact…
+with exact parameter counts.

 ## Pythia-70M

@@ -39,11 +40,11 @@ with exact model parameter counts.
 for training procedure, config files, and details on how to use.
 - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
 - License: Apache 2.0
-- Contact: to ask questions about this model, join the [EleutherAI…
-…
-Please read the existing *Pythia* documentation before asking about it in the
-EleutherAI Discord. For general correspondence:…
-…
+- Contact: to ask questions about this model, join the [EleutherAI
+Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
+Please read the existing *Pythia* documentation before asking about it in the
+EleutherAI Discord. For general correspondence:
+[contact@eleuther.ai](mailto:contact@eleuther.ai).

 <figure>
@@ -67,26 +68,35 @@ non-embedding parameters.</figcaption>

 #### Intended Use

-…
-…
-…
-… of …
-…
-…
-… of each model.
+The primary intended use of Pythia is research on the behavior, functionality,
+and limitations of large language models. This suite is intended to provide
+a controlled setting for performing scientific experiments. To enable the
+study of how language models change over the course of training, we provide
+143 evenly spaced intermediate checkpoints per model. These checkpoints are
+hosted on Hugging Face as branches. Note that branch `143000` corresponds
+exactly to the model checkpoint on the `main` branch of each model.
+
+You may also further fine-tune and adapt Pythia-70M for deployment,
+as long as your use is in accordance with the Apache 2.0 license. Pythia
+models work with the Hugging Face [Transformers
+Library](https://huggingface.co/docs/transformers/index). If you decide to use
+pre-trained Pythia-70M as a basis for your fine-tuned model, please
+conduct your own risk and bias assessment.

 #### Out-of-scope use

-…
-…
-… such as those from the OPT and BLOOM suites.
+The Pythia Suite is **not** intended for deployment. It is not in itself
+a product and cannot be used for human-facing interactions.

-Pythia…
+Pythia models are English-language only, and are not suitable for translation
+or generating text in other languages.
+
+Pythia-70M has not been fine-tuned for downstream contexts in which
 language models are commonly deployed, such as writing genre prose,
-or commercial chatbots. This means Pythia-70M will…
-respond to a given prompt the way…
-… this model, ChatGPT was fine-tuned using…
-… Feedback (RLHF) to better “understand” human instructions.
+or commercial chatbots. This means Pythia-70M will **not**
+respond to a given prompt the way a product like ChatGPT does. This is because,
+unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
+Learning from Human Feedback (RLHF) to better “understand” human instructions.

 #### Limitations and biases
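The added intended-use text explains that the 143 intermediate checkpoints are ordinary Hugging Face branches, so loading one only takes a `revision` argument. A minimal sketch, assuming the `transformers` library is installed; the branch `step3000` is an arbitrary illustrative pick:

```python
# Load an intermediate Pythia-70M checkpoint from its branch.
# `step3000` is an arbitrary example; any of the 143 `step<N>` branches works.
from transformers import AutoTokenizer, GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m",
    revision="step3000",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m",
    revision="step3000",
)
```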
@@ -99,8 +109,8 @@ This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
 known to contain profanity and texts that are lewd or otherwise offensive.
 See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
 discussion of documented biases with regards to gender, religion, and race.
-Pythia-70M may produce socially unacceptable or undesirable text,…
-…
+Pythia-70M may produce socially unacceptable or undesirable text, *even if*
+the prompt itself does not include anything explicitly offensive.

 If you plan on using text generated through, for example, the Hosted Inference
 API, we recommend having a human curate the outputs of this language model
@@ -133,8 +143,7 @@ tokenizer.decode(tokens[0])
 ```

 Revision/branch `step143000` corresponds exactly to the model checkpoint on
-the `main` branch of each model…
-…
+the `main` branch of each model.<br>
 For more information on how to use all Pythia models, see [documentation on
 GitHub](https://github.com/EleutherAI/pythia).

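Because `step143000` and `main` name the same checkpoint, loading with and without a `revision` argument should yield identical weights. A quick sanity check, assuming `transformers` (with its PyTorch backend) is installed:

```python
# Sanity check: branch `step143000` matches the `main` branch exactly.
from transformers import GPTNeoXForCausalLM

m_main = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m")
m_last = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", revision="step143000"
)

# Every parameter tensor should compare equal.
last_state = m_last.state_dict()
for name, tensor in m_main.state_dict().items():
    assert (tensor == last_state[name]).all(), name
```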
@@ -153,8 +162,7 @@ methodology, and a discussion of ethical implications. Consult [the
 datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
 about the Pile and its component datasets. The Pile can be downloaded from
 the [official website](https://pile.eleuther.ai/), or from a [community
-mirror](https://the-eye.eu/public/AI/pile/)…
-…
+mirror](https://the-eye.eu/public/AI/pile/).<br>
 The Pile was **not** deduplicated before being used to train Pythia-70M.

 #### Training procedure
@@ -165,32 +173,57 @@ model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
 This corresponds to training for just under 1 epoch on the Pile for
 non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

-All Pythia models trained for the equivalent of 143000 steps at a batch size
+All *Pythia* models trained for the equivalent of 143000 steps at a batch size
 of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a batch
 size of 4M tokens listed were originally trained for 71500 steps instead, with
 checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
 consistency with all 2M batch models, so `step1000` is the first checkpoint
 for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
 `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
-(corresponding to 1000 “actual” steps)…
-…
+(corresponding to 1000 “actual” steps).<br>
 See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
 procedure, including [how to reproduce
-it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training)…
+it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
+Pythia uses the same tokenizer as
+[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).

 ### Evaluations

 All 16 *Pythia* models were evaluated using the [LM Evaluation
 Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
 the results by model and step at `results/json/*` in the [GitHub
-repository](https://github.com/EleutherAI/pythia/tree/main/results/json)…
-…
-…
-…
+repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
+Expand the sections below to see plots of evaluation results for all
+Pythia and Pythia-deduped models compared with OPT and BLOOM.
+
+<details>
+<summary>LAMBADA – OpenAI</summary>
+<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
+</details>
+
+<details>
+<summary>Physical Interaction: Question Answering (PIQA)</summary>
+<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
+</details>
+
+<details>
+<summary>WinoGrande</summary>
+<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
+</details>
+
+<details>
+<summary>AI2 Reasoning Challenge—Challenge Set</summary>
+<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
+</details>
+
+<details>
+<summary>SciQ</summary>
+<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
+</details>

 ### Naming convention and parameter count

-Pythia models were…
+*Pythia* models were renamed in January 2023. It is possible that the old
 naming convention still persists in some documentation by accident. The
 current naming convention (70M, 160M, etc.) is based on total parameter count.
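The checkpoint renaming in the training-procedure hunk above follows simple arithmetic: every renamed step corresponds to 2,097,152 tokens, so a 4M-batch model's renamed step is twice its "actual" optimizer step. A worked example; the helper names are mine, not part of the Pythia tooling:

```python
# Worked example of the checkpoint bookkeeping described above.
# Helper names are illustrative only, not part of the Pythia tooling.
STEP_TOKENS = 2_097_152  # tokens per renamed (2M-batch-equivalent) step

def tokens_seen(renamed_step: int) -> int:
    """Tokens consumed by the time checkpoint `step<renamed_step>` was saved."""
    return renamed_step * STEP_TOKENS

def actual_step(renamed_step: int, batch_tokens: int) -> int:
    """Optimizer step a renamed checkpoint corresponds to, per batch size."""
    return renamed_step * STEP_TOKENS // batch_tokens

print(tokens_seen(143_000))           # 299,892,736,000 tokens at end of training
print(actual_step(1_000, 2_097_152))  # 1000 for 2M-batch models
print(actual_step(1_000, 4_194_304))  # 500: first saved 4M-batch checkpoint
```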
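The diff also notes that Pythia shares its tokenizer with GPT-NeoX-20B, so both repositories should tokenize text identically. A quick check of that claim, assuming `transformers` and network access (only tokenizer files are fetched, not the 20B weights):

```python
# Verify that Pythia-70M and GPT-NeoX-20B share the same tokenizer.
from transformers import AutoTokenizer

tok_pythia = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
tok_neox = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

text = "The Pile was not deduplicated before training."
assert tok_pythia(text)["input_ids"] == tok_neox(text)["input_ids"]
```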
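Finally, the evaluation plots can in principle be reproduced by re-running the plotted tasks with the LM Evaluation Harness. A sketch using the harness's Python API; the entry point below assumes lm-eval 0.4 or later (earlier releases used a different interface), so treat it as illustrative rather than the exact command used for these results:

```python
# Re-run the plotted tasks for one Pythia model (illustrative sketch,
# assuming lm-eval >= 0.4; earlier harness versions used a different API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_challenge", "sciq"],
)
print(results["results"])
```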