File size: 10,216 Bytes
591f386
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- the_pile
---

The *Pythia Scaling Suite* is a collection of models developed to facilitate 
interpretability research. It contains two sets of eight models of sizes 
70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two 
models: one trained on the Pile, and one trained on the Pile after the dataset 
has been globally deduplicated. All 8 model sizes are trained on the exact 
same data, in the exact same order. All Pythia models are available 
[on Hugging Face](https://huggingface.co/EleutherAI).

Some design choices were made for the sake of interpretability research and 
to ensure consistency across all models. However, the Pythia models are 
competitive with, or mildly outperform, other similar and same-sized models, 
such as OPT and the GPT-Neo suite.

Please note that all models in the *Pythia* suite were re-named in January 
2023. For clarity, a <a href="#naming-convention-and-parameter-count">table 
comparing the old and new names</a> is provided in this model card, together 
with exact model parameter counts.

## Pythia-70M

### Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
 for training procedure, config files, and details on how to use.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
 Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
 Please read the existing *Pythia* documentation before asking about it in the
EleutherAI Discord. For general correspondence:
[contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate         | Equivalent Models      |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 70M          | 18,915,328           | 6      | 512       | 8     | 2M         | 1.0 x 10<sup>-3</sup> | —                      |
| 160M         | 85,056,000           | 12     | 768       | 12    | 4M         | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
| 410M         | 302,311,424          | 24     | 1024      | 16    | 4M         | 3.0 x 10<sup>-4</sup> | OPT-350M               |
| 1.0B         | 805,736,448          | 16     | 2048      | 8     | 2M         | 3.0 x 10<sup>-4</sup> | —                      |
| 1.4B         | 1,208,602,624        | 24     | 2048      | 16    | 4M         | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B         | 2,517,652,480        | 32     | 2560      | 32    | 2M         | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B         | 6,444,163,072        | 32     | 4096      | 32    | 2M         | 1.2 x 10<sup>-4</sup> | OPT-6.7B               |
| 12B          | 11,327,027,200       | 36     | 5120      | 40    | 2M         | 1.2 x 10<sup>-4</sup> | —                      |
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and 
non-deduped models of a given size have the same hyperparameters. “Equivalent” 
models have <b>exactly</b> the same architecture, and the same number of 
non-embedding parameters.</figcaption>
</figure>

### Uses and Limitations

#### Intended Use

All Pythia models were developed specifically for research purposes. This 
suite is intended to provide a controlled setting for performing scientific 
experiments. To enable the study of how language models change over the course 
of training, we provide 143 evenly spaced intermediate checkpoints per model. 
These checkpoints are hosted on Hugging Face as branches. Note that branch 
`143000` corresponds exactly to the model checkpoint on the `main` branch 
of each model.

#### Out-of-scope use

Performance on NLP benchmarks is not a priority for *Pythia* models, although 
its evaluation results are competitive with similarly-sized language models, 
such as those from the OPT and BLOOM suites.

Pythia-70M has not been fine-tuned for downstream tasks for which 
language models are commonly deployed, such as writing genre prose, 
or commercial chatbots. This means Pythia-70M will likely **not** 
respond to a given prompt the way e.g. ChatGPT does. This is because, unlike 
this model, ChatGPT was fine-tuned using Reinforcement Learning from Human 
Feedback (RLHF) to better “understand” human instructions.

#### Limitations and biases

The core functionality of a large language model is to take a string of text 
and predict the next token. The token deemed statistically most likely by the 
model need not produce the most “accurate” text. Never rely on 
Pythia-70M to produce factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset 
known to contain profanity and texts that are lewd or otherwise offensive. 
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a 
discussion of documented biases with regards to gender, religion, and race. 
Pythia-70M may produce socially unacceptable or undesirable text, 
*even if* the prompt itself does not include anything explicitly offensive. 

If you plan on using text generated through, for example, the Hosted Inference 
API, we recommend having a human curate the outputs of this language model 
before presenting it to other people. Please inform your audience that the 
text was generated by Pythia-70M.

### Quickstart

Pythia models can be loaded and used via the following code, demonstrated here 
for the third `pythia-70m-deduped` checkpoint:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```

Revision/branch `step143000` corresponds exactly to the model checkpoint on 
the `main` branch of each model.

For more information on how to use all Pythia models, see [documentation on 
GitHub](https://github.com/EleutherAI/pythia).

### Training

#### Training data

[The Pile](https://pile.eleuther.ai/) is a 825GiB general-purpose dataset in 
English. It was created by EleutherAI specifically for training large language 
models. It contains texts from 22 diverse sources, roughly broken down into 
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl), 
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and 
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile 
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources, 
methodology, and a discussion of ethical implications. Consult [the 
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation 
about the Pile and its component datasets. The Pile can be downloaded from 
the [official website](https://pile.eleuther.ai/), or from a [community 
mirror](https://the-eye.eu/public/AI/pile/).

The Pile was **not** deduplicated before being used to train Pythia-70M.

#### Training procedure

All models were trained on the exact same data, in the exact same order. Each 
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each 
model are saved every 2,097,152,000 tokens, spaced evenly throughout training. 
This corresponds to training for just under 1 epoch on the Pile for 
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

All Pythia models trained for the equivalent of 143000 steps at a batch size 
of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a batch 
size of 4M tokens listed were originally trained for 71500 steps instead, with 
checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for 
consistency with all 2M batch models, so `step1000` is the first checkpoint 
for `pythia-1.4b` that was saved (corresponding to step 500 in training), and 
`step1000` is likewise the first `pythia-6.9b` checkpoint that was saved 
(corresponding to 1000 “actual” steps).

See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
 procedure, including [how to reproduce 
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).

### Evaluations

All 16 *Pythia* models were evaluated using the [LM Evaluation 
Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access 
the results by model and step at `results/json/*` in the [GitHub 
repository](https://github.com/EleutherAI/pythia/tree/main/results/json).

February 2023 note: select evaluations and comparison with OPT and BLOOM 
models will be added here at a later date.

### Naming convention and parameter count

Pythia models were re-named in January 2023. It is possible that the old 
naming convention still persists in some documentation by accident. The 
current naming convention (70M, 160M, etc.) is based on total parameter count. 

<figure style="width:32em">
  
| current Pythia suffix | old suffix | total params   | non-embedding params |
| --------------------: | ---------: | -------------: | -------------------: |
| 70M                   | 19M        | 70,426,624     | 18,915,328           |
| 160M                  | 125M       | 162,322,944    | 85,056,000           |
| 410M                  | 350M       | 405,334,016    | 302,311,424          |
| 1B                    | 800M       | 1,011,781,632  | 805,736,448          |
| 1.4B                  | 1.3B       | 1,414,647,808  | 1,208,602,624        |
| 2.8B                  | 2.7B       | 2,775,208,960  | 2,517,652,480        |
| 6.9B                  | 6.7B       | 6,857,302,016  | 6,444,163,072        |
| 12B                   | 13B        | 11,846,072,320 | 11,327,027,200       |
</figure>