---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- the_pile
---

The *Pythia Scaling Suite* is a collection of models developed to facilitate interpretability research. It contains two sets of eight models of sizes 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two models: one trained on the Pile, and one trained on the Pile after the dataset has been globally deduplicated. All eight model sizes are trained on the exact same data, in the exact same order. All Pythia models are available [on Hugging Face](https://huggingface.co/EleutherAI).

Some design choices were made for the sake of interpretability research and of consistency across all models, at the cost of some downstream performance. Even so, the *Pythia* models are competitive with, or mildly outperform, similarly sized models such as those in the OPT and GPT-Neo families. If you are **not** investigating topics such as scaling laws or why large language models do what they do, you may want to consider a more general-purpose model instead, such as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b) or [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B).

Please note that all models in the *Pythia* suite were renamed in January 2023. For clarity, a <a href="#naming-convention-and-parameter-count">table comparing the old and new names</a> is provided in this model card, together with exact model parameter counts.

## Pythia-70M-deduped

### Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia) for training procedure, config files, and details on how to use the models.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI Discord](https://discord.gg/zBGx3azzUn) and post them in `#release-discussion`. Please read the existing *Pythia* documentation before asking about the suite on Discord. For general correspondence: [contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size (tokens) | Learning Rate | Equivalent Models |
| --: | --: | :-: | :-: | :-: | :-: | :-: | :-: |
| **70M** | **18,915,328** | **6** | **512** | **8** | **2M** | \\(1.0\times 10^{-3}\\) | — |
| 160M | 85,056,000 | 12 | 768 | 12 | 4M | \\(6.0\times 10^{-4}\\) | GPT-Neo 125M, OPT-125M |
| 410M | 302,311,424 | 24 | 1024 | 16 | 4M | \\(3.0\times 10^{-4}\\) | OPT-350M |
| 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | \\(3.0\times 10^{-4}\\) | — |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 4M | \\(2.0\times 10^{-4}\\) | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | \\(1.6\times 10^{-4}\\) | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | \\(1.2\times 10^{-4}\\) | OPT-6.7B |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | \\(1.2\times 10^{-4}\\) | — |
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and non-deduped models of a given size have the same hyperparameters. “Equivalent” models have <b>exactly</b> the same architecture and the same number of non-embedding parameters.</figcaption>
</figure>
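
If you want to cross-check a row of this table against the released checkpoints, the architecture hyperparameters can be read from each model's configuration on the Hub. A minimal sketch using `AutoConfig` from `transformers` (the attribute names follow the `GPTNeoXConfig` schema; this snippet is illustrative and not part of the original card):

```python
from transformers import AutoConfig

# Read the architecture hyperparameters for Pythia-70M-deduped from the Hub.
config = AutoConfig.from_pretrained("EleutherAI/pythia-70m-deduped")

# These should match the 70M row above: 6 layers, model dim 512, 8 heads.
print("layers:   ", config.num_hidden_layers)
print("model dim:", config.hidden_size)
print("heads:    ", config.num_attention_heads)
```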

### Uses and Limitations

#### Intended Use

All *Pythia* models were developed specifically for research purposes. This suite is intended to provide a controlled setting for performing scientific experiments. To enable the study of how language models change over the course of training, we provide 143 evenly spaced intermediate checkpoints per model. These checkpoints are hosted on Hugging Face as branches. Note that branch `143000` corresponds exactly to the model checkpoint on the `main` branch of each model.
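
Because the intermediate checkpoints are ordinary git branches, they can be enumerated programmatically. A hedged sketch using the `huggingface_hub` client (the `HfApi.list_repo_refs` call is the assumed way to list branches; the `step<N>` naming pattern is described elsewhere in this card):

```python
from huggingface_hub import HfApi

# List the checkpoint branches of Pythia-70M-deduped hosted on the Hub.
api = HfApi()
refs = api.list_repo_refs("EleutherAI/pythia-70m-deduped")

# Keep only the intermediate-checkpoint branches, which follow the `step<N>` pattern.
steps = sorted(
    int(branch.name[len("step"):])
    for branch in refs.branches
    if branch.name.startswith("step")
)
print(f"{len(steps)} checkpoint branches, from step{steps[0]} to step{steps[-1]}")
```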

<section id="out">

#### Out-of-scope use

Performance on NLP benchmarks is not a priority for *Pythia* models, although their evaluation results are competitive with those of similarly sized language models, such as the OPT and BLOOM suites.

Pythia-70M-deduped has not been fine-tuned for downstream tasks for which language models are commonly deployed, such as writing genre prose or powering commercial chatbots. This means Pythia-70M-deduped will likely **not** respond to a given prompt the way, for example, ChatGPT does: unlike this model, ChatGPT was fine-tuned with Reinforcement Learning from Human Feedback (RLHF) to better “understand” human instructions.
</section>

#### Limitations and biases

The core functionality of a large language model is to take a string of text and predict the next token. The token the model deems statistically most likely need not produce the most “accurate” text. Never rely on Pythia-70M-deduped to produce factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset known to contain profanity and texts that are lewd or otherwise offensive. See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a discussion of documented biases with regard to gender, religion, and race. Pythia-70M-deduped may produce socially unacceptable or undesirable text, *even if* the prompt itself does not include anything explicitly offensive.

If you plan on using generated text, for example via the Hosted Inference API, we recommend having a human curate the model's outputs before presenting them to other people. Please inform your audience that the text was generated by Pythia-70M-deduped.

### Quickstart

Pythia models can be loaded and used via the following code, demonstrated here for the third `pythia-70m-deduped` checkpoint:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the third saved checkpoint (branch `step3000`) of Pythia-70M-deduped.
model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

# The tokenizer is the same for every checkpoint; pinning the revision keeps it reproducible.
tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

# Tokenize a prompt, generate a continuation, and decode it back to text.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```
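
With no extra arguments, `generate` returns a short greedy continuation using the library defaults. Continuing from the quickstart above, a sketch of sampled generation (the parameter values below are illustrative choices, not settings from the original card):

```python
# Sampled generation with an explicit length cap. `pad_token_id` is set to the EOS
# token id to silence the warning transformers emits for models without a pad token.
tokens = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(tokens[0]))
```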

Revision/branch `step143000` corresponds exactly to the model checkpoint on the `main` branch of each model.

For more information on how to use all Pythia models, see the [documentation on GitHub](https://github.com/EleutherAI/pythia).

### Training

#### Training data

Pythia-70M-deduped was trained on [the Pile](https://pile.eleuther.ai/) after it was globally deduplicated.

[The Pile](https://pile.eleuther.ai/) is an 825 GiB general-purpose dataset in English, created by EleutherAI specifically for training large language models. It contains texts from 22 diverse sources, roughly broken down into five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub, Enron Emails). See [the Pile paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources, methodology, and a discussion of ethical implications. Consult [the datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation about the Pile and its component datasets. The Pile can be downloaded from the [official website](https://pile.eleuther.ai/) or from a [community mirror](https://the-eye.eu/public/AI/pile/).

#### Training procedure

All models were trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 tokens during training, and 143 checkpoints per model were saved every 2,097,152,000 tokens, spaced evenly throughout training. This corresponds to just under 1 epoch on the Pile for the non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

All *Pythia* models were trained for the equivalent of 143,000 steps at a batch size of 2,097,152 tokens. Two batch sizes were used: 2M and 4M tokens. The models listed with a 4M-token batch size were originally trained for 71,500 steps instead, with checkpoints saved every 500 steps. Their checkpoints on Hugging Face were renamed for consistency with the 2M-batch models, so `step1000` is the first saved checkpoint for `pythia-1.4b` (corresponding to step 500 in training), while `step1000` is likewise the first saved checkpoint for `pythia-6.9b` (corresponding to 1,000 “actual” steps).
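
The token and step counts above are consistent with one another; a quick sketch of the arithmetic (variable names are purely illustrative):

```python
# Effective training schedule shared by all Pythia models, from the figures above.
batch_tokens = 2_097_152                  # 2M tokens per effective step
total_steps = 143_000
checkpoint_interval_tokens = 2_097_152_000

total_tokens = batch_tokens * total_steps
print(total_tokens)                                 # 299,892,736,000 tokens
print(total_tokens // checkpoint_interval_tokens)   # 143 evenly spaced checkpoints

# For a 4M-batch model such as pythia-1.4b, Hub branch `step<N>` corresponds to
# training step N / 2, e.g. `step1000` was saved at actual step 500.
```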

See [GitHub](https://github.com/EleutherAI/pythia) for more details on the training procedure, including [how to reproduce it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).

### Evaluations

All 16 *Pythia* models were evaluated using the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access the results by model and step at `results/json/*` in the [GitHub repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
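
If you have cloned that repository, a minimal sketch for browsing the result files is shown below; the exact file layout inside `results/json/` is not specified in this card, so the recursive path pattern is an assumption:

```python
import glob
import json

# Walk whatever JSON files sit under results/json/ in a local clone of EleutherAI/pythia
# and print their top-level keys; the directory structure is assumed, not documented here.
for path in sorted(glob.glob("pythia/results/json/**/*.json", recursive=True))[:5]:
    with open(path) as f:
        results = json.load(f)
    print(path, "->", list(results)[:5])
```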

January 2023 note: select evaluations and comparisons with OPT and BLOOM models will be added here at a later date.

### Naming convention and parameter count

*Pythia* models were renamed in January 2023, and the old naming convention may still persist in some documentation by accident. The current naming convention (70M, 160M, etc.) is based on total parameter count.

<figure style="width:32em">

| current Pythia suffix | old suffix | total params | non-embedding params |
| --: | --: | --: | --: |
| 70M | 19M | 70,426,624 | 18,915,328 |
| 160M | 125M | 162,322,944 | 85,056,000 |
| 410M | 350M | 405,334,016 | 302,311,424 |
| 1B | 800M | 1,011,781,632 | 805,736,448 |
| 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
| 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
| 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
| 12B | 13B | 11,846,072,320 | 11,327,027,200 |
</figure>
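
The gap between the two counts in each row is the size of the input and output embedding matrices, which are untied in these GPT-NeoX-style models. A small sketch of that check for the 70M row, assuming the padded embedding vocabulary of 50,304 implied by the numbers in the table:

```python
# Check the 70M row: total params minus non-embedding params should equal the
# input plus output embedding matrices (untied), each of shape vocab_size x model_dim.
total_params = 70_426_624
non_embedding_params = 18_915_328
vocab_size = 50_304   # padded GPT-NeoX vocabulary (assumed from the table's numbers)
model_dim = 512       # from the hyperparameter table above

embedding_params = 2 * vocab_size * model_dim
assert total_params - non_embedding_params == embedding_params  # 51,511,296
```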