EleutherAI
/

pythia-2.8b-deduped-v0

+---
+language:
+- en
+tags:
+- pytorch
+- causal-lm
+- pythia
+license: apache-2.0
+datasets:
+- EleutherAI/raw_deduplicated_pile
+---
+The *Pythia Scaling Suite* is a collection of models developed to facilitate
+interpretability research. It contains two sets of eight models of sizes
+70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
+models: one trained on the Pile, and one trained on the Pile after the dataset
+has been globally deduplicated. All 8 model sizes are trained on the exact
+same data, in the exact same order. All Pythia models are available
+[on Hugging Face](https://huggingface.co/EleutherAI).
+The Pythia model suite was deliberately designed to promote scientific
+research on large language models, especially interpretability research.
+Despite not centering downstream performance as a design goal, we find the
+models match or exceed the performance of similar and same-sized models,
+such as those in the OPT and GPT-Neo suites.
+Please note that all models in the *Pythia* suite were renamed in January
+2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
+comparing the old and new names</a> is provided in this model card, together
+with exact model parameter counts.
+## Pythia-2.8B-deduped
+### Model Details
+- Developed by: [EleutherAI](http://eleuther.ai)
+- Model type: Transformer-based Language Model
+- Language: English
+- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
+ for training procedure, config files, and details on how to use.
+- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
+- License: Apache 2.0
+- Contact: to ask questions about this model, join the [EleutherAI
+Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
+ Please read the existing *Pythia* documentation before asking about it in the
+ EleutherAI Discord. For general correspondence: [contact@eleuther.
+ ai](mailto:contact@eleuther.ai).
+<figure>
+| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate         | Equivalent Models      |
+| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
+| 70M          | 18,915,328           | 6      | 512       | 8     | 2M         | 1.0 x 10<sup>-3</sup> | —                      |
+| 160M         | 85,056,000           | 12     | 768       | 12    | 4M         | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
+| 410M         | 302,311,424          | 24     | 1024      | 16    | 4M         | 3.0 x 10<sup>-4</sup> | OPT-350M               |
+| 1.0B         | 805,736,448          | 16     | 2048      | 8     | 2M         | 3.0 x 10<sup>-4</sup> | —                      |
+| 1.4B         | 1,208,602,624        | 24     | 2048      | 16    | 4M         | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
+| 2.8B         | 2,517,652,480        | 32     | 2560      | 32    | 2M         | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
+| 6.9B         | 6,444,163,072        | 32     | 4096      | 32    | 2M         | 1.2 x 10<sup>-4</sup> | OPT-6.7B               |
+| 12B          | 11,327,027,200       | 36     | 5120      | 40    | 2M         | 1.2 x 10<sup>-4</sup> | —                      |
+<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
+non-deduped models of a given size have the same hyperparameters. “Equivalent”
+models have <b>exactly</b> the same architecture, and the same number of
+non-embedding parameters.</figcaption>
+</figure>
+### Uses and Limitations
+#### Intended Use
+The primary intended use of Pythia is research on the behavior, functionality,
+and limitations of large language models. This suite is intended to provide
+a controlled setting for performing scientific experiments. To enable the
+study of how language models change in the course of training, we provide
+143 evenly spaced intermediate checkpoints per model. These checkpoints are
+hosted on Hugging Face as branches. Note that branch `143000` corresponds
+exactly to the model checkpoint on the `main` branch of each model.
+You may also further fine-tune and adapt Pythia-2.8B-deduped for deployment,
+as long as your use is in accordance with the Apache 2.0 license. Pythia
+models work with the Hugging Face [Transformers
+Library](https://huggingface.co/docs/transformers/index). If you decide to use
+pre-trained Pythia-2.8B-deduped as a basis for your fine-tuned model, please
+conduct your own risk and bias assessment.
+#### Out-of-scope use
+The Pythia Suite is **not** intended for deployment. It is not a in itself
+a product and cannot be used for human-facing interactions.
+Pythia models are English-language only, and are not suitable for translation
+or generating text in other languages.
+Pythia-2.8B-deduped has not been fine-tuned for downstream contexts in which
+language models are commonly deployed, such as writing genre prose,
+or commercial chatbots. This means Pythia-2.8B-deduped will **not**
+respond to a given prompt the way a product like ChatGPT does. This is because,
+ unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
+Learning from Human Feedback (RLHF) to better “understand” human instructions.
+#### Limitations and biases
+The core functionality of a large language model is to take a string of text
+and predict the next token. The token deemed statistically most likely by the
+model need not produce the most “accurate” text. Never rely on
+Pythia-2.8B-deduped to produce factually accurate output.
+This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
+known to contain profanity and texts that are lewd or otherwise offensive.
+See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
+discussion of documented biases with regards to gender, religion, and race.
+Pythia-2.8B-deduped may produce socially unacceptable or undesirable text,
+*even if* the prompt itself does not include anything explicitly offensive.
+If you plan on using text generated through, for example, the Hosted Inference
+API, we recommend having a human curate the outputs of this language model
+before presenting it to other people. Please inform your audience that the
+text was generated by Pythia-2.8B-deduped.
+### Quickstart
+Pythia models can be loaded and used via the following code, demonstrated here
+for the third `pythia-70m-deduped` checkpoint:
+```python
+from transformers import GPTNeoXForCausalLM, AutoTokenizer
+model = GPTNeoXForCausalLM.from_pretrained(
+  "EleutherAI/pythia-70m-deduped",
+  revision="step3000",
+  cache_dir="./pythia-70m-deduped/step3000",
+)
+tokenizer = AutoTokenizer.from_pretrained(
+  "EleutherAI/pythia-70m-deduped",
+  revision="step3000",
+  cache_dir="./pythia-70m-deduped/step3000",
+)
+inputs = tokenizer("Hello, I am", return_tensors="pt")
+tokens = model.generate(**inputs)
+tokenizer.decode(tokens[0])
+```
+Revision/branch `step143000` corresponds exactly to the model checkpoint on
+the `main` branch of each model.
+For more information on how to use all Pythia models, see [documentation on
+GitHub](https://github.com/EleutherAI/pythia).
+### Training
+#### Training data
+Pythia-2.8B-deduped was trained on the Pile **after the dataset has been
+globally deduplicated**.
+[The Pile](https://pile.eleuther.ai/) is a 825GiB general-purpose dataset in
+English. It was created by EleutherAI specifically for training large language
+models. It contains texts from 22 diverse sources, roughly broken down into
+five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
+prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
+miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
+paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
+methodology, and a discussion of ethical implications. Consult [the
+datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
+about the Pile and its component datasets. The Pile can be downloaded from
+the [official website](https://pile.eleuther.ai/), or from a [community
+mirror](https://the-eye.eu/public/AI/pile/).
+#### Training procedure
+Pythia uses the same tokenizer as [GPT-NeoX-
+20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
+All models were trained on the exact same data, in the exact same order. Each
+model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
+model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
+This corresponds to training for just under 1 epoch on the Pile for
+non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
+All *Pythia* models trained for the equivalent of 143000 steps at a batch size
+of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a batch
+size of 4M tokens listed were originally trained for 71500 steps instead, with
+checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
+consistency with all 2M batch models, so `step1000` is the first checkpoint
+for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
+`step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
+(corresponding to 1000 “actual” steps).
+See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
+ procedure, including [how to reproduce
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).
+### Evaluations
+All 16 *Pythia* models were evaluated using the [LM Evaluation
+Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
+the results by model and step at `results/json/*` in the [GitHub
+repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
+February 2023 note: select evaluations and comparison with OPT and BLOOM
+models will be added here at a later date.
+### Naming convention and parameter count
+*Pythia* models were renamed in January 2023. It is possible that the old
+naming convention still persists in some documentation by accident. The
+current naming convention (70M, 160M, etc.) is based on total parameter count.
+<figure style="width:32em">
+| current Pythia suffix | old suffix | total params   | non-embedding params |
+| --------------------: | ---------: | -------------: | -------------------: |
+| 70M                   | 19M        | 70,426,624     | 18,915,328           |
+| 160M                  | 125M       | 162,322,944    | 85,056,000           |
+| 410M                  | 350M       | 405,334,016    | 302,311,424          |
+| 1B                    | 800M       | 1,011,781,632  | 805,736,448          |
+| 1.4B                  | 1.3B       | 1,414,647,808  | 1,208,602,624        |
+| 2.8B                  | 2.7B       | 2,775,208,960  | 2,517,652,480        |
+| 6.9B                  | 6.7B       | 6,857,302,016  | 6,444,163,072        |
+| 12B                   | 13B        | 11,846,072,320 | 11,327,027,200       |
+</figure>