avi-skowron committed on
Commit 8aa12eb
Parent(s): 60cc0d0

Add evaluations

Files changed (1)
  1. README.md +75 -42
README.md CHANGED
@@ -16,17 +16,18 @@ interpretability research. It contains two sets of eight models of sizes
  models: one trained on the Pile, and one trained on the Pile after the dataset
  has been globally deduplicated. All 8 model sizes are trained on the exact
  same data, in the exact same order. All Pythia models are available
- [on Hugging Face](https://huggingface.co/EleutherAI).

- Some design choices were made for the sake of interpretability research and
- to ensure consistency across all models. However, the Pythia models are
- competitive with, or mildly outperform, other similar and same-sized models,
- such as OPT and the GPT-Neo suite.

- Please note that all models in the *Pythia* suite were re-named in January
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
  comparing the old and new names</a> is provided in this model card, together
- with exact model parameter counts.

  ## Pythia-70M

@@ -39,11 +40,11 @@ with exact model parameter counts.
  for training procedure, config files, and details on how to use.
  - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
  - License: Apache 2.0
- - Contact: to ask questions about this model, join the [EleutherAI
- Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
- Please read the existing *Pythia* documentation before asking about it in the
- EleutherAI Discord. For general correspondence:
- [contact@eleuther.ai](mailto:contact@eleuther.ai).

  <figure>

@@ -67,26 +68,35 @@ non-embedding parameters.</figcaption>

  #### Intended Use

- All Pythia models were developed specifically for research purposes. This
- suite is intended to provide a controlled setting for performing scientific
- experiments. To enable the study of how language models change over the course
- of training, we provide 143 evenly spaced intermediate checkpoints per model.
- These checkpoints are hosted on Hugging Face as branches. Note that branch
- `143000` corresponds exactly to the model checkpoint on the `main` branch
- of each model.

  #### Out-of-scope use

- Performance on NLP benchmarks is not a priority for *Pythia* models, although
- its evaluation results are competitive with similarly-sized language models,
- such as those from the OPT and BLOOM suites.

- Pythia-70M has not been fine-tuned for downstream tasks for which
  language models are commonly deployed, such as writing genre prose,
- or commercial chatbots. This means Pythia-70M will likely **not**
- respond to a given prompt the way e.g. ChatGPT does. This is because, unlike
- this model, ChatGPT was fine-tuned using Reinforcement Learning from Human
- Feedback (RLHF) to better “understand” human instructions.

  #### Limitations and biases

@@ -99,8 +109,8 @@ This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
  known to contain profanity and texts that are lewd or otherwise offensive.
  See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
  discussion of documented biases with regards to gender, religion, and race.
- Pythia-70M may produce socially unacceptable or undesirable text,
- *even if* the prompt itself does not include anything explicitly offensive.

  If you plan on using text generated through, for example, the Hosted Inference
  API, we recommend having a human curate the outputs of this language model
@@ -133,8 +143,7 @@ tokenizer.decode(tokens[0])
  ```

  Revision/branch `step143000` corresponds exactly to the model checkpoint on
- the `main` branch of each model.
-
  For more information on how to use all Pythia models, see [documentation on
  GitHub](https://github.com/EleutherAI/pythia).

@@ -153,8 +162,7 @@ methodology, and a discussion of ethical implications. Consult [the
  datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
  about the Pile and its component datasets. The Pile can be downloaded from
  the [official website](https://pile.eleuther.ai/), or from a [community
- mirror](https://the-eye.eu/public/AI/pile/).
-
  The Pile was **not** deduplicated before being used to train Pythia-70M.

  #### Training procedure
@@ -165,32 +173,57 @@ model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
  This corresponds to training for just under 1 epoch on the Pile for
  non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

- All Pythia models trained for the equivalent of 143000 steps at a batch size
  of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a batch
  size of 4M tokens listed were originally trained for 71500 steps instead, with
  checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
  consistency with all 2M batch models, so `step1000` is the first checkpoint
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
- (corresponding to 1000 “actual” steps).
-
  See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
  procedure, including [how to reproduce
- it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).

  ### Evaluations

  All 16 *Pythia* models were evaluated using the [LM Evaluation
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
  the results by model and step at `results/json/*` in the [GitHub
- repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
-
- February 2023 note: select evaluations and comparison with OPT and BLOOM
- models will be added here at a later date.

  ### Naming convention and parameter count

- Pythia models were re-named in January 2023. It is possible that the old
  naming convention still persists in some documentation by accident. The
  current naming convention (70M, 160M, etc.) is based on total parameter count.
 
  models: one trained on the Pile, and one trained on the Pile after the dataset
  has been globally deduplicated. All 8 model sizes are trained on the exact
  same data, in the exact same order. All Pythia models are available
+ [on Hugging Face](https://huggingface.co/models?other=pythia).

+ The Pythia model suite was deliberately designed to promote scientific
+ research on large language models, especially interpretability research.
+ Despite not centering downstream performance as a design goal, we find the
+ models <a href="#evaluations">match or exceed</a> the performance of
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.

+ Please note that all models in the *Pythia* suite were renamed in January
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
  comparing the old and new names</a> is provided in this model card, together
+ with exact parameter counts.

  ## Pythia-70M

  for training procedure, config files, and details on how to use.
  - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
  - License: Apache 2.0
+ - Contact: to ask questions about this model, join the [EleutherAI
+ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
+ Please read the existing *Pythia* documentation before asking about it in the
+ EleutherAI Discord. For general correspondence:
+ [contact@eleuther.ai](mailto:contact@eleuther.ai).

  <figure>


  #### Intended Use

+ The primary intended use of Pythia is research on the behavior, functionality,
+ and limitations of large language models. This suite is intended to provide
+ a controlled setting for performing scientific experiments. To enable the
+ study of how language models change over the course of training, we provide
+ 143 evenly spaced intermediate checkpoints per model. These checkpoints are
+ hosted on Hugging Face as branches. Note that branch `step143000` corresponds
+ exactly to the model checkpoint on the `main` branch of each model.
+
+ You may also further fine-tune and adapt Pythia-70M for deployment,
+ as long as your use is in accordance with the Apache 2.0 license. Pythia
+ models work with the Hugging Face [Transformers
+ Library](https://huggingface.co/docs/transformers/index). If you decide to use
+ pre-trained Pythia-70M as a basis for your fine-tuned model, please
+ conduct your own risk and bias assessment.
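
As a minimal sketch of the checkpoint-as-branch scheme described above, the standard `revision` argument of `from_pretrained` selects a checkpoint branch; `step3000` below is only an example branch name following the `stepN` convention.

```python
# Sketch: load Pythia-70M at an intermediate training checkpoint.
# "step3000" is an example branch; "step143000" matches the model on `main`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", revision="step3000"
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m", revision="step3000"
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
print(tokenizer.decode(tokens[0]))
```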

  #### Out-of-scope use

+ The Pythia Suite is **not** intended for deployment. It is not in itself
+ a product and cannot be used for human-facing interactions.

+ Pythia models are English-language only, and are not suitable for translation
+ or generating text in other languages.
+
+ Pythia-70M has not been fine-tuned for downstream contexts in which
  language models are commonly deployed, such as writing genre prose,
+ or commercial chatbots. This means Pythia-70M will **not**
+ respond to a given prompt the way a product like ChatGPT does. This is because,
+ unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
+ Learning from Human Feedback (RLHF) to better “understand” human instructions.

  #### Limitations and biases

 
  known to contain profanity and texts that are lewd or otherwise offensive.
  See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
  discussion of documented biases with regards to gender, religion, and race.
+ Pythia-70M may produce socially unacceptable or undesirable text, *even if*
+ the prompt itself does not include anything explicitly offensive.

  If you plan on using text generated through, for example, the Hosted Inference
  API, we recommend having a human curate the outputs of this language model
 
  ```

  Revision/branch `step143000` corresponds exactly to the model checkpoint on
+ the `main` branch of each model.<br>
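
To see which checkpoint branches a given model actually exposes, one option (a sketch, assuming the `huggingface_hub` client is installed) is to list the repository's refs:

```python
# Sketch: enumerate the checkpoint branches published for a Pythia model.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("EleutherAI/pythia-70m")
step_branches = sorted(
    (ref.name for ref in refs.branches if ref.name.startswith("step")),
    key=lambda name: int(name[len("step"):]),
)
print(len(step_branches), step_branches[:3], step_branches[-1])
```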
 
  For more information on how to use all Pythia models, see [documentation on
  GitHub](https://github.com/EleutherAI/pythia).

 
  datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
  about the Pile and its component datasets. The Pile can be downloaded from
  the [official website](https://pile.eleuther.ai/), or from a [community
+ mirror](https://the-eye.eu/public/AI/pile/).<br>
  The Pile was **not** deduplicated before being used to train Pythia-70M.

  #### Training procedure
 
  This corresponds to training for just under 1 epoch on the Pile for
  non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

+ All *Pythia* models trained for the equivalent of 143000 steps at a batch size
  of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a batch
  size of 4M tokens listed were originally trained for 71500 steps instead, with
  checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
  consistency with all 2M batch models, so `step1000` is the first checkpoint
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
+ (corresponding to 1000 “actual” steps).<br>
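
To make the renaming concrete, here is a small hypothetical helper (not part of the Pythia repository) that maps a Hugging Face `stepN` branch back to the optimizer step actually taken during training:

```python
# Hypothetical helper: branch names count steps as if every model used a
# 2M-token batch, so 4M-token-batch models map branch stepN to training step
# N/2, while 2M-token-batch models need no remapping.
def actual_training_step(hf_step: int, batch_tokens: int) -> int:
    if batch_tokens == 4 * 1024 * 1024:   # 4M-token batch (e.g. pythia-1.4b)
        return hf_step // 2
    return hf_step                        # 2M-token batch (e.g. pythia-6.9b)

print(actual_training_step(1000, 4 * 1024 * 1024))  # -> 500 "actual" steps
print(actual_training_step(1000, 2 * 1024 * 1024))  # -> 1000 "actual" steps
```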
 
  See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
  procedure, including [how to reproduce
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
+ Pythia uses the same tokenizer as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
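
As a quick sanity check of that tokenizer note (assuming both repositories are accessible), the two tokenizers should encode text identically:

```python
# Sketch: Pythia's tokenizer should match the GPT-NeoX-20B tokenizer.
from transformers import AutoTokenizer

pythia_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

text = "Pythia uses the same tokenizer as GPT-NeoX-20B."
assert pythia_tok.encode(text) == neox_tok.encode(text)
print(pythia_tok.tokenize(text))
```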

  ### Evaluations

  All 16 *Pythia* models were evaluated using the [LM Evaluation
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
  the results by model and step at `results/json/*` in the [GitHub
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
+ Expand the sections below to see plots of evaluation results for all
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
+
+ <details>
+ <summary>LAMBADA – OpenAI</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>WinoGrande</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>AI2 Reasoning Challenge—Challenge Set</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>SciQ</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
+ </details>
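
To reproduce a point on the plots above, the sketch below uses the harness's Python API; it assumes a recent release in which the Hugging Face backend is named `hf` and `simple_evaluate` is available, so exact names may differ from the version used to produce the published `results/json/*` files.

```python
# Hedged sketch: score one Pythia checkpoint on a few of the tasks plotted above.
# Assumes a recent lm-evaluation-harness ("hf" model type, simple_evaluate API).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m,revision=step143000",
    tasks=["lambada_openai", "piqa", "winogrande"],
)
print(results["results"])
```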

  ### Naming convention and parameter count

+ *Pythia* models were renamed in January 2023. It is possible that the old
  naming convention still persists in some documentation by accident. The
  current naming convention (70M, 160M, etc.) is based on total parameter count.