lintang committed
Commit
541538f
1 Parent(s): cc24732

Update README.md

Files changed (1): README.md +82 -72
README.md CHANGED
@@ -5,110 +5,120 @@ language:
  - en
  pipeline_tag: text2text-generation
  tags:
- - summarization
- - translation
  ---

- # Model Card for T5v2 Base

- # Table of Contents

- 1. [Model Details](#model-details)
- 2. [Uses](#uses)
- 3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- 4. [Training Details](#training-details)
- 5. [Evaluation](#evaluation)
- 6. [Environmental Impact](#environmental-impact)
- 7. [Citation](#citation)
- 8. [Model Card Authors](#model-card-authors)
- 9. [How To Get Started With the Model](#how-to-get-started-with-the-model)

- # Model Details

- ## Model Description

- More information needed.
- # Uses

- ## Direct Use and Downstream Use

- More information needed.

- ## Out-of-Scope Use

- More information needed.

- # Bias, Risks, and Limitations

- More information needed.

- ## Recommendations

- More information needed.

- # Training Details

- ## Training Data

- The model was pre-trained on the Pile using an unsupervised denoising objective,
- ## Training Procedure

- More information needed.

- # Evaluation

- ## Testing Data, Factors & Metrics

- More information needed.
- ## Results

- More information needed.

- # Environmental Impact

- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** Google Cloud TPU Pods
- - **Hours used:** More information needed
- - **Cloud Provider:** GCP
- - **Compute Region:** More information needed
- - **Carbon Emitted:** More information needed

- # Citation

- **BibTeX:**

  ```bibtex
  @article{2024t5v2,
  author = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
- title = {T5v2, an update of T5},
  year = {2024},
  url = {}
  }
- ```
-
- # How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- <details>
- <summary> Click to expand </summary>
-
- ```python
- from transformers import UMT5Tokenizer, UMT5Model
-
- tokenizer = UMT5Tokenizer.from_pretrained("EleutherAI/t5-v2-base")
- model = UMT5Model.from_pretrained("EleutherAI/t5-v2-base")
-
- input_ids = tokenizer(
-     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
- ).input_ids  # Batch size 1
- decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
-
- # forward pass
- outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
- last_hidden_states = outputs.last_hidden_state
- ```
-
-
- </details>
 
  - en
  pipeline_tag: text2text-generation
  tags:
+ - t5x
+ - encoder-decoder
  ---

+ Pile-T5 Base is an Encoder-Decoder model trained on [the Pile](https://pile.eleuther.ai/) using the [T5x](https://github.com/google-research/t5x) library. The model was trained for 2 million steps, or roughly 2 trillion tokens, using an MLM objective similar to the original T5 model.

+ ### Model Details

+ - Developed by: [EleutherAI](http://eleuther.ai)
+ - Model type: Transformer-based Language Model
+ - Language: English
+ - Learn more: [Blogpost](). For details about the training dataset,
+ see [the Pile paper](https://arxiv.org/abs/2101.00027) and [its data
+ sheet](https://arxiv.org/abs/2201.07311).
+ - License: Apache 2.0
+ - Contact: to ask questions about this model, join the [EleutherAI
+ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
+ Please read the existing Pile-T5 documentation before asking about the model
+ on Discord. For general correspondence:
+ [contact@eleuther.ai](mailto:contact@eleuther.ai).

+ ### Uses and limitations

+ #### Intended use

+ Pile-T5 was developed primarily for research purposes. It learns an inner
+ representation of the English language that can be used to extract features
+ useful for downstream tasks.

+ In addition to scientific uses, you may also further fine-tune and adapt
+ Pile-T5 for deployment, as long as your use is in accordance with the
+ Apache 2.0 license. This model works with the [Transformers
+ Library](https://huggingface.co/docs/transformers/index). If you decide to use
+ pre-trained Pile-T5 as a basis for your fine-tuned model, please note that
+ you need to conduct your own risk and bias assessment.
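
+ As a minimal sketch of what that can look like with the Transformers library
+ (the input/target pair, task prefix, and learning rate below are illustrative
+ placeholders, not a recommended recipe), a single seq2seq fine-tuning step is:
+ ```python
+ # Illustrative fine-tuning step; data and hyperparameters are placeholders.
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

+ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
+ model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")

+ # Encode a toy input/target pair; labels drive the seq2seq cross-entropy loss.
+ inputs = tokenizer("summarize: The Pile is an 825GiB English dataset ...", return_tensors="pt")
+ labels = tokenizer("A large English pretraining corpus.", return_tensors="pt").input_ids

+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+ loss = model(**inputs, labels=labels).loss
+ loss.backward()
+ optimizer.step()
+ ```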
 
+ #### Out-of-scope use

+ Pile-T5 is **not** intended for deployment as-is. It is not a product
+ and cannot be used for human-facing interactions without supervision.

+ Pile-T5 has not been fine-tuned for downstream tasks for which language
+ models are commonly deployed, such as writing genre prose or commercial
+ chatbots. This means Pile-T5 will likely **not** respond to a given prompt
+ the way products such as ChatGPT do. This is because, unlike Pile-T5,
+ ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human
+ Feedback (RLHF) to better “understand” human instructions and dialogue.

+ This model is English-language only, and thus cannot be used for translation
+ or generating text in other languages.

+ #### Limitations and biases

+ The core functionality of Pile-T5 is to take a string of text in which spans
+ have been replaced with mask tokens and predict a sequence of tokens that would
+ replace those mask tokens. Remember that the statistically most likely sequence
+ of tokens need not result in the most “accurate” text. Never rely on Pile-T5 to produce
+ factually accurate output.
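
+ As a rough illustration of that objective (the sentinel-token strings below follow
+ the original T5 convention and are an assumption, not something stated in this
+ card), a span-corruption input/target pair looks like:
+ ```python
+ # Hypothetical denoising example in the style of T5 pretraining: the model
+ # reads the corrupted input and is trained to emit the masked spans,
+ # each delimited by its sentinel token.
+ corrupted_input = "The <extra_id_0> walks in <extra_id_1> park"
+ target = "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"
+ ```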
 
+ This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
+ known to contain profanity and texts that are lewd or otherwise offensive.
+ See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
+ discussion of documented biases with regard to gender, religion, and race.
+ Pile-T5 may produce socially unacceptable or undesirable text, *even if*
+ the prompt itself does not include anything explicitly offensive.

+ We recommend curating the outputs of this model before presenting them to a
+ human reader. Please inform your audience that you are using artificially
+ generated text.

+ #### How to use

+ Pile-T5 can be loaded using the `AutoModelForSeq2SeqLM` functionality:
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

+ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
+ model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")
+ ```
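
+ As a quick, illustrative follow-up (not taken from this card), the loaded model can
+ fill in a masked span with `generate()`; the `<extra_id_0>` sentinel below assumes
+ T5-style mask tokens:
+ ```python
+ # Hypothetical usage sketch: prompt format and generation settings are assumptions.
+ prompt = "The capital of France is <extra_id_0>."
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=10)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+ Because the base model has only been pretrained on the denoising objective, expect a
+ short span prediction rather than an instruction-following answer.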
 
+ ### Training

+ #### Training dataset

+ The Pile is an 825GiB general-purpose dataset in English. It was created by
+ EleutherAI specifically for training large language models. It contains texts
+ from 22 diverse sources, roughly broken down into five categories: academic
+ writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project
+ Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub,
+ Enron Emails). See [the Pile paper](https://arxiv.org/abs/2101.00027) for
+ a breakdown of all data sources, methodology, and a discussion of ethical
+ implications. Consult [the datasheet](https://arxiv.org/abs/2201.07311) for
+ more detailed documentation about the Pile and its component datasets. The
+ Pile can be downloaded from the [official website](https://pile.eleuther.ai/),
+ or from a [community mirror](https://the-eye.eu/public/AI/pile/).

+ The Pile was deduplicated before being used to train Pile-T5.

+ #### Training procedure

+ Pile-T5 was trained with a batch size of approximately 1M tokens
+ (2048 sequences of 512 tokens each), for a total of 2,000,000 steps.
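
+ As a back-of-the-envelope check (not an official accounting of the token budget),
+ these figures are consistent with the roughly 2 trillion tokens quoted at the top
+ of this card:
+ ```python
+ # Rough consistency check of the quoted training budget.
+ tokens_per_batch = 2048 * 512                # 1,048,576 ≈ 1M tokens per step
+ total_tokens = tokens_per_batch * 2_000_000
+ print(f"{total_tokens:,}")                   # 2,097,152,000,000 ≈ 2 trillion
+ ```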
 
+ ### Evaluations

+ TBD

+ ### BibTeX

  ```bibtex
  @article{2024t5v2,
  author = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
+ title = {Pile T5, an update of T5},
  year = {2024},
  url = {}
  }
+ ```