lintang committed
Commit
541538f
1 Parent(s): cc24732

Update README.md

Files changed (1): README.md +82 -72
README.md CHANGED
@@ -5,110 +5,120 @@ language:
  - en
  pipeline_tag: text2text-generation
  tags:
- - summarization
- - translation
  ---

- # Model Card for T5v2 Base

- # Table of Contents

- 1. [Model Details](#model-details)
- 2. [Uses](#uses)
- 3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- 4. [Training Details](#training-details)
- 5. [Evaluation](#evaluation)
- 6. [Environmental Impact](#environmental-impact)
- 7. [Citation](#citation)
- 8. [Model Card Authors](#model-card-authors)
- 9. [How To Get Started With the Model](#how-to-get-started-with-the-model)

- # Model Details

- ## Model Description

- More information needed.
- # Uses

- ## Direct Use and Downstream Use

- More information needed.

- ## Out-of-Scope Use

- More information needed.

- # Bias, Risks, and Limitations

- More information needed.

- ## Recommendations

- More information needed.

- # Training Details

- ## Training Data

- The model was pre-trained on the Pile using an unsupervised denoising objective,
- ## Training Procedure

- More information needed.

- # Evaluation

- ## Testing Data, Factors & Metrics

- More information needed.
- ## Results

- More information needed.

- # Environmental Impact

- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** Google Cloud TPU Pods
- - **Hours used:** More information needed
- - **Cloud Provider:** GCP
- - **Compute Region:** More information needed
- - **Carbon Emitted:** More information needed

- # Citation

- **BibTeX:**

  ```bibtex
  @article{2024t5v2,
  author = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
- title = {T5v2, an update of T5},
  year = {2024},
  url = {}
  }
- ```
-
- # How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- <details>
- <summary> Click to expand </summary>
-
- ```python
- from transformers import UMT5Tokenizer, UMT5Model
-
- tokenizer = UMT5Tokenizer.from_pretrained("EleutherAI/t5-v2-base")
- model = UMT5Model.from_pretrained("EleutherAI/t5-v2-base")
-
- input_ids = tokenizer(
-     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
- ).input_ids  # Batch size 1
- decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
-
- # forward pass
- outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
- last_hidden_states = outputs.last_hidden_state
- ```
-
-
- </details>
 
  - en
  pipeline_tag: text2text-generation
  tags:
+ - t5x
+ - encoder-decoder
  ---

+ Pile-T5 Base is an Encoder-Decoder model trained on [the Pile](https://pile.eleuther.ai/) using the [T5x](https://github.com/google-research/t5x) library. The model was trained for 2 million steps, or roughly 2 trillion tokens, using an MLM objective similar to the original T5 model.

+ ### Model Details

+ - Developed by: [EleutherAI](http://eleuther.ai)
+ - Model type: Transformer-based Language Model
+ - Language: English
+ - Learn more: [Blogpost](). For details about the training dataset,
+ see [the Pile paper](https://arxiv.org/abs/2101.00027) and [its data
+ sheet](https://arxiv.org/abs/2201.07311).
+ - License: Apache 2.0
+ - Contact: to ask questions about this model, join the [EleutherAI
+ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
+ Please read the existing Pile-T5 documentation before asking about the model
+ on Discord. For general correspondence:
+ [contact@eleuther.ai](mailto:contact@eleuther.ai).

+ ### Uses and limitations

+ #### Intended use

+ Pile-T5 was developed primarily for research purposes. It learns an inner
+ representation of the English language that can be used to extract features
+ useful for downstream tasks.

+ In addition to scientific uses, you may also further fine-tune and adapt
+ Pile-T5 for deployment, as long as your use is in accordance with the
+ Apache 2.0 license. This model works with the [Transformers
+ Library](https://huggingface.co/docs/transformers/index). If you decide to use
+ pre-trained Pile-T5 as a basis for your fine-tuned model, please note that
+ you need to conduct your own risk and bias assessment.
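
+ As a minimal sketch of what that can look like with the Transformers library
+ (the input/target pair, task prefix, and learning rate below are illustrative
+ placeholders, not a recommended recipe), a single seq2seq fine-tuning step is:
+ ```python
+ # Illustrative fine-tuning step; data and hyperparameters are placeholders.
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

+ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
+ model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")

+ # Encode a toy input/target pair; labels drive the seq2seq cross-entropy loss.
+ inputs = tokenizer("summarize: The Pile is an 825GiB English dataset ...", return_tensors="pt")
+ labels = tokenizer("A large English pretraining corpus.", return_tensors="pt").input_ids

+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+ loss = model(**inputs, labels=labels).loss
+ loss.backward()
+ optimizer.step()
+ ```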
 
+ #### Out-of-scope use

+ Pile-T5 is **not** intended for deployment as-is. It is not a product
+ and cannot be used for human-facing interactions without supervision.

+ Pile-T5 has not been fine-tuned for downstream tasks for which language
+ models are commonly deployed, such as writing genre prose or commercial
+ chatbots. This means Pile-T5 will likely **not** respond to a given prompt
+ the way products such as ChatGPT do. This is because, unlike Pile-T5,
+ ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human
+ Feedback (RLHF) to better “understand” human instructions and dialogue.

+ This model is English-language only, and thus cannot be used for translation
+ or generating text in other languages.

+ #### Limitations and biases

+ The core functionality of Pile-T5 is to take a string of text in which spans
+ have been replaced with mask tokens and predict a sequence of tokens that would
+ replace those mask tokens. Remember that the statistically most likely sequence
+ of tokens need not result in the most “accurate” text. Never rely on Pile-T5 to produce
+ factually accurate output.
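
+ As a rough illustration of that objective (the sentinel-token strings below follow
+ the original T5 convention and are an assumption, not something stated in this
+ card), a span-corruption input/target pair looks like:
+ ```python
+ # Hypothetical denoising example in the style of T5 pretraining: the model
+ # reads the corrupted input and is trained to emit the masked spans,
+ # each delimited by its sentinel token.
+ corrupted_input = "The <extra_id_0> walks in <extra_id_1> park"
+ target = "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"
+ ```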
 
+ This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
+ known to contain profanity and texts that are lewd or otherwise offensive.
+ See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
+ discussion of documented biases with regard to gender, religion, and race.
+ Pile-T5 may produce socially unacceptable or undesirable text, *even if*
+ the prompt itself does not include anything explicitly offensive.

+ We recommend curating the outputs of this model before presenting them to a
+ human reader. Please inform your audience that you are using artificially
+ generated text.

+ #### How to use

+ Pile-T5 can be loaded using the `AutoModelForSeq2SeqLM` functionality:
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

+ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
+ model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")
+ ```
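
+ As a quick, illustrative follow-up (not taken from this card), the loaded model can
+ fill in a masked span with `generate()`; the `<extra_id_0>` sentinel below assumes
+ T5-style mask tokens:
+ ```python
+ # Hypothetical usage sketch: prompt format and generation settings are assumptions.
+ prompt = "The capital of France is <extra_id_0>."
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=10)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+ Because the base model has only been pretrained on the denoising objective, expect a
+ short span prediction rather than an instruction-following answer.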
 
+ ### Training

+ #### Training dataset

+ The Pile is an 825GiB general-purpose dataset in English. It was created by
+ EleutherAI specifically for training large language models. It contains texts
+ from 22 diverse sources, roughly broken down into five categories: academic
+ writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project
+ Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub,
+ Enron Emails). See [the Pile paper](https://arxiv.org/abs/2101.00027) for
+ a breakdown of all data sources, methodology, and a discussion of ethical
+ implications. Consult [the datasheet](https://arxiv.org/abs/2201.07311) for
+ more detailed documentation about the Pile and its component datasets. The
+ Pile can be downloaded from the [official website](https://pile.eleuther.ai/),
+ or from a [community mirror](https://the-eye.eu/public/AI/pile/).

+ The Pile was deduplicated before being used to train Pile-T5.

+ #### Training procedure

+ Pile-T5 was trained with a batch size of approximately 1M tokens
+ (2048 sequences of 512 tokens each), for a total of 2,000,000 steps.
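
+ As a back-of-the-envelope check (not an official accounting of the token budget),
+ these figures are consistent with the roughly 2 trillion tokens quoted at the top
+ of this card:
+ ```python
+ # Rough consistency check of the quoted training budget.
+ tokens_per_batch = 2048 * 512                # 1,048,576 ≈ 1M tokens per step
+ total_tokens = tokens_per_batch * 2_000_000
+ print(f"{total_tokens:,}")                   # 2,097,152,000,000 ≈ 2 trillion
+ ```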
 
+ ### Evaluations

+ TBD

+ ### BibTeX

  ```bibtex
  @article{2024t5v2,
  author = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
+ title = {Pile T5, an update of T5},
  year = {2024},
  url = {}
  }
+ ```