RichardErkhov committed on
Commit
54d844e
1 Parent(s): 390ce25

uploaded readme

  1. README.md +262 -0
README.md ADDED
@@ -0,0 +1,262 @@
Quantization made by Richard Erkhov.

[GitHub](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


pythia-70m-deduped-v0 - bnb 4bits
- Model creator: https://huggingface.co/EleutherAI/
- Original model: https://huggingface.co/EleutherAI/pythia-70m-deduped-v0/

Original model description:
---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
- pythia_v0
license: apache-2.0
datasets:
- EleutherAI/the_pile_deduplicated
---

The *Pythia Scaling Suite* is a collection of models developed to facilitate
interpretability research. It contains two sets of eight models of sizes
70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
models: one trained on the Pile, and one trained on the Pile after the dataset
has been globally deduplicated. All 8 model sizes are trained on the exact
same data, in the exact same order. All Pythia models are available
[on Hugging Face](https://huggingface.co/models?other=pythia).

The Pythia model suite was deliberately designed to promote scientific
research on large language models, especially interpretability research.
Despite not centering downstream performance as a design goal, we find the
models <a href="#evaluations">match or exceed</a> the performance of
similar and same-sized models, such as those in the OPT and GPT-Neo suites.

Please note that all models in the *Pythia* suite were renamed in January 2023.
For clarity, a <a href="#naming-convention-and-parameter-count">table
comparing the old and new names</a> is provided in this model card, together
with exact parameter counts.

## Pythia-70M-deduped

### Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
  for training procedure, config files, and details on how to use.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
  Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
  Please read the existing *Pythia* documentation before asking about it in the
  EleutherAI Discord. For general correspondence:
  [contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size |     Learning Rate     |   Equivalent Models    |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
|          70M |           18,915,328 |   6    |    512    |   8   |     2M     | 1.0 x 10<sup>-3</sup> |           —            |
|         160M |           85,056,000 |   12   |    768    |  12   |     4M     | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
|         410M |          302,311,424 |   24   |   1024    |  16   |     4M     | 3.0 x 10<sup>-4</sup> |        OPT-350M        |
|         1.0B |          805,736,448 |   16   |   2048    |   8   |     2M     | 3.0 x 10<sup>-4</sup> |           —            |
|         1.4B |        1,208,602,624 |   24   |   2048    |  16   |     4M     | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
|         2.8B |        2,517,652,480 |   32   |   2560    |  32   |     2M     | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
|         6.9B |        6,444,163,072 |   32   |   4096    |  32   |     2M     | 1.2 x 10<sup>-4</sup> |        OPT-6.7B        |
|          12B |       11,327,027,200 |   36   |   5120    |  40   |     2M     | 1.2 x 10<sup>-4</sup> |           —            |
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
non-deduped models of a given size have the same hyperparameters. “Equivalent”
models have <b>exactly</b> the same architecture, and the same number of
non-embedding parameters.</figcaption>
</figure>

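As a rough sanity check on the table above, the sketch below compares the reported non-embedding parameter counts with the common rule of thumb of 12 · layers · model_dim² for a decoder-only transformer. The rule of thumb is an assumption of this example and is not stated in the model card; it ignores biases and layer-norm parameters, so a small gap is expected. The dictionary and function names are illustrative only.

```python
# (layers, model_dim, reported non-embedding params), copied from the table above.
PYTHIA_CONFIGS = {
    "70M": (6, 512, 18_915_328),
    "160M": (12, 768, 85_056_000),
    "410M": (24, 1024, 302_311_424),
    "1.0B": (16, 2048, 805_736_448),
    "1.4B": (24, 2048, 1_208_602_624),
    "2.8B": (32, 2560, 2_517_652_480),
    "6.9B": (32, 4096, 6_444_163_072),
    "12B": (36, 5120, 11_327_027_200),
}


def approx_non_embedding_params(layers: int, model_dim: int) -> int:
    """Rule-of-thumb estimate: about 4*d^2 (attention) + 8*d^2 (MLP) per layer."""
    return 12 * layers * model_dim ** 2


def relative_error(name: str) -> float:
    """Relative gap between the estimate and the reported count for one model."""
    layers, model_dim, reported = PYTHIA_CONFIGS[name]
    estimate = approx_non_embedding_params(layers, model_dim)
    return abs(estimate - reported) / reported


for name in PYTHIA_CONFIGS:
    print(f"{name:>5}: estimate off by {relative_error(name):.3%}")
```

Under these assumptions the estimate stays within 1% of every reported count.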
### Uses and Limitations

#### Intended Use

The primary intended use of Pythia is research on the behavior, functionality,
and limitations of large language models. This suite is intended to provide
a controlled setting for performing scientific experiments. To enable the
study of how language models change in the course of training, we provide
143 evenly spaced intermediate checkpoints per model. These checkpoints are
hosted on Hugging Face as branches. Note that branch `step143000` corresponds
exactly to the model checkpoint on the `main` branch of each model.

You may also further fine-tune and adapt Pythia-70M-deduped for deployment,
as long as your use is in accordance with the Apache 2.0 license. Pythia
models work with the Hugging Face [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pythia-70M-deduped as a basis for your fine-tuned model, please
conduct your own risk and bias assessment.

#### Out-of-scope use

The Pythia Suite is **not** intended for deployment. It is not in itself
a product and cannot be used for human-facing interactions.

Pythia models are English-language only, and are not suitable for translation
or generating text in other languages.

Pythia-70M-deduped has not been fine-tuned for downstream contexts in which
language models are commonly deployed, such as writing genre prose
or commercial chatbots. This means Pythia-70M-deduped will **not**
respond to a given prompt the way a product like ChatGPT does. This is because,
unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
Learning from Human Feedback (RLHF) to better “understand” human instructions.

#### Limitations and biases

The core functionality of a large language model is to take a string of text
and predict the next token. The token deemed statistically most likely by the
model need not produce the most “accurate” text. Never rely on
Pythia-70M-deduped to produce factually accurate output.

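To make the point about "most likely" versus "accurate" concrete, here is a toy sketch of greedy next-token selection in plain Python. The candidate tokens and their scores are invented for illustration and do not come from any Pythia model; the example only shows that the highest-probability token is chosen regardless of whether it is factually correct (here the top-scoring candidate is the wrong answer to the hypothetical prompt).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_next_token(vocab, logits):
    """Return the most probable token and its probability."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]


# Hypothetical candidates and scores for the prompt "The capital of Australia is".
vocab = ["Sydney", "Canberra", "Melbourne"]
logits = [2.0, 1.5, 0.5]

token, prob = greedy_next_token(vocab, logits)
print(f"chosen token: {token} (p={prob:.3f})")
```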
This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regard to gender, religion, and race.
Pythia-70M-deduped may produce socially unacceptable or undesirable text,
*even if* the prompt itself does not include anything explicitly offensive.

If you plan on using text generated through, for example, the Hosted Inference
API, we recommend having a human curate the outputs of this language model
before presenting them to other people. Please inform your audience that the
text was generated by Pythia-70M-deduped.

### Quickstart

Pythia models can be loaded and used via the following code, demonstrated here
for the third `pythia-70m-deduped` checkpoint:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the weights saved on the `step3000` branch.
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

# Load the tokenizer from the same revision.
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

# Tokenize a prompt, generate a continuation, and print the decoded text.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
print(tokenizer.decode(tokens[0]))
```

Revision/branch `step143000` corresponds exactly to the model checkpoint on
the `main` branch of each model.<br>
For more information on how to use all Pythia models, see [documentation on
GitHub](https://github.com/EleutherAI/pythia).

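The checkpoint branches can be enumerated programmatically. The sketch below assumes that revision names follow the `step<N>` pattern at intervals of 1000 steps, which is inferred from `step3000` being described as the third checkpoint and `step143000` as the final one; the helper name is illustrative, and the Hub may host further revisions that this list does not cover.

```python
CHECKPOINT_INTERVAL = 1000  # assumed spacing between saved checkpoints, in steps
FINAL_STEP = 143_000        # final step, matching the `step143000` branch


def checkpoint_revisions(interval=CHECKPOINT_INTERVAL, final_step=FINAL_STEP):
    """Return the assumed branch names, from the first checkpoint to the last."""
    return [f"step{step}" for step in range(interval, final_step + 1, interval)]


revisions = checkpoint_revisions()
print(len(revisions), "revisions, from", revisions[0], "to", revisions[-1])
```

Each name in `revisions` could then be passed as the `revision` argument shown in the loading code above.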
### Training

#### Training data

Pythia-70M-deduped was trained on the Pile **after the dataset has been
globally deduplicated**.<br>
[The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).

#### Training procedure

All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
This corresponds to training for just under 1 epoch on the Pile for
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

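The token figures above are mutually consistent, which a few lines of arithmetic can confirm. This is only a check of the numbers quoted in the paragraph; the constant and function names are illustrative.

```python
TOTAL_TOKENS = 299_892_736_000         # tokens seen by each model
TOKENS_PER_CHECKPOINT = 2_097_152_000  # tokens between consecutive checkpoints


def checkpoint_token_counts(total, interval):
    """Cumulative token counts at which a checkpoint is saved."""
    return list(range(interval, total + 1, interval))


counts = checkpoint_token_counts(TOTAL_TOKENS, TOKENS_PER_CHECKPOINT)
print("number of checkpoints:", len(counts))
print("tokens at last checkpoint:", counts[-1])
```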
All *Pythia* models trained for the equivalent of 143000 steps at a batch size
of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models listed with a
batch size of 4M tokens were originally trained for 71500 steps instead, with
checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
consistency with all 2M batch models, so `step1000` is the first checkpoint
for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
`step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
(corresponding to 1000 “actual” steps).<br>
See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
procedure, including [how to reproduce
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
Pythia uses the same tokenizer as
[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).

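The renaming described above amounts to converting a branch name into a token count and then dividing by the model's real batch size. The following sketch illustrates that conversion; the function name is hypothetical, and the 4M batch size is taken to be exactly twice 2,097,152 tokens, which is an assumption consistent with the 71500-step figure.

```python
TOKENS_PER_2M_STEP = 2_097_152       # batch size that the branch names are based on
BATCH_2M = 2_097_152
BATCH_4M = 2 * BATCH_2M              # assumed exact size of the "4M" batch


def actual_training_step(revision: str, batch_size_tokens: int) -> int:
    """Map a branch name such as 'step1000' to the optimizer step actually run."""
    if not revision.startswith("step"):
        raise ValueError(f"unexpected revision name: {revision!r}")
    named_step = int(revision[len("step"):])
    tokens_seen = named_step * TOKENS_PER_2M_STEP
    if tokens_seen % batch_size_tokens != 0:
        raise ValueError("revision does not align with this batch size")
    return tokens_seen // batch_size_tokens


print(actual_training_step("step1000", BATCH_4M))    # a 4M-batch model such as pythia-1.4b
print(actual_training_step("step1000", BATCH_2M))    # a 2M-batch model such as pythia-6.9b
print(actual_training_step("step143000", BATCH_4M))  # final checkpoint of a 4M-batch model
```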
### Evaluations

All 16 *Pythia* models were evaluated using the [LM Evaluation
Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
the results by model and step at `results/json/*` in the [GitHub
repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
Expand the sections below to see plots of evaluation results for all
Pythia and Pythia-deduped models compared with OPT and BLOOM.

<details>
<summary>LAMBADA – OpenAI</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
</details>

<details>
<summary>Physical Interaction: Question Answering (PIQA)</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
</details>

<details>
<summary>WinoGrande</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
</details>

<details>
<summary>AI2 Reasoning Challenge – Challenge Set</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
</details>

<details>
<summary>SciQ</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
</details>

### Naming convention and parameter count

*Pythia* models were renamed in January 2023. It is possible that the old
naming convention still persists in some documentation by accident. The
current naming convention (70M, 160M, etc.) is based on total parameter count.

<figure style="width:32em">

| current Pythia suffix | old suffix |   total params | non-embedding params |
| --------------------: | ---------: | -------------: | -------------------: |
|                   70M |        19M |     70,426,624 |           18,915,328 |
|                  160M |       125M |    162,322,944 |           85,056,000 |
|                  410M |       350M |    405,334,016 |          302,311,424 |
|                    1B |       800M |  1,011,781,632 |          805,736,448 |
|                  1.4B |       1.3B |  1,414,647,808 |        1,208,602,624 |
|                  2.8B |       2.7B |  2,775,208,960 |        2,517,652,480 |
|                  6.9B |       6.7B |  6,857,302,016 |        6,444,163,072 |
|                   12B |        13B | 11,846,072,320 |       11,327,027,200 |
</figure>

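The gap between the two parameter columns is the embedding parameter count, which helps explain why the old and new suffixes differ so much for the smallest models. The sketch below derives it from the table; the dictionary and function names are illustrative, and the interpretation of the difference as "embedding parameters" follows from the column headings rather than from an explicit statement in the model card.

```python
# (total params, non-embedding params), copied from the table above.
PARAM_COUNTS = {
    "70M": (70_426_624, 18_915_328),
    "160M": (162_322_944, 85_056_000),
    "410M": (405_334_016, 302_311_424),
    "1B": (1_011_781_632, 805_736_448),
    "1.4B": (1_414_647_808, 1_208_602_624),
    "2.8B": (2_775_208_960, 2_517_652_480),
    "6.9B": (6_857_302_016, 6_444_163_072),
    "12B": (11_846_072_320, 11_327_027_200),
}


def embedding_params(name: str) -> int:
    """Parameters not counted as non-embedding, i.e. total minus non-embedding."""
    total, non_embedding = PARAM_COUNTS[name]
    return total - non_embedding


def embedding_share(name: str) -> float:
    """Fraction of all parameters that sit in the embeddings."""
    total, _ = PARAM_COUNTS[name]
    return embedding_params(name) / total


for name in PARAM_COUNTS:
    print(f"{name:>5}: {embedding_params(name):>13,} embedding params "
          f"({embedding_share(name):.1%} of total)")
```

For the 70M model most parameters are embeddings, while for the 12B model they are a small fraction of the total.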