rskuzma committed on
Commit
ec12d3b
1 Parent(s): 85ee167

First README version for 111M Requires review; links to paper; additions to evaluation table; and BibTeX

Files changed (1)
  1. README.md +224 -0
README.md CHANGED
@@ -1,3 +1,227 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - pytorch
6
+ - causal-lm
7
  license: apache-2.0
8
+ datasets:
9
+ - the_pile
10
+
11
  ---
12
+
13
+ # Cerebras-GPT 111M
14
+
15
+ ## Model Description
16
+
17
+ The Cerebras-GPT family is released to facilitate research into LLM scaling laws using open architectures and data sets, and to demonstrate the simplicity and scalability of training LLMs on the Cerebras software and hardware stack. All Cerebras-GPT models are available on Hugging Face.
18
+
19
+ The family includes 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B models.
20
+
21
+ All models in the Cerebras-GPT family have been trained in accordance with [Chinchilla scaling laws](https://arxiv.org/abs/2203.15556) (20 tokens per model parameter), which yields improved performance at smaller model sizes.
22
+
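As a rough illustration of the 20-tokens-per-parameter rule, the sketch below computes a training-token budget for each model size in the family. The parameter counts are the nominal sizes listed above (the exact counts may differ slightly), and the helper name `chinchilla_tokens` is hypothetical.

```python
# Nominal parameter counts for the Cerebras-GPT family (from the list above).
# These are rounded sizes, so the budgets are approximate.
MODEL_PARAMS = {
    "111M": 111e6,
    "256M": 256e6,
    "590M": 590e6,
    "1.3B": 1.3e9,
    "2.7B": 2.7e9,
    "6.7B": 6.7e9,
    "13B": 13e9,
}

TOKENS_PER_PARAM = 20  # ratio stated in this README


def chinchilla_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Hypothetical helper: training-token budget for a given model size."""
    return n_params * tokens_per_param


for name, n_params in MODEL_PARAMS.items():
    print(f"{name}: {chinchilla_tokens(n_params):.2e} tokens")
```
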
23
+ These models were trained on the [Andromeda](https://www.cerebras.net/andromeda/) AI supercomputer, comprising 16 CS-2 wafer-scale systems. Cerebras' [weight streaming technology](https://www.cerebras.net/blog/linear-scaling-made-possible-with-weight-streaming) simplifies the training of LLMs by disaggregating compute from model storage. This allowed for efficient scaling of training across nodes using simple data parallelism.
24
+
25
+ Cerebras systems for pre-training and fine-tuning are available in the cloud via the [Cerebras Model Studio](https://www.cerebras.net/product-cloud/). Cerebras CS-2 compatible checkpoints are available in [Cerebras Model Zoo](https://github.com/Cerebras/modelzoo).
26
+
27
+ ## Model Details
28
+ * Developed by: [Cerebras Systems](https://www.cerebras.net/)
29
+ * License: Apache 2.0
30
+ * Model type: Transformer-based Language Model
31
+ * Architecture: GPT-2 model architecture with hyperparameters more similar to GPT-3.
32
+ * Data set: The Pile
33
+ * Tokenizer: Byte Pair Encoding
34
+ * Vocabulary Size: 50257
35
+ * Sequence Length: 2048
36
+ * Optimizer: AdamW, (β1, β2) = (0.9, 0.95), adam_eps = 1e−8 (1e−9 for larger models)
37
+ * Positional Encoding: Learned
38
+ * Language: English
39
+ * Learn more: Dense Scaling Laws Paper for training procedure, config files, and details on how to use.
40
+
41
+ **Contact**: To ask questions about Cerebras-GPT models, join the Cerebras Discord and post them in **#scaling-laws-release**.
42
+
43
+ This is the standard parameterization version of Cerebras-GPT with **111M** parameters.
44
+
45
+ Related models: [Cerebras-GPT Models](https://huggingface.co/models?sort=downloads&search=cerebras-gpt)
46
+
47
+ <br><br>
48
+
49
+ | Model | Parameters | Layers | d_model | Heads | d_head | d_ffn | LR | BS (seq) | BS (tokens) |
50
+ |---------------|------------|--------|---------|-------|--------|--------|----------|----------|----------------|
51
+ | Cerebras-GPT | 111M | 10 | 768 | 12 | 64 | 3072 | 6.00E-04 | 120 | 246K |
52
+ | Cerebras-GPT | 256M | 14 | 1088 | 17 | 64 | 4352 | 6.00E-04 | 264 | 541K |
53
+ | Cerebras-GPT | 590M | 18 | 1536 | 12 | 128 | 6144 | 2.00E-04 | 264 | 541K |
54
+ | Cerebras-GPT | 1.3B | 24 | 2048 | 16 | 128 | 8192 | 2.00E-04 | 528 | 1.08M |
55
+ | Cerebras-GPT | 2.7B | 32 | 2560 | 20 | 128 | 10240 | 2.00E-04 | 528 | 1.08M |
56
+ | Cerebras-GPT | 6.7B | 32 | 4096 | 32 | 128 | 16384 | 1.20E-04 | 1040 | 2.13M |
57
+ | Cerebras-GPT | 13B | 40 | 5120 | 40 | 128 | 20480 | 1.20E-04 | 720/1080 | 1.47M/2.21M |
58
+ | Cerebras-GPT-muP | 111M | 10 | 768 | 12 | 64 | 3072 | 6.00E-03 | 120 | 246K |
59
+ | Cerebras-GPT-muP | 256M | 14 | 1088 | 17 | 64 | 4352 | 6.00E-03 | 264 | 541K |
60
+ | Cerebras-GPT-muP | 590M | 18 | 1536 | 12 | 128 | 6144 | 6.00E-03 | 264 | 541K |
61
+ | Cerebras-GPT-muP | 1.3B | 24 | 2048 | 16 | 128 | 8192 | 6.00E-03 | 528 | 1.08M |
62
+ | Cerebras-GPT-muP | 2.7B | 32 | 2560 | 20 | 128 | 10240 | 6.00E-03 | 528 | 1.08M |
63
+
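The table values satisfy a few simple relations that can be checked mechanically: `d_model = Heads × d_head`, `d_ffn = 4 × d_model`, and `BS (tokens) = BS (seq) × 2048` (the sequence length from Model Details). The sketch below checks the standard-parameterization rows up to 6.7B; the function names are hypothetical and the relations are read off the table, not quoted from the paper.

```python
SEQ_LEN = 2048  # sequence length from Model Details

# (name, d_model, heads, d_head, d_ffn, batch size in sequences), copied from the table.
ROWS = [
    ("111M", 768, 12, 64, 3072, 120),
    ("256M", 1088, 17, 64, 4352, 264),
    ("590M", 1536, 12, 128, 6144, 264),
    ("1.3B", 2048, 16, 128, 8192, 528),
    ("2.7B", 2560, 20, 128, 10240, 528),
    ("6.7B", 4096, 32, 128, 16384, 1040),
]


def shapes_consistent(d_model, heads, d_head, d_ffn):
    """Check d_model = heads * d_head and d_ffn = 4 * d_model."""
    return d_model == heads * d_head and d_ffn == 4 * d_model


def batch_tokens(batch_seqs, seq_len=SEQ_LEN):
    """Tokens per optimizer step for a batch of full-length sequences."""
    return batch_seqs * seq_len


for name, d_model, heads, d_head, d_ffn, batch_seqs in ROWS:
    ok = shapes_consistent(d_model, heads, d_head, d_ffn)
    print(f"{name}: shapes consistent={ok}, tokens per batch={batch_tokens(batch_seqs):,}")
```
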
64
+ <br><br>
65
+
66
+ ## Quickstart
67
+
68
+ This model can be easily loaded using the AutoModelForCausalLM functionality:
69
+ ```python
70
+ from transformers import AutoTokenizer, AutoModelForCausalLM
71
+
72
+ tokenizer = AutoTokenizer.from_pretrained("Cerebras/Cerebras-GPT-111M")
73
+ model = AutoModelForCausalLM.from_pretrained("Cerebras/Cerebras-GPT-111M")
74
+
75
+ text = "Generative AI is "
76
+ ```
77
+
78
+ It can be used with Hugging Face Pipelines:
79
+
80
+ ```python
81
+ from transformers import pipeline
82
+
83
+ pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
84
+ generated_text = pipe(text, max_length=50, do_sample=False, no_repeat_ngram_size=2)[0]
85
+ print(generated_text['generated_text'])
86
+ ```
87
+
88
+ Or with `model.generate()`:
89
+
90
+ ```python
91
+ inputs = tokenizer(text, return_tensors="pt")
92
+ outputs = model.generate(**inputs, num_beams=5,
93
+                          max_new_tokens=50, early_stopping=True,
94
+                          no_repeat_ngram_size=2)
95
+ text_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
96
+ print(text_output[0])
97
+ ```
98
+ <br><br>
99
+
100
+ ## Training data
101
+
102
+ Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai), which consists of data from 22 data sources. See the [Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed breakdown of data sources and methodology.
103
+
104
+ Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the total token count by 33%. Our models are trained on the Pile **without deduplication**, which presents an opportunity for further improvement with the deduplicated data set.
105
+
106
+ Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of our paper.
107
+
108
+ <br><br>
109
+
110
+ ## Training procedure
111
+
112
+ We use the GPT-2 model architecture with hyperparameters more similar to GPT-3. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected either to follow aspect ratio 80 or to match the shapes of GPT-3 models. The learning rate was warmed up for 375M tokens (1500 steps for the 111M and 256M models) and then cosine-decayed by 10x. No dropout was used and weight decay was set to 0.1.
113
+
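The warmup-then-decay schedule described above can be sketched as follows. This is a minimal illustration, not the training code: it assumes linear warmup and reads "10x cosine decay" as a cosine decay from the peak learning rate down to one tenth of it. The function name `lr_at` and the `final_ratio` parameter are hypothetical; the example values (peak 6.0E-04, 1500 warmup steps, 9037 total steps) are the 111M figures quoted in this README.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, final_ratio=0.1):
    """Learning rate at a given step (assumed: linear warmup, cosine decay to peak_lr * final_ratio)."""
    if step < warmup_steps:
        # Assumed linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * final_ratio
    return min_lr + (peak_lr - min_lr) * cosine


# Example with the 111M values quoted in this README.
for step in (0, 750, 1500, 5000, 9037):
    print(step, lr_at(step, peak_lr=6e-4, warmup_steps=1500, total_steps=9037))
```
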
114
+ All models were trained to the Chinchilla point: 20 tokens per model parameter. The number of steps was determined by the batch size (varied by model) and the fixed sequence length (2048). See the training table below for details.
115
+
116
+ <br>
117
+
118
+ Model Params | Sequence Length | Batch Size | Number of Steps | Tokens | Tokens per Parameter | FLOPs
119
+ ------------ | -------------- | ---------- | --------------- | ------ | -------------------- | -----
120
+ 111M | 2048 | 120 | 9037 | 2.22E+09 | 20 | 2.5E+18
121
+ 256M | 2048 | 264 | 9468 | 5.12E+09 | 20 | 1.1E+19
122
+ 590M | 2048 | 264 | 21836 | 1.18E+10 | 20 | 5.3E+19
123
+ 1.3B | 2048 | 528 | 24334 | 2.63E+10 | 20 | 2.5E+20
124
+ 2.7B | 2048 | 528 | 49041 | 5.30E+10 | 20 | 9.8E+20
125
+ 6.7B | 2048 | 1040 | 62522 | 1.33E+11 | 20 | 5.9E+21
126
+ 13B | 2048 | 720 | 174335 | 2.57E+11 | 20 | 2.1E+22
127
+
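The Tokens column follows from the other columns as `steps × batch size × sequence length`. The sketch below recomputes it from the table rows and compares it with the reported value; the function name `tokens_seen` is hypothetical, and the 13B row uses the single batch size of 720 shown in this table.

```python
SEQ_LEN = 2048  # same for every row in the table

# (name, batch size in sequences, number of steps, reported tokens), copied from the table.
TRAINING_ROWS = [
    ("111M", 120, 9037, 2.22e9),
    ("256M", 264, 9468, 5.12e9),
    ("590M", 264, 21836, 1.18e10),
    ("1.3B", 528, 24334, 2.63e10),
    ("2.7B", 528, 49041, 5.30e10),
    ("6.7B", 1040, 62522, 1.33e11),
    ("13B", 720, 174335, 2.57e11),
]


def tokens_seen(batch_size, steps, seq_len=SEQ_LEN):
    """Total training tokens for a run with fixed batch size and sequence length."""
    return batch_size * steps * seq_len


for name, batch_size, steps, reported in TRAINING_ROWS:
    computed = tokens_seen(batch_size, steps)
    rel_err = abs(computed - reported) / reported
    print(f"{name}: computed {computed:.3e}, reported {reported:.2e}, rel. diff {rel_err:.2%}")
```
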
128
+ <br><br>
129
+
130
+ ## Evaluations
131
+
132
+ We evaluate our models on the Pile validation set, comprising 380M tokens. We also evaluate the public checkpoints of Pythia (EleutherAI, 2022), OPT (Zhang et al., 2022), GPT-NeoX 20B (Black et al., 2022), and GPT-J 6B (Wang & Komatsuzaki, 2021). We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the training run was going well.
133
+
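The extrapolation idea can be sketched with a least-squares fit in log-log space. This is only an illustration under stated assumptions: it fits a pure power law `loss = a * flops**b` with no irreducible-loss term, uses the rounded training FLOPs and Pile test cross-entropy values from the 0-shot table in this README as stand-ins for the validation losses, and the function names are hypothetical. The fit described in the paper may differ.

```python
import math

# (training FLOPs, Pile test cross-entropy) for the 111M to 2.7B models,
# copied from the 0-shot evaluation table in this README.
POINTS = [
    (2.5e18, 2.566),
    (1.1e19, 2.299),
    (5.3e19, 2.184),
    (2.5e20, 1.996),
    (9.8e20, 1.834),
]


def fit_power_law(points):
    """Least-squares fit of loss = a * flops**b, done as a line fit in log-log space."""
    xs = [math.log(flops) for flops, _ in points]
    ys = [math.log(loss) for _, loss in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


def predict_loss(a, b, flops):
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * flops ** b


a, b = fit_power_law(POINTS)
# Extrapolate to the 13B compute budget (2.1E+22 FLOPs); the table reports 1.575 for that model.
print(f"exponent b = {b:.4f}")
print(f"extrapolated loss at 2.1e22 FLOPs = {predict_loss(a, b, 2.1e22):.3f}")
```
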
134
+ #### 0-shot Evaluation
135
+ | Model | Count | Training FLOPs | PILE test xent | Hella-Swag | PIQA | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA | Downstream Average |
136
+ | ------- | ----- | -------------- | -------------- | ---------- | ----- | ----------- | ------- | ----- | ----- | ---------- | ------------------ |
137
+ | Cerebras| 111M | 2.5E+18 | 2.566 | 0.268 | 0.594 | 0.488 | 0.194 | 0.380 | 0.166 | 0.118 | 0.315 |
138
+ | | 256M | 1.1E+19 | 2.299 | 0.274 | 0.613 | 0.511 | 0.293 | 0.410 | 0.170 | 0.158 | 0.347 |
139
+ | | 590M | 5.3E+19 | 2.184 | 0.291 | 0.627 | 0.498 | 0.366 | 0.464 | 0.190 | 0.158 | 0.370 |
140
+ | | 1.3B | 2.5E+20 | 1.996 | 0.325 | 0.664 | 0.521 | 0.462 | 0.508 | 0.224 | 0.166 | 0.410 |
141
+ | | 2.7B | 9.8E+20 | 1.834 | 0.386 | 0.701 | 0.559 | 0.567 | 0.571 | 0.246 | 0.206 | 0.462 |
142
+ | | 6.7B | 5.9E+21 | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
143
+ | | 13B | 2.1E+22 | 1.575 | 0.513 | 0.766 | 0.646 | 0.696 | 0.714 | 0.367 | 0.286 | 0.570 |
144
+ | Pythia | 70M | 1.3E+20 | TODO | 0.270 | 0.590 | 0.491 | 0.259 | 0.413 | 0.185 | 0.132 | 0.334 |
145
+ | | 160M | 3.4E+20 | TODO | 0.293 | 0.627 | 0.519 | 0.389 | 0.452 | 0.181 | 0.160 | 0.375 |
146
+ | | 410M | 8.8E+20 | TODO | 0.333 | 0.668 | 0.530 | 0.505 | 0.504 | 0.213 | 0.178 | 0.419 |
147
+ | | 1B | 2.0E+21 | TODO | 0.376 | 0.705 | 0.545 | 0.566 | 0.559 | 0.243 | 0.196 | 0.456 |
148
+ | | 2.8B | 5.5E+21 | TODO | 0.451 | 0.737 | 0.612 | 0.654 | 0.629 | 0.288 | 0.220 | 0.513 |
149
+ | | 6.9B | 2.6E+22 | TODO | 0.482 | 0.746 | 0.611 | 0.679 | 0.669 | 0.323 | 0.270 | 0.540 |
150
+ | | 12B | 9.0E+22 | TODO | 0.505 | 0.761 | 0.645 | 0.705 | 0.700 | 0.336 | 0.284 | 0.562 |
151
+ | NeoX | 20B | 6.1E+22 | TODO | 0.535 | 0.779 | 0.661 | 0.720 | 0.723 | 0.380 | 0.290 | 0.584 |
152
+ | GPTJ | 6B | 1.5E+22 | TODO | 0.518 | 0.752 | 0.640 | 0.683 | 0.670 | 0.340 | 0.288 | 0.556 |
153
+ | OPT | 125M | 3.4E+20 | - | 0.292 | 0.630 | 0.503 | 0.379 | 0.435 | 0.189 | 0.166 | 0.371 |
154
+ | | 350M | 8.8E+20 | - | 0.320 | 0.644 | 0.523 | 0.452 | 0.440 | 0.207 | 0.176 | 0.395 |
155
+ | | 1.3B | 2.8E+21 | - | 0.415 | 0.717 | 0.595 | 0.579 | 0.570 | 0.234 | 0.234 | 0.478 |
156
+ | | 2.7B | 5.5E+21 | - | 0.458 | 0.738 | 0.610 | 0.637 | 0.609 | 0.268 | 0.250 | 0.510 |
157
+ | | 6.7B | 1.3E+22 | - | 0.505 | 0.763 | 0.654 | 0.677 | 0.656 | 0.307 | 0.276 | 0.548 |
158
+ | | 13B | 2.5E+22 | - | 0.524 | 0.759 | 0.651 | 0.687 | 0.671 | 0.329 | 0.270 | 0.556 |
159
+ | LLaMA   | 6.7B  |                |                | 0.761      | 0.798 | 0.701       |         | 0.728 | 0.476 | 0.572      |                    |
160
+ |         | 13B   |                |                | 0.792      | 0.801 | 0.730       |         | 0.748 | 0.527 | 0.564      |                    |
161
+
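The Downstream Average column is consistent with an unweighted mean of the seven task scores in each row. The sketch below recomputes it for two Cerebras-GPT rows; the function name `downstream_average` is hypothetical, and the equal weighting is an assumption inferred from the table values.

```python
# 0-shot scores copied from the table above.
CEREBRAS_111M = {
    "Hella-Swag": 0.268,
    "PIQA": 0.594,
    "Wino-Grande": 0.488,
    "Lambada": 0.194,
    "ARC-e": 0.380,
    "ARC-c": 0.166,
    "OpenBookQA": 0.118,
}

CEREBRAS_13B = {
    "Hella-Swag": 0.513,
    "PIQA": 0.766,
    "Wino-Grande": 0.646,
    "Lambada": 0.696,
    "ARC-e": 0.714,
    "ARC-c": 0.367,
    "OpenBookQA": 0.286,
}


def downstream_average(scores):
    """Unweighted mean over the task scores (assumed weighting)."""
    return sum(scores.values()) / len(scores)


print(f"{downstream_average(CEREBRAS_111M):.3f}")  # 0.315
print(f"{downstream_average(CEREBRAS_13B):.3f}")   # 0.570
```
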
162
+
163
+ #### 5-shot Evaluation
164
+ | Model | Count | Hella-Swag | PIQA | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA |
165
+ | -------- | ----- | ----------| ----- | ----------- | -------| ----- | ----- | ---------- |
166
+ | Cerebras | 111M | 0.267 | 0.588 | 0.475 | 0.158 | 0.356 | 0.166 | 0.136 |
167
+ | | 256M | 0.278 | 0.606 | 0.522 | 0.225 | 0.422 | 0.183 | 0.164 |
168
+ | | 590M | 0.291 | 0.634 | 0.479 | 0.281 | 0.475 | 0.206 | 0.152 |
169
+ | | 1.3B | 0.326 | 0.668 | 0.536 | 0.395 | 0.529 | 0.241 | 0.174 |
170
+ | | 2.7B | 0.382 | 0.697 | 0.543 | 0.487 | 0.590 | 0.267 | 0.224 |
171
+ | | 6.7B | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
172
+ | | 13B | 0.514 | 0.768 | 0.674 | 0.655 | 0.743 | 0.398 | 0.318 |
173
+ | Pythia | 70M | 0.269 | 0.589 | 0.491 | 0.192 | 0.399 | 0.184 | 0.148 |
174
+ | | 160M | 0.292 | 0.631 | 0.515 | 0.329 | 0.469 | 0.205 | 0.164 |
175
+ | | 410M | 0.333 | 0.669 | 0.522 | 0.448 | 0.526 | 0.229 | 0.188 |
176
+ | | 1B | 0.374 | 0.709 | 0.562 | 0.514 | 0.596 | 0.265 | 0.206 |
177
+ | | 1.4B | 0.398 | 0.712 | 0.573 | 0.553 | 0.622 | 0.274 | 0.214 |
178
+ | | 2.8B | 0.448 | 0.738 | 0.621 | 0.629 | 0.673 | 0.328 | 0.254 |
179
+ | | 6.9B | 0.478 | 0.750 | 0.646 | 0.641 | 0.699 | 0.355 | 0.296 |
180
+ | | 12B | 0.506 | 0.759 | 0.662 | 0.673 | 0.731 | 0.383 | 0.322 |
181
+ | NeoX | 20B | 0.538 | 0.774 | 0.683 | 0.698 | 0.746 | 0.410 | 0.326 |
182
+ | GPTJ | 6B | 0.494 | 0.756 | 0.660 | 0.662 | 0.705 | 0.360 | 0.310 |
183
+ | OPT | 125M | 0.289 | 0.628 | 0.520 | 0.303 | 0.426 | 0.197 | 0.166 |
184
+ | | 350M | 0.321 | 0.647 | 0.521 | 0.384 | 0.464 | 0.208 | 0.184 |
185
+ | | 1.3B | 0.413 | 0.726 | 0.597 | 0.553 | 0.604 | 0.273 | 0.230 |
186
+ | | 2.7B | 0.458 | 0.749 | 0.616 | 0.603 | 0.651 | 0.305 | 0.276 |
187
+ | | 6.7B | 0.505 | 0.773 | 0.663 | 0.660 | 0.692 | 0.340 | 0.318 |
188
+ | | 13B | 0.524 | 0.763 | 0.684 | 0.678 | 0.714 | 0.358 | 0.306 |
189
+
190
+ <br><br>
191
+
192
+ ## Uses and Limitations
193
+
194
+ ### Intended Use
195
+ The models we train are being open-sourced to further research into LLM scaling laws but are not intended for use as production models. You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or the Hugging Face Transformers Library. We recommend assessing potential bias and harms prior to deployment of any LLM.
196
+
197
+ The primary intended users of these models are AI researchers and practitioners interested in testing the behaviors, capabilities, and limitations of large-scale generative language models.
198
+
199
+ ### Out of Scope Use
200
+ Cerebras-GPT models are trained on the Pile, in English only, and are not suitable for machine translation tasks.
201
+
202
+ Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in a similar way to models that have received instruction tuning or Reinforcement Learning from Human Feedback (RLHF) like Flan-T5 or ChatGPT. Cerebras-GPT models can be tuned using those methods.
203
+
204
+ ### Risk and Bias
205
+ Like many large text corpora, the Pile contains offensive text. Cerebras-GPT models trained on this text may create offensive or undesirable text outputs regardless of whether the initial prompt is offensive. Human filtering of responses is recommended.
206
+
207
+ <br><br>
208
+
209
+ # TODO
210
+ ## Citation and Related Information
211
+
212
+ ### BibTeX entry
213
+
214
+ To cite this model:
215
+ ```bibtex
216
+ @misc{Cerebras-GPT,
217
+ author = {Nolan Dey and Gurpreet Gosal and Charles Chen and Hemant Khachane and Ribhu Pathria and William Marshall and Marvin Tom and Joel Hestness},
218
+ title = {GPT-3 Scaling Laws for the PILE Dataset, Trained on the Cerebras Wafer-Scale Engine},
219
+ year = {2023},
220
+ month = {March},
221
+ howpublished = {\url{https://www.cerebras.net/TODO/dense-scaling-laws/TODO}}
222
+ }
223
+ ```
224
+
225
+ ## Acknowledgements
226
+
227
+ We are thankful to all Cerebras engineers, past and present, who made this work possible.