Change pipeline_tag to text generation, add placeholders for paper links, incorporate SL change recommendations
README.md
CHANGED
@@ -7,10 +7,11 @@ tags:
license: apache-2.0
datasets:
- the_pile
-
+ pipeline_tag: text-generation
---

# Cerebras-GPT 111M
+ [TODO: arXiv paper](https://www.cerebras.net), [TODO: Blog Post](https://www.cerebras.net)

## Model Description

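The `pipeline_tag: text-generation` added above maps straight onto the standard Transformers `pipeline` API. A minimal sketch, assuming the checkpoint is published on the Hub as `cerebras/Cerebras-GPT-111M` (inferred from the card title, not stated in this diff):

```python
# Minimal text-generation sketch; the Hub id below is assumed from the card
# title rather than taken from the diff itself.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "cerebras/Cerebras-GPT-111M"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Generative AI is ", max_new_tokens=50, do_sample=False)[0]["generated_text"])
```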
@@ -18,7 +19,7 @@ The Cerebras-GPT family is released to facilitate research into LLM scaling laws

The family includes 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B models.

- All models in the Cerebras-GPT family have been trained in accordance with [Chinchilla scaling laws](https://arxiv.org/abs/2203.15556) (20 tokens per model parameter) which
+ All models in the Cerebras-GPT family have been trained in accordance with [Chinchilla scaling laws](https://arxiv.org/abs/2203.15556) (20 tokens per model parameter), which is compute-optimal.

These models were trained on the [Andromeda](https://www.cerebras.net/andromeda/) AI supercomputer comprised of 16 CS-2 wafer scale systems. Cerebras' [weight streaming technology](https://www.cerebras.net/blog/linear-scaling-made-possible-with-weight-streaming) simplifies the training of LLMs by disaggregating compute from model storage. This allowed for efficient scaling of training across nodes using simple data parallelism.

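As a rough illustration of what the Chinchilla recipe implies, the 20-tokens-per-parameter rule can be turned into per-model token budgets. The figures below are back-of-the-envelope arithmetic from the nominal parameter counts, not numbers quoted in the card:

```python
# Back-of-the-envelope Chinchilla token budgets: 20 training tokens per
# model parameter. Parameter counts are the nominal family sizes; the
# resulting budgets are illustrative only.
PARAMS = {
    "111M": 111e6, "256M": 256e6, "590M": 590e6,
    "1.3B": 1.3e9, "2.7B": 2.7e9, "6.7B": 6.7e9, "13B": 13e9,
}
TOKENS_PER_PARAM = 20

for name, n_params in PARAMS.items():
    print(f"{name:>5}: ~{TOKENS_PER_PARAM * n_params / 1e9:.1f}B training tokens")
```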
@@ -28,7 +29,7 @@ Cerebras systems for pre-training and fine tuning are available in the cloud via
* Developed by: [Cerebras Systems](https://www.cerebras.net/)
* License: Apache 2.0
* Model type: Transformer-based Language Model
- * Architecture: GPT-
+ * Architecture: GPT-3 style architecture
* Data set: The Pile
* Tokenizer: Byte Pair Encoding
* Vocabulary Size: 50257
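The bullet list above names a byte-pair-encoding tokenizer with a 50257-entry vocabulary. A quick way to check both, again assuming the `cerebras/Cerebras-GPT-111M` Hub id:

```python
# Sketch: inspect the BPE tokenizer described in the model facts above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-111M")  # assumed id
print(tokenizer.vocab_size)  # the card lists 50257
print(tokenizer.tokenize("Cerebras-GPT is trained on the Pile."))
```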
@@ -109,7 +110,7 @@ Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a

## Training procedure

- We use the GPT-
+ We use the GPT-3 style model architecture. All of our layers use full attention, as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow an aspect ratio of 80 or match the shapes of the GPT-3 models. The learning rate was warmed up over 375M tokens (1500 steps for the 111M and 256M models) and then followed a 10x cosine decay, down to one tenth of the peak rate. No dropout was used and weight decay was set to 0.1.

All models were trained to Chinchilla point: 20x more tokens than model parameters. Number of steps changed based on fixed batch size (2048) and sequence length (varied by model). See Training Table, below, for detail.

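The new training-procedure paragraph describes a warmup-then-cosine learning-rate schedule. Below is a sketch of that shape, assuming a linear warmup over 1500 steps and a decay floor at one tenth of the peak rate; the peak learning rate and total step count are placeholders, since the per-model values live in the card's training table rather than in this hunk:

```python
import math

def lr_at_step(step, max_lr, total_steps, warmup_steps=1500, min_ratio=0.1):
    """Linear warmup followed by cosine decay to min_ratio * max_lr.

    max_lr and total_steps are placeholders; the card's training table
    carries the per-model settings.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# Hypothetical peak LR of 6e-4 over 10,000 total steps.
for s in (0, 750, 1500, 5000, 10000):
    print(s, f"{lr_at_step(s, max_lr=6e-4, total_steps=10_000):.2e}")
```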
@@ -192,14 +193,15 @@ We evaluate our models on the PILE validation set comprising 380M tokens. We als
## Uses and Limitations

### Intended Use
- The models we train are being open-sourced to further research into LLM scaling laws but
- The primary intended users of these models are AI researchers and practitioners interested in testing the behaviors, capabilities, and limitations of large-scale generative language models.
+ The models we train are being open-sourced to further research into LLM scaling laws, but we release these models with a fully permissive Apache license for the community to use freely.
+
+ You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or the Hugging Face Transformers library. We recommend assessing potential bias and harms prior to deployment of any LLM.

### Out of Scope Use
Cerebras-GPT models are trained on the Pile, with English language only, and are not suitable for machine translation tasks.

- Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in a similar way to models that have received instruction tuning or
+ Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in the same way as models that have received instruction tuning or reinforcement learning from human feedback (RLHF), such as Flan-T5 or ChatGPT. Cerebras-GPT models can be tuned using those methods.

### Risk and Bias
Like many large text corpora, the Pile contains offensive text. Cerebras-GPT models trained on this text may create offensive or undesirable text outputs regardless of whether the initial prompt is offensive. Human filtering of responses is recommended.
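The Intended Use text above points to fine-tuning through the Hugging Face Transformers library. A minimal causal-LM fine-tuning sketch; the Hub id, dataset, and hyperparameters are placeholders rather than settings from the card:

```python
# Minimal causal-LM fine-tuning sketch with the Transformers Trainer.
# Hub id, dataset, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "cerebras/Cerebras-GPT-111M"    # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style vocab has no pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder corpus: any dataset with a "text" column works the same way.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cerebras-gpt-111m-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```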