metadata

language:
  - en
tags:
  - pytorch
  - causal-lm
license: apache-2.0
datasets:
  - the_pile
pipeline_tag: text-generation

Cerebras-GPT 13B

TODO: arXiv paper, TODO: Blog Post

Model Description

The Cerebras-GPT family is released to facilitate research into LLM scaling laws using open architectures and data sets, and to demonstrate the simplicity and scalability of training LLMs on the Cerebras software and hardware stack. All Cerebras-GPT models are available on Hugging Face.

The family includes 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B models.

All models in the Cerebras-GPT family were trained in accordance with the Chinchilla scaling laws (20 tokens per model parameter), which makes the training compute-optimal.

These models were trained on the Andromeda AI supercomputer, which comprises 16 CS-2 wafer-scale systems. Cerebras' weight streaming technology simplifies the training of LLMs by disaggregating compute from model storage. This allowed training to scale efficiently across nodes using simple data parallelism.

Cerebras systems for pre-training and fine-tuning are available in the cloud via the Cerebras Model Studio. Cerebras CS-2 compatible checkpoints are available in the Cerebras Model Zoo.

Model Details

Developed by: Cerebras Systems
License: Apache 2.0
Model type: Transformer-based Language Model
Architecture: GPT-3 style architecture
Data set: The Pile
Tokenizer: Byte Pair Encoding
Vocabulary Size: 50257
Sequence Length: 2048
Optimizer: AdamW, (β1, β2) = (0.9, 0.95), adam_eps = 1e−8 (1e−9 for larger models)
Positional Encoding: Learned
Language: English
Learn more: Dense Scaling Laws Paper for training procedure, config files, and details on how to use.

Contact: To ask questions about Cerebras-GPT models, join the Cerebras Discord and post them in #scaling-laws-release.

This is the standard parameterization version of Cerebras-GPT with 13B parameters.

Related models: Cerebras-GPT Models

Model	Parameters	Layers	d_model	Heads	d_head	d_ffn	Learning rate	Batch size (sequences)	Batch size (tokens)
Cerebras-GPT	111M	10	768	12	64	3072	6.00E-04	120	246K
Cerebras-GPT	256M	14	1088	17	64	4352	6.00E-04	264	541K
Cerebras-GPT	590M	18	1536	12	128	6144	2.00E-04	264	541K
Cerebras-GPT	1.3B	24	2048	16	128	8192	2.00E-04	528	1.08M
Cerebras-GPT	2.7B	32	2560	20	128	10240	2.00E-04	528	1.08M
Cerebras-GPT	6.7B	32	4096	32	128	16384	1.20E-04	1040	2.13M
Cerebras-GPT	13B	40	5120	40	128	20480	1.20E-04	720/1080	1.47M/2.21M
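The shape columns in the table above can be sanity-checked with a few lines of Python. The sketch below copies the table rows and checks that d_model = heads × d_head and d_ffn = 4 × d_model, then compares a rough parameter estimate against each model's nominal size. The estimate 12 · layers · d_model² + (vocabulary + sequence length) · d_model is a common back-of-the-envelope formula for GPT-style decoders; it ignores biases and layer norms and assumes the output head shares the token embedding matrix. It is an assumption of this example, not a figure from the Cerebras-GPT paper, and the names ROWS and approx_params are illustrative.

```python
# Rows copied from the table above:
# (name, nominal_params, layers, d_model, heads, d_head, d_ffn)
ROWS = [
    ("111M", 111e6, 10, 768, 12, 64, 3072),
    ("256M", 256e6, 14, 1088, 17, 64, 4352),
    ("590M", 590e6, 18, 1536, 12, 128, 6144),
    ("1.3B", 1.3e9, 24, 2048, 16, 128, 8192),
    ("2.7B", 2.7e9, 32, 2560, 20, 128, 10240),
    ("6.7B", 6.7e9, 32, 4096, 32, 128, 16384),
    ("13B", 13e9, 40, 5120, 40, 128, 20480),
]
VOCAB_SIZE = 50257  # from Model Details
SEQ_LEN = 2048      # from Model Details


def approx_params(layers, d_model, vocab_size=VOCAB_SIZE, seq_len=SEQ_LEN):
    """Rough parameter count (assumption: no biases or layer norms, tied embeddings)."""
    attention = 4 * d_model * d_model              # Q, K, V and output projections
    ffn = 2 * d_model * (4 * d_model)              # two linear layers, d_ffn = 4 * d_model
    embeddings = (vocab_size + seq_len) * d_model  # token + learned position embeddings
    return layers * (attention + ffn) + embeddings


for name, nominal, layers, d_model, heads, d_head, d_ffn in ROWS:
    assert heads * d_head == d_model, name
    assert d_ffn == 4 * d_model, name
    est = approx_params(layers, d_model)
    print(f"{name}: ~{est / 1e9:.3f}B estimated ({est / nominal:.1%} of nominal)")
```

Under these assumptions the estimate stays within a few percent of the nominal size for every row, which is a quick way to catch transcription errors in a shape table.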

Quickstart

This model can be loaded with the AutoModelForCausalLM class:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Cerebras/Cerebras-GPT-13B")
model = AutoModelForCausalLM.from_pretrained("Cerebras/Cerebras-GPT-13B")

text = "Generative AI is "

It can be used with Hugging Face pipelines:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
generated_text = pipe(text, max_length=50, do_sample=False, no_repeat_ngram_size=2)[0]
print(generated_text['generated_text'])

or with model.generate():

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, 
                        max_new_tokens=50, early_stopping=True,
                        no_repeat_ngram_size=2)
text_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text_output[0])

Training data

Cerebras-GPT is trained using the Pile dataset from EleutherAI. See the Pile paper for a more detailed breakdown of data sources and methodology.

Recent work finds significant duplicate data in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the total token count by 33%. Our models are trained on the Pile without deduplication, which presents an opportunity for further improvement with the deduplicated data set.

Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of TODO: our paper.

Training procedure

We use the GPT-3 style model architecture. All of our layers use full attention, as opposed to the GPT-3 style sparse banded attention. The model shapes were selected either to follow an aspect ratio of 80 or to match the shape of GPT-3 models. The learning rate was warmed up over 375M tokens (1500 steps for the 111M and 256M models) and then cosine-decayed by 10x. No dropout was used, and weight decay was set to 0.1.
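The warmup and 10x cosine decay described above can be sketched as a small schedule function. This is a minimal illustration, not the training code: it assumes the warmup is linear and that the cosine decay reaches one tenth of the peak learning rate at the final step, neither of which is stated in this card. The function name lr_at_step and its parameters are hypothetical. The example values are the 13B settings listed in this card, using the batch size of 720 from the training table.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, decay_factor=10.0):
    """Linear warmup (assumed), then cosine decay down to peak_lr / decay_factor."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    min_lr = peak_lr / decay_factor
    # Fraction of the decay phase completed, clamped to 1.0 past the final step.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# 13B values from this card: peak LR 1.2e-4, 174335 steps, sequence length 2048.
PEAK_LR = 1.2e-4
TOTAL_STEPS = 174335
TOKENS_PER_STEP = 720 * 2048
WARMUP_STEPS = round(375e6 / TOKENS_PER_STEP)  # 375M warmup tokens expressed in steps

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at_step(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS):.3e}")
```

Expressing the warmup in tokens and converting to steps keeps the schedule comparable across models with different batch sizes.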

All models were trained to the Chinchilla point: 20x more tokens than model parameters. The number of steps follows from the fixed sequence length (2048) and the batch size (which varied by model). See the training table below for details.

Model Params	Sequence Length	Batch Size	Number of Steps	Tokens	Tokens per Parameter	FLOPs
111M	2048	120	9037	2.22E+09	20	2.5E+18
256M	2048	264	9468	5.12E+09	20	1.1E+19
590M	2048	264	21836	1.18E+10	20	5.3E+19
1.3B	2048	528	24334	2.63E+10	20	2.5E+20
2.7B	2048	528	49041	5.30E+10	20	9.8E+20
6.7B	2048	1040	62522	1.33E+11	20	5.9E+21
13B	2048	720	174335	2.57E+11	20	2.1E+22
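The Tokens column in the table above follows from the other columns: tokens = batch size × sequence length × number of steps. The sketch below recomputes it from the rows of the table and compares the result with the reported value. The names TRAINING_ROWS and tokens_seen are illustrative.

```python
SEQ_LEN = 2048

# (model, batch_size_in_sequences, steps, tokens_reported), copied from the table above.
TRAINING_ROWS = [
    ("111M", 120, 9037, 2.22e9),
    ("256M", 264, 9468, 5.12e9),
    ("590M", 264, 21836, 1.18e10),
    ("1.3B", 528, 24334, 2.63e10),
    ("2.7B", 528, 49041, 5.30e10),
    ("6.7B", 1040, 62522, 1.33e11),
    ("13B", 720, 174335, 2.57e11),
]


def tokens_seen(batch_size, steps, seq_len=SEQ_LEN):
    """Total training tokens for a fixed batch size and sequence length."""
    return batch_size * seq_len * steps


for name, batch_size, steps, reported in TRAINING_ROWS:
    total = tokens_seen(batch_size, steps)
    rel_err = abs(total - reported) / reported
    print(f"{name}: {total:.3e} tokens computed, {reported:.2e} reported ({rel_err:.2%} off)")
```

Dividing the computed token count by the parameter count gives the tokens-per-parameter ratio, which the table reports as 20 for every model.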

Evaluations

We evaluate our models on the Pile validation set, which comprises 380M tokens. In our paper we also evaluate the public checkpoints of Pythia, Eleuther (2022); OPT, Zhang et al. (2022); GPT-NeoX 20B, Black et al. (2022); and GPT-J 6B, Wang & Komatsuzaki (2021). We trained models from smallest to largest and fit a power law as we went along. The power law helped us extrapolate the validation loss of the next-largest model and gave us confidence about whether each training run was going well.
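The extrapolation idea can be illustrated with a simple log-log fit. The sketch below is not the fitting procedure from the paper: it assumes a pure power law, loss = a · FLOPs^b, with no irreducible-loss term, and it uses the Pile test cross-entropy values reported for the 111M to 2.7B models in the 0-shot table of this card as stand-ins for validation loss. The names POINTS, fit_power_law, and predict_loss are hypothetical.

```python
import math

# (training FLOPs, Pile test cross-entropy) for the 111M-2.7B models,
# copied from the 0-shot evaluation table in this card.
POINTS = [
    (2.5e18, 2.566),
    (1.1e19, 2.299),
    (5.3e19, 2.184),
    (2.5e20, 1.996),
    (9.8e20, 1.834),
]


def fit_power_law(points):
    """Least-squares fit of loss = a * flops**b, done as a line in log-log space."""
    xs = [math.log(flops) for flops, _ in points]
    ys = [math.log(loss) for _, loss in points]
    n = len(points)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    a = math.exp(y_mean - b * x_mean)
    return a, b


def predict_loss(flops, a, b):
    """Evaluate the fitted power law at a given compute budget."""
    return a * flops**b


a, b = fit_power_law(POINTS)
predicted_13b = predict_loss(2.1e22, a, b)  # 13B training FLOPs from the same table
print(f"exponent b = {b:.4f}, predicted 13B loss = {predicted_13b:.3f} (reported: 1.575)")
```

Comparing the prediction with the reported 13B value mirrors how a fit on smaller models can flag a larger run that is drifting off trend.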

0-shot Evaluation

Model	Params	Training FLOPs	Pile test cross-entropy	HellaSwag	PIQA	WinoGrande	LAMBADA	ARC-e	ARC-c	OpenBookQA	Downstream Average
Cerebras-GPT	111M	2.5E+18	2.566	0.268	0.594	0.488	0.194	0.380	0.166	0.118	0.315
Cerebras-GPT	256M	1.1E+19	2.299	0.274	0.613	0.511	0.293	0.410	0.170	0.158	0.347
Cerebras-GPT	590M	5.3E+19	2.184	0.291	0.627	0.498	0.366	0.464	0.190	0.158	0.370
Cerebras-GPT	1.3B	2.5E+20	1.996	0.325	0.664	0.521	0.462	0.508	0.224	0.166	0.410
Cerebras-GPT	2.7B	9.8E+20	1.834	0.386	0.701	0.559	0.567	0.571	0.246	0.206	0.462
Cerebras-GPT	6.7B	5.9E+21	TODO	TODO	TODO	TODO	TODO	TODO	TODO	TODO	TODO
Cerebras-GPT	13B	2.1E+22	1.575	0.513	0.766	0.646	0.696	0.714	0.367	0.286	0.570
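The Downstream Average column appears to be the unweighted mean of the seven task accuracies in each row. A minimal check for the 13B row, with the values copied from the table above (the name ZERO_SHOT_13B is illustrative):

```python
from statistics import mean

# 0-shot accuracies for Cerebras-GPT 13B, copied from the table above.
ZERO_SHOT_13B = {
    "HellaSwag": 0.513,
    "PIQA": 0.766,
    "WinoGrande": 0.646,
    "LAMBADA": 0.696,
    "ARC-e": 0.714,
    "ARC-c": 0.367,
    "OpenBookQA": 0.286,
}

average = mean(ZERO_SHOT_13B.values())
print(f"Downstream average: {average:.3f}")
```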

5-shot Evaluation

Model	Params	HellaSwag	PIQA	WinoGrande	LAMBADA	ARC-e	ARC-c	OpenBookQA
Cerebras-GPT	111M	0.267	0.588	0.475	0.158	0.356	0.166	0.136
Cerebras-GPT	256M	0.278	0.606	0.522	0.225	0.422	0.183	0.164
Cerebras-GPT	590M	0.291	0.634	0.479	0.281	0.475	0.206	0.152
Cerebras-GPT	1.3B	0.326	0.668	0.536	0.395	0.529	0.241	0.174
Cerebras-GPT	2.7B	0.382	0.697	0.543	0.487	0.590	0.267	0.224
Cerebras-GPT	6.7B	TODO	TODO	TODO	TODO	TODO	TODO	TODO
Cerebras-GPT	13B	0.514	0.768	0.674	0.655	0.743	0.398	0.318

Uses and Limitations

Intended Use

We are open-sourcing these models primarily to further research into LLM scaling laws, and we release them under the fully permissive Apache 2.0 license for the community to use freely.

You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras Model Studio or the Hugging Face Transformers Library. We recommend assessing potential bias and harms prior to deployment of any LLM.

Out of Scope Use

Cerebras-GPT models are trained on English-language data from the Pile and are not suitable for machine translation tasks.

Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in a similar way to models that have received instruction tuning or reinforcement learning from human feedback (RLHF) like Flan-T5 or ChatGPT. Cerebras-GPT models can be tuned using those methods.

Risk and Bias

Like many large text corpora, the Pile contains offensive text. Cerebras-GPT models trained on this text may create offensive or undesirable text outputs regardless of whether the initial prompt is offensive. Human filtering of responses is recommended.

Citation and Related Information

BibTeX entry

To cite this model:

@misc{Cerebras-GPT,
  author = {Nolan Dey and Gurpreet Gosal and Charles Chen and Hemant Khachane and Ribhu Pathria and William Marshall and Marvin Tom and Joel Hestness},
  title = {GPT-3 Scaling Laws for the PILE Dataset, Trained on the Cerebras Wafer-Scale Engine},
  year = {2023},
  month = {March},
  howpublished = {\url{https://www.cerebras.net/TODO/dense-scaling-laws/TODO}}
}

Acknowledgements

We are thankful to all Cerebras engineers, past and present, who made this work possible.