---
license: apache-2.0
datasets:
- togethercomputer/RedPajama-Data-1T
---
# MPT-1B-RedPajama
MPT-1B-RedPajama is a 1B parameter decoder-only transformer trained on the [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
The model was trained for 200B tokens by sampling from the subsets of the RedPajama dataset in the same proportions as were used by the [Llama series of models](https://arxiv.org/abs/2302.13971).
This model was trained by [MosaicML](https://www.mosaicml.com) and follows the MPT architecture.
## Model Date
April 19, 2023
## How to Use
Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code.
```python
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mosaic-llama-redpajama-final-candidate',
    trust_remote_code=True,
)
```
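To generate text, the model can be paired with the GPT-NeoX tokenizer mentioned under Training Data below. The sketch loads that tokenizer from the `EleutherAI/gpt-neox-20b` repository, which is an assumption about where a compatible copy lives rather than something this card specifies.
```python
# Hedged example: assumes the EleutherAI/gpt-neox-20b copy of the GPT-NeoX
# tokenizer matches this model's vocabulary.
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mosaic-llama-redpajama-final-candidate',
    trust_remote_code=True,
)

inputs = tokenizer('MosaicML is', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```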
## Model Description
This model uses the MPT architecture, which can be found in the [MosaicML Examples Repository](https://github.com/mosaicml/examples/tree/v0.0.4/examples/llm).
The MPT architecture is a modification of a standard decoder-only transformer.
The transformer has 24 layers, 16 attention heads, and width 2048.
The model has been modified from a standard transformer in the following ways:
* It uses FlashAttention.
* It uses ALiBi position encodings.
* It does not use biases.
* It applies layernorm to the keys and queries in the attention operation (see the sketch after this list).
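The following is a minimal, self-contained sketch of how ALiBi position biases, bias-free projections, and query/key layernorm fit together in a single attention layer. It is illustrative only and is not the MosaicML implementation (which uses FlashAttention and Triton kernels); all names and shapes are assumptions.
```python
# Illustrative sketch of the listed modifications; not the MPT source code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric per-head slopes from the ALiBi paper (Press et al. 2021),
    # assuming a power-of-two number of heads.
    start = 2 ** (-8 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

class SketchAttention(nn.Module):
    def __init__(self, d_model: int = 2048, n_heads: int = 16):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # No biases, matching the modification listed above.
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Layernorm applied to the queries and keys before attention.
        self.q_ln = nn.LayerNorm(d_model)
        self.k_ln = nn.LayerNorm(d_model)
        self.register_buffer('slopes', alibi_slopes(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = self.q_ln(q), self.k_ln(k)
        # Split heads: (b, n_heads, s, d_head).
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # ALiBi: add a per-head linear penalty proportional to key-query distance
        # instead of using learned or rotary position embeddings.
        pos = torch.arange(s, device=x.device)
        dist = pos[None, :] - pos[:, None]           # (s, s); negative for past keys
        scores = scores + self.slopes.view(-1, 1, 1) * dist
        # Causal mask: future positions (dist > 0) are disallowed.
        scores = scores.masked_fill(dist > 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, s, d))
```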
## Training Data
The model was trained for 200B tokens (batch size 2200, sequence length 2048). It was trained on the following data mix:
* 67% RedPajama Common Crawl
* 15% [C4](https://huggingface.co/datasets/c4)
* 4.5% RedPajama GitHub
* 4.5% RedPajama Wikipedia
* 4.5% RedPajama Books
* 2.5% RedPajama Arxiv
* 2% RedPajama StackExchange
This is the same mix of data as was used in the [Llama series of models](https://arxiv.org/abs/2302.13971).
Each sample was chosen from one of the datasets, with the dataset selected with the probability specified above.
The examples were shuffled within each dataset.
Each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.
The data was tokenized using the GPT-NeoX tokenizer.
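Below is a rough sketch, not the actual data pipeline, of the scheme just described: a source is drawn according to the mix probabilities above, and shuffled documents from that source are concatenated until a 2048-token example is filled. Loading the GPT-NeoX tokenizer from `EleutherAI/gpt-neox-20b`, the source names in `MIX`, and the `doc_iterators` argument are assumptions for illustration.
```python
# Sketch of mixture sampling and sequence packing; not MosaicML's pipeline.
import random
from transformers import AutoTokenizer

MIX = {
    'redpajama_common_crawl':  0.67,
    'c4':                      0.15,
    'redpajama_github':        0.045,
    'redpajama_wikipedia':     0.045,
    'redpajama_books':         0.045,
    'redpajama_arxiv':         0.025,
    'redpajama_stackexchange': 0.02,
}
SEQ_LEN = 2048

# Assumption: the GPT-NeoX tokenizer is taken from the EleutherAI hub repo.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

def pack_example(doc_iterators):
    """Build one SEQ_LEN-token training example.

    `doc_iterators` maps each source name in MIX to an iterator over
    (already shuffled) raw-text documents from that source.
    """
    # Choose the source dataset with the mix probabilities above.
    source = random.choices(list(MIX), weights=list(MIX.values()), k=1)[0]
    tokens = []
    # Concatenate documents from that source until the sequence is full.
    while len(tokens) < SEQ_LEN:
        doc = next(doc_iterators[source])
        tokens.extend(tokenizer(doc)['input_ids'])
    return tokens[:SEQ_LEN]
```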
## Acknowledgements
This model builds on the work of [Together](https://www.together.xyz), which created the RedPajama dataset with the goal of mimicking the training data used to create the Llama series of models.
We gratefully acknowledge the hard work of the team that put together this dataset, and we hope this model serves as a useful companion to that work.
We also gratefully acknowledge the work of the researchers who created the Llama series of models, which was the impetus both for our efforts and for the RedPajama project.