---
license: apache-2.0
tags:
- Composer
- MosaicML
- llm-foundry
---

# MPT-7B-StoryWriter-65k+

MPT-7B-StoryWriter-65k+ is a model designed to read and write fictional stories with super long context lengths.
It was built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the [books3 dataset](https://huggingface.co/datasets/the_pile_books3).
At inference time, thanks to [ALiBi](https://arxiv.org/abs/2108.12409), MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens.
We demonstrate generations as long as 84k tokens on a single A100-80GB GPU in our [blog post](https://www.mosaicml.com/blog/mpt-7b).
* License: _Apache-2.0_ (commercial use permitted)

This model was trained by [MosaicML](https://www.mosaicml.com) and follows a modified decoder-only transformer architecture.

## Model Date

May 5, 2023

## Model License

Apache-2.0 (commercial use permitted)

## Documentation

* [Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
* [Codebase (mosaicml/llm-foundry repo)](https://github.com/mosaicml/llm-foundry/)
* Questions: Feel free to contact us via the [MosaicML Community Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)!

## How to Use

Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we use a custom model architecture that is not yet part of the `transformers` package.

The MPT architecture includes options for many training-efficiency features such as [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), QK LayerNorm, and more.

```python
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-storywriter', trust_remote_code=True, torch_dtype=torch.bfloat16)
```

To use the optimized triton implementation of FlashAttention, you can set `attn_impl='triton'` in the model config and move the model to `bfloat16` like so:

```python
config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b-storywriter', trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-storywriter', config=config, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.to(device='cuda:0')
```

Although the model was trained with a sequence length of 2048 and finetuned with a sequence length of 65536, ALiBi enables users to increase the maximum sequence length even further during finetuning and/or inference. For example:

```python
config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b-storywriter', trust_remote_code=True)
config.update({"max_seq_len": 83968})  # (input + output) tokens can now be up to 83968
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-storywriter', config=config, trust_remote_code=True)
```

This model was trained with the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```
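
With the model and tokenizer loaded as above, text can be generated through the standard Hugging Face `generate` API. The snippet below is a minimal illustrative sketch; the prompt and sampling settings are placeholders, not recommendations from this card:

```python
import torch

# Illustrative prompt; not taken from the original model card.
prompt = "Once upon a time, in a quiet village by the sea,"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
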
## Model Description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways:
* It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf)
* It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings (see the sketch below)
* It does not use biases

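As a rough illustration of how ALiBi replaces positional embeddings, here is a minimal sketch of the linear attention biases it adds, following the formulation in the ALiBi paper (an illustrative sketch, not code taken from llm-foundry):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Illustrative ALiBi bias of shape (n_heads, seq_len, seq_len).

    Head h gets slope m_h = 2 ** (-8 * h / n_heads); the bias added to the
    attention logits is -m_h * |i - j| for query position i and key position j.
    """
    slopes = torch.tensor([2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)])
    positions = torch.arange(seq_len)
    distances = (positions[None, :] - positions[:, None]).abs()  # (seq_len, seq_len)
    return -slopes[:, None, None] * distances[None, :, :]

# Because the bias depends only on relative distance and has no learned
# parameters, the maximum sequence length can be raised at inference time,
# which is what enables extrapolation beyond the 65k finetuning context.
bias = alibi_bias(n_heads=32, seq_len=8)
```
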
| Hyperparameter | Value |
|----------------|-------|
| n_parameters | 6.7B |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 50432 |
| sequence length | **65536** |

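As a back-of-the-envelope check on the table, the 6.7B figure can be roughly reproduced from `n_layers`, `d_model`, and the vocabulary size, assuming a standard 4x MLP expansion, no biases, tied input/output embeddings, and ignoring norm parameters (these assumptions are ours, for illustration only):

```python
n_layers, d_model, vocab_size = 32, 4096, 50432

# Per transformer block (no biases): ~4*d^2 for the attention projections
# plus ~8*d^2 for a 4x-expansion MLP, i.e. ~12*d^2 parameters.
block_params = 12 * d_model ** 2
embedding_params = vocab_size * d_model  # tied with the output projection

total = n_layers * block_params + embedding_params
print(f"{total / 1e9:.2f}B parameters")  # ~6.65B, consistent with the 6.7B above
```
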
## PreTraining Data

For more details on the pretraining process, see [MPT-7B](https://huggingface.co/mosaicml/mpt-7b).

The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

## Limitations and Biases

_The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_

MPT-7B-StoryWriter-65k+ can produce factually incorrect output, and should not be relied on to produce factually accurate information.
MPT-7B-StoryWriter-65k+ was trained on various public datasets.
While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

## Acknowledgements

This model was finetuned by Alex Trott and the MosaicML NLP team.

## Citation

Please cite this model using the following format:

```
@online{MosaicML2023Introducing,
    author    = {MosaicML NLP Team},
    title     = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
    year      = {2023},
    url       = {https://www.mosaicml.com/blog/mpt-7b},
    note      = {Accessed: 2023-03-28}, % change this date
    urldate   = {2023-03-28} % change this date
}
```