jacobfulano committed on
Commit faf0584
1 Parent(s): 6af8a50

Update README.md

Files changed (1)
  1. README.md +24 -21
README.md CHANGED
@@ -7,46 +7,50 @@ tags:
 - StreamingDatasets
 ---

- # MPT-7B (Base)

- MPT-7B (Base) is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code.
 This model was trained by [MosaicML](https://www.mosaicml.com) and is **open-sourced for commercial use** (_Apache-2.0_).

 MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.

- These architectural changes include performance-optimized layer implementations, changes that provide greater training stability, and the elimination of context length limits by replacing
 positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
- Thanks to these modifications, MPT models can be trained with high throughput efficiency and highly stable convergence.
 MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).

- This model uses the MosaicML LLM codebase, which can be found in the [llm-foundry repository](https://github.com/mosaicml/llm-foundry), and was built by MosaicML’s NLP team on the [MosaicML platform](https://www.mosaicml.com/training) for pretraining, finetuning and/or deploying LLMs for inference.

 ### How is this model different?

 * **Licensed for commercial use** (unlike [LLaMA](https://arxiv.org/abs/2302.13971)).
 * **Trained on a large amount of data** (1T tokens like [LLaMA](https://arxiv.org/abs/2302.13971) vs. 300B for [Pythia](https://github.com/EleutherAI/pythia), 300B for [OpenLLaMA](https://github.com/openlm-research/open_llama), and 800B for [StableLM](https://github.com/Stability-AI/StableLM)).
- * **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409) (we trained on up to 65k inputs and can handle up to 84k vs. 2k-4k for other open source models).
- * **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and FasterTransformer)
 * **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry)

- ### Models finetuned off MPT-7B (Base):

 * [MPT-7B-StoryWriter-65k+](https://huggingface.co/mosaicml/mpt-7b-storywriter): a model designed to read and write fictional stories with super long context lengths.
- It is built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the [books3 dataset](https://huggingface.co/datasets/the_pile_books3).
 At inference time, thanks to [ALiBi](https://arxiv.org/abs/2108.12409), MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens.
- We demonstrate generations as long as 80k tokens on a single A100-80GB GPU in our blogpost {HERE}.
 * License: _Apache-2.0_ (commercial use permitted)

 * [MPT-7B-Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct): a model for short-form instruction following.
- It is built by finetuning MPT-7B on a [dataset](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) we also release, derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
 * License: _CC-By-SA-3.0_ (commercial use permitted)
- * [Online Demo on HuggingFace Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)

 * [MPT-7B-Chat](TBD): a chatbot-like model for dialogue generation.
- It is built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3),
 [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets.
 * License: _CC-By-NC-SA-4.0_ (non-commercial use only)
- * [Online Demo on HuggingFace Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-chat)

 ## Model Date
 
@@ -65,16 +69,15 @@ Apache-2.0 (commercial use permitted)
 
 ## How to Use

- This model is best used with the MosaicML [llm-foundry repository](https://github.com/mosaicml/llm-foundry) for training, finetuning, evaluating, and deploying LLMs for inference.
-
- Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
- This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
- `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.

 ```python
 import transformers
 model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
 ```

 To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention (`pip install flash_attn`), you can load the model with `attn_impl='triton'` and move the model to `bfloat16`:
 ```python
@@ -85,7 +88,7 @@ model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', con
 model.to(device='cuda:0')
 ```

- Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or deployment. For example:

 ```python
 config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
@@ -156,7 +159,6 @@ most of which are relevant for tokenizing code:
 (2) It applies consistent space delimitation, unlike the GPT2 tokenizer, which tokenizes inconsistently depending on the presence of prefix spaces.
 (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.

-
 The model vocabulary size of 50432 was set to be a multiple of 128 (as in [MEGATRON-LM](https://arxiv.org/abs/1909.08053)), which increased model flop utilization (MFU) by up to four percentage points.

 ### Training Configuration
@@ -170,6 +172,7 @@ _The following language is modified from [EleutherAI's GPT-NeoX-20B](https://hug
 
 MPT-7B (Base) is **not** intended for deployment without finetuning.
 It should not be used for human-facing interactions without further guardrails and user consent.
 MPT-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information.
 MPT-7B was trained on various public datasets detailed below, including [C4](https://huggingface.co/datasets/c4), the colossal, cleaned version of Common Crawl's web crawl corpus.
 While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
 
 - StreamingDatasets
 ---

+ # MPT-7B

+ MPT-7B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code.
 This model was trained by [MosaicML](https://www.mosaicml.com) and is **open-sourced for commercial use** (_Apache-2.0_).

 MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.

+ These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing
 positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
+ Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence.
 MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).

+ This model uses the MosaicML LLM codebase, which can be found in the [llm-foundry repository](https://github.com/mosaicml/llm-foundry). It was trained by MosaicML’s NLP team on the [MosaicML platform](https://www.mosaicml.com/training) for LLM pretraining, finetuning, and inference.

 ### How is this model different?

+ MPT-7B is
+
 * **Licensed for commercial use** (unlike [LLaMA](https://arxiv.org/abs/2302.13971)).
 * **Trained on a large amount of data** (1T tokens like [LLaMA](https://arxiv.org/abs/2302.13971) vs. 300B for [Pythia](https://github.com/EleutherAI/pythia), 300B for [OpenLLaMA](https://github.com/openlm-research/open_llama), and 800B for [StableLM](https://github.com/Stability-AI/StableLM)).
+ * **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409) (we finetuned [MPT-7B-StoryWriter-65k+](https://huggingface.co/mosaicml/mpt-7b-storywriter) on inputs of up to 65k tokens, and it can handle inputs of up to 84k tokens, vs. 2k-4k for other open source models).
+ * **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer))
 * **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry)

+ ### Models finetuned off MPT-7B
+
+ The following models are finetuned on MPT-7B:

 * [MPT-7B-StoryWriter-65k+](https://huggingface.co/mosaicml/mpt-7b-storywriter): a model designed to read and write fictional stories with super long context lengths.
+ Built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the [books3 dataset](https://huggingface.co/datasets/the_pile_books3).
 At inference time, thanks to [ALiBi](https://arxiv.org/abs/2108.12409), MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens.
+ We demonstrate generations as long as 80k tokens on a single A100-80GB GPU in our [blogpost](https://www.mosaicml.com/blog/mpt-7b).
 * License: _Apache-2.0_ (commercial use permitted)

 * [MPT-7B-Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct): a model for short-form instruction following.
+ Built by finetuning MPT-7B on a [dataset](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) we also release, derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
 * License: _CC-By-SA-3.0_ (commercial use permitted)
+ * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)

 * [MPT-7B-Chat](TBD): a chatbot-like model for dialogue generation.
+ Built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3),
 [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets.
 * License: _CC-By-NC-SA-4.0_ (non-commercial use only)
+ * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-chat)

 ## Model Date

 ## How to Use

+ This model is best used with the MosaicML [llm-foundry repository](https://github.com/mosaicml/llm-foundry) for training and finetuning.

 ```python
 import transformers
 model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
 ```
+ Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
+ This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
+ `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
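
Once loaded, the model can be dropped into a standard Hugging Face text-generation pipeline. The snippet below is a minimal sketch: it assumes the checkpoint also ships a compatible tokenizer loadable via `AutoTokenizer` (not shown in this excerpt), and the generation settings are purely illustrative.

```python
import transformers

# Load the model (trust_remote_code is required for the custom MPT architecture).
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)

# Assumption: the checkpoint ships a tokenizer alongside the model weights.
tokenizer = transformers.AutoTokenizer.from_pretrained('mosaicml/mpt-7b')

# Standard Hugging Face text-generation pipeline; the settings below are illustrative.
pipe = transformers.pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)
print(pipe('MosaicML is', max_new_tokens=50, do_sample=True, temperature=0.8)[0]['generated_text'])
```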
 
 To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention (`pip install flash_attn`), you can load the model with `attn_impl='triton'` and move the model to `bfloat16`:
 ```python
 model.to(device='cuda:0')
 ```
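
The middle of the snippet above falls outside this diff hunk. Putting the pieces together from the surrounding prose, a complete version might look like the sketch below; the `attn_config['attn_impl']` key and the `torch_dtype=torch.bfloat16` argument are assumptions inferred from that prose, not lines shown in this excerpt.

```python
import torch
import transformers

# Load the config first so the attention implementation can be switched to triton FlashAttention.
config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'  # assumption: config key for the attention implementation

# Load the weights in bfloat16 and move the model to a GPU.
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b',
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.to(device='cuda:0')
```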
 
+ Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:

 ```python
 config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
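# The rest of this example falls outside the diff hunk shown here. As a hedged
# sketch (the `max_seq_len` config key is an assumption inferred from the prose,
# not a line confirmed by this excerpt), the maximum sequence length could be
# raised via the config before loading the model:
config.update({'max_seq_len': 4096})
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', config=config, trust_remote_code=True)
```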
 
 (2) It applies consistent space delimitation, unlike the GPT2 tokenizer, which tokenizes inconsistently depending on the presence of prefix spaces.
 (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.
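
To make property (3) concrete, the hedged sketch below counts tokens for a snippet containing a long run of spaces; loading the tokenizer directly from the `mosaicml/mpt-7b` checkpoint is an assumption, since the tokenizer name is not stated in this excerpt.

```python
import transformers

# Assumption: the MPT-7B checkpoint ships its tokenizer; adjust the name if needed.
tokenizer = transformers.AutoTokenizer.from_pretrained('mosaicml/mpt-7b')

# Indented code exercises the dedicated repeated-space tokens described in (3):
# a long run of spaces should map to few tokens, compressing source code well.
snippet = "def f(x):\n        return x + 1"
tokens = tokenizer.tokenize(snippet)
print(len(tokens), tokens)
```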
 
 The model vocabulary size of 50432 was set to be a multiple of 128 (as in [MEGATRON-LM](https://arxiv.org/abs/1909.08053)), which increased model flop utilization (MFU) by up to four percentage points.

 ### Training Configuration

 MPT-7B (Base) is **not** intended for deployment without finetuning.
 It should not be used for human-facing interactions without further guardrails and user consent.
+
 MPT-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information.
 MPT-7B was trained on various public datasets detailed below, including [C4](https://huggingface.co/datasets/c4), the colossal, cleaned version of Common Crawl's web crawl corpus.
 While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.