jacobfulano committed on
Commit faf0584
1 Parent(s): 6af8a50

Update README.md

Files changed (1)
  1. README.md +24 -21
README.md CHANGED
@@ -7,46 +7,50 @@ tags:
 - StreamingDatasets
 ---

- # MPT-7B (Base)

- MPT-7B (Base) is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code.
 This model was trained by [MosaicML](https://www.mosaicml.com) and is **open-sourced for commercial use** (_Apache-2.0_).

 MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.

- These architectural changes include performance-optimized layer implementations, changes that provide greater training stability, and the elimination of context length limits by replacing
 positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
- Thanks to these modifications, MPT models can be trained with high throughput efficiency and highly stable convergence.
 MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).

- This model uses the MosaicML LLM codebase, which can be found in the [llm-foundry repository](https://github.com/mosaicml/llm-foundry), and was built by MosaicML’s NLP team on the [MosaicML platform](https://www.mosaicml.com/training) for pretraining, finetuning and/or deploying LLMs for inference.

 ### How is this model different?

 * **Licensed for commercial use** (unlike [LLaMA](https://arxiv.org/abs/2302.13971)).
 * **Trained on a large amount of data** (1T tokens like [LLaMA](https://arxiv.org/abs/2302.13971) vs. 300B for [Pythia](https://github.com/EleutherAI/pythia), 300B for [OpenLLaMA](https://github.com/openlm-research/open_llama), and 800B for [StableLM](https://github.com/Stability-AI/StableLM)).
- * **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409) (we trained on up to 65k inputs and can handle up to 84k vs. 2k-4k for other open source models).
- * **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and FasterTransformer)
 * **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry)

- ### Models finetuned off MPT-7B (Base):

 * [MPT-7B-StoryWriter-65k+](https://huggingface.co/mosaicml/mpt-7b-storywriter): a model designed to read and write fictional stories with super long context lengths.
- It is built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the [books3 dataset](https://huggingface.co/datasets/the_pile_books3).
 At inference time, thanks to [ALiBi](https://arxiv.org/abs/2108.12409), MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens.
- We demonstrate generations as long as 80k tokens on a single A100-80GB GPU in our blogpost {HERE}.
 * License: _Apache-2.0_ (commercial use permitted)

 * [MPT-7B-Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct): a model for short-form instruction following.
- It is built by finetuning MPT-7B on a [dataset](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) we also release, derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
 * License: _CC-By-SA-3.0_ (commercial use permitted)
- * [Online Demo on HuggingFace Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)

 * [MPT-7B-Chat](TBD): a chatbot-like model for dialogue generation.
- It is built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3),
 [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets.
 * License: _CC-By-NC-SA-4.0_ (non-commercial use only)
- * [Online Demo on HuggingFace Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-chat)

 ## Model Date
 
@@ -65,16 +69,15 @@ Apache-2.0 (commercial use permitted)
 
 ## How to Use

- This model is best used with the MosaicML [llm-foundry repository](https://github.com/mosaicml/llm-foundry) for training, finetuning, evaluating, and deploying LLMs for inference.
-
- Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
- This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
- `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.

 ```python
 import transformers
 model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
 ```

 To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention (`pip install flash_attn`), you can load the model with `attn_impl='triton'` and move the model to `bfloat16`:
 ```python
@@ -85,7 +88,7 @@ model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', con
 model.to(device='cuda:0')
 ```

- Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or deployment. For example:

 ```python
 config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
@@ -156,7 +159,6 @@ most of which are relevant for tokenizing code:
 (2) It applies consistent space delimitation, unlike the GPT2 tokenizer, which tokenizes inconsistently depending on the presence of prefix spaces.
 (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.

-
 The model vocabulary size of 50432 was set to be a multiple of 128 (as in [MEGATRON-LM](https://arxiv.org/abs/1909.08053)), which increased model flop utilization (MFU) by up to four percentage points.

 ### Training Configuration
@@ -170,6 +172,7 @@ _The following language is modified from [EleutherAI's GPT-NeoX-20B](https://hug
 
 MPT-7B (Base) is **not** intended for deployment without finetuning.
 It should not be used for human-facing interactions without further guardrails and user consent.
 MPT-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information.
 MPT-7B was trained on various public datasets detailed below, including [C4](https://huggingface.co/datasets/c4), the colossal, cleaned version of Common Crawl's web crawl corpus.
 While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
 
 - StreamingDatasets
 ---

+ # MPT-7B

+ MPT-7B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code.
 This model was trained by [MosaicML](https://www.mosaicml.com) and is **open-sourced for commercial use** (_Apache-2.0_).

 MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.

+ These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing
 positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
+ Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence.
 MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).

+ This model uses the MosaicML LLM codebase, which can be found in the [llm-foundry repository](https://github.com/mosaicml/llm-foundry). It was trained by MosaicML’s NLP team on the [MosaicML platform](https://www.mosaicml.com/training) for LLM pretraining, finetuning, and inference.

 ### How is this model different?

+ MPT-7B is
+
 * **Licensed for commercial use** (unlike [LLaMA](https://arxiv.org/abs/2302.13971)).
 * **Trained on a large amount of data** (1T tokens like [LLaMA](https://arxiv.org/abs/2302.13971) vs. 300B for [Pythia](https://github.com/EleutherAI/pythia), 300B for [OpenLLaMA](https://github.com/openlm-research/open_llama), and 800B for [StableLM](https://github.com/Stability-AI/StableLM)).
+ * **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409) (we finetuned [MPT-7B-StoryWriter-65k+](https://huggingface.co/mosaicml/mpt-7b-storywriter) on inputs of up to 65k tokens, and it can handle inputs of up to 84k tokens, vs. 2k-4k for other open source models).
+ * **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer))
 * **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry)

+ ### Models finetuned off MPT-7B
+
+ The following models are finetuned on MPT-7B:

 * [MPT-7B-StoryWriter-65k+](https://huggingface.co/mosaicml/mpt-7b-storywriter): a model designed to read and write fictional stories with super long context lengths.
+ Built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the [books3 dataset](https://huggingface.co/datasets/the_pile_books3).
 At inference time, thanks to [ALiBi](https://arxiv.org/abs/2108.12409), MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens.
+ We demonstrate generations as long as 80k tokens on a single A100-80GB GPU in our [blogpost](https://www.mosaicml.com/blog/mpt-7b).
 * License: _Apache-2.0_ (commercial use permitted)

 * [MPT-7B-Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct): a model for short-form instruction following.
+ Built by finetuning MPT-7B on a [dataset](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) we also release, derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
 * License: _CC-By-SA-3.0_ (commercial use permitted)
+ * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)

 * [MPT-7B-Chat](TBD): a chatbot-like model for dialogue generation.
+ Built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3),
 [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets.
 * License: _CC-By-NC-SA-4.0_ (non-commercial use only)
+ * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-chat)

 ## Model Date

 ## How to Use

+ This model is best used with the MosaicML [llm-foundry repository](https://github.com/mosaicml/llm-foundry) for training and finetuning.

 ```python
 import transformers
 model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
 ```
+ Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
+ This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
+ `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
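
Once loaded, the model can be dropped into a standard Hugging Face text-generation pipeline. The snippet below is a minimal sketch: it assumes the checkpoint also ships a compatible tokenizer loadable via `AutoTokenizer` (not shown in this excerpt), and the generation settings are purely illustrative.

```python
import transformers

# Load the model (trust_remote_code is required for the custom MPT architecture).
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)

# Assumption: the checkpoint ships a tokenizer alongside the model weights.
tokenizer = transformers.AutoTokenizer.from_pretrained('mosaicml/mpt-7b')

# Standard Hugging Face text-generation pipeline; the settings below are illustrative.
pipe = transformers.pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)
print(pipe('MosaicML is', max_new_tokens=50, do_sample=True, temperature=0.8)[0]['generated_text'])
```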
 
 To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention (`pip install flash_attn`), you can load the model with `attn_impl='triton'` and move the model to `bfloat16`:
 ```python
 model.to(device='cuda:0')
 ```
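
The middle of the snippet above falls outside this diff hunk. Putting the pieces together from the surrounding prose, a complete version might look like the sketch below; the `attn_config['attn_impl']` key and the `torch_dtype=torch.bfloat16` argument are assumptions inferred from that prose, not lines shown in this excerpt.

```python
import torch
import transformers

# Load the config first so the attention implementation can be switched to triton FlashAttention.
config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'  # assumption: config key for the attention implementation

# Load the weights in bfloat16 and move the model to a GPU.
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b',
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.to(device='cuda:0')
```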
 
+ Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:

 ```python
 config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
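# The rest of this example falls outside the diff hunk shown here. As a hedged
# sketch (the `max_seq_len` config key is an assumption inferred from the prose,
# not a line confirmed by this excerpt), the maximum sequence length could be
# raised via the config before loading the model:
config.update({'max_seq_len': 4096})
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', config=config, trust_remote_code=True)
```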
 
 (2) It applies consistent space delimitation, unlike the GPT2 tokenizer, which tokenizes inconsistently depending on the presence of prefix spaces.
 (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.
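
To make property (3) concrete, the hedged sketch below counts tokens for a snippet containing a long run of spaces; loading the tokenizer directly from the `mosaicml/mpt-7b` checkpoint is an assumption, since the tokenizer name is not stated in this excerpt.

```python
import transformers

# Assumption: the MPT-7B checkpoint ships its tokenizer; adjust the name if needed.
tokenizer = transformers.AutoTokenizer.from_pretrained('mosaicml/mpt-7b')

# Indented code exercises the dedicated repeated-space tokens described in (3):
# a long run of spaces should map to few tokens, compressing source code well.
snippet = "def f(x):\n        return x + 1"
tokens = tokenizer.tokenize(snippet)
print(len(tokens), tokens)
```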
 
 The model vocabulary size of 50432 was set to be a multiple of 128 (as in [MEGATRON-LM](https://arxiv.org/abs/1909.08053)), which increased model flop utilization (MFU) by up to four percentage points.

 ### Training Configuration

 MPT-7B (Base) is **not** intended for deployment without finetuning.
 It should not be used for human-facing interactions without further guardrails and user consent.
+
 MPT-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information.
 MPT-7B was trained on various public datasets detailed below, including [C4](https://huggingface.co/datasets/c4), the colossal, cleaned version of Common Crawl's web crawl corpus.
 While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.