abhi-mosaic committed on
Commit 1975e8d
1 Parent(s): b72c1cd

update README

Files changed (1):
  1. README.md +37 -33
README.md CHANGED
@@ -19,12 +19,12 @@ inference: false
MPT-7B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code.
This model was trained by [MosaicML](https://www.mosaicml.com).

- MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.
+ MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.

- These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing
- positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
- Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence.
- MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).
+ These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing
+ positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
+ Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence.
+ MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).

This model uses the MosaicML LLM codebase, which can be found in the [llm-foundry repository](https://github.com/mosaicml/llm-foundry). It was trained by MosaicML’s NLP team on the [MosaicML platform](https://www.mosaicml.com/training) for LLM pretraining, finetuning, and inference.

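The ALiBi change described in the hunk above replaces learned positional embeddings with a distance-dependent bias added to the attention logits. As a rough illustration only (not MPT's actual implementation; the head count, slope formula, and shapes here are assumptions), the bias can be computed like this:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Geometric per-head slopes from the ALiBi paper (assumes n_heads is a power of 2).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = j - i, i.e. how far key j lies behind query i (<= 0 for past keys).
    positions = torch.arange(seq_len)
    distance = (positions.view(1, -1) - positions.view(-1, 1)).clamp(max=0)
    # Shape (n_heads, seq_len, seq_len); added to attention scores before softmax,
    # so no positional embedding (and no fixed context-length limit) is needed.
    return slopes.view(n_heads, 1, 1) * distance.view(1, seq_len, seq_len)

bias = alibi_bias(n_heads=4, seq_len=6)
print(bias[0])  # bias grows more negative the further a key is from the query
```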
@@ -49,7 +49,7 @@ We demonstrate generations as long as 80k tokens on a single A100-80GB GPU in ou
* License: Apache 2.0

* [MPT-7B-Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct): a model for short-form instruction following.
- Built by finetuning MPT-7B on a [dataset](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) we also release, derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
+ Built by finetuning MPT-7B on a [dataset](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) we also release, derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
* License: _CC-By-SA-3.0_
* [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)

@@ -85,37 +85,41 @@ model = transformers.AutoModelForCausalLM.from_pretrained(
  trust_remote_code=True
)
```
- Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
+ Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
`MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.

- To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model with `attn_impl='triton'` and move the model to `bfloat16`:
+ To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision:
```python
- config = transformers.AutoConfig.from_pretrained(
-   'mosaicml/mpt-7b',
-   trust_remote_code=True
- )
+ import torch
+ import transformers
+
+ name = 'mosaicml/mpt-7b'
+
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
+ config.init_device = 'cuda:0' # For fast initialization directly on GPU!

model = transformers.AutoModelForCausalLM.from_pretrained(
- 'mosaicml/mpt-7b',
+ name,
  config=config,
- torch_dtype=torch.bfloat16,
+ torch_dtype=torch.bfloat16, # Load model weights in bfloat16
  trust_remote_code=True
)
- model.to(device='cuda:0')
```

Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:

```python
- config = transformers.AutoConfig.from_pretrained(
-   'mosaicml/mpt-7b',
-   trust_remote_code=True
- )
- config.update({"max_seq_len": 4096})
+ import transformers
+
+ name = 'mosaicml/mpt-7b'
+
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
+ config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
+
model = transformers.AutoModelForCausalLM.from_pretrained(
- 'mosaicml/mpt-7b',
+ name,
  config=config,
  trust_remote_code=True
)
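Once the model has been loaded as in either snippet above, a quick generation call is a convenient sanity check. This is a hypothetical usage sketch (the prompt and sampling settings are illustrative, and it assumes the model sits on `cuda:0` and uses the gpt-neox-20b tokenizer shown further down in the README):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

# `model` is assumed to be the MPT-7B instance loaded above (e.g. with attn_impl='triton').
prompt = 'MosaicML is'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda:0')

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```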
@@ -125,7 +129,7 @@ This model was trained with the [EleutherAI/gpt-neox-20b](https://huggingface.co

```python
from transformers import AutoTokenizer
- tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
+ tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
```

## Model Description
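For completeness, a short round-trip shows how the tokenizer line above is typically exercised; the example string is arbitrary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

text = 'def hello():\n    print("hello")  # indented code'
ids = tokenizer(text)['input_ids']
print(len(ids), tokenizer.convert_ids_to_tokens(ids))  # inspect the BPE pieces
print(tokenizer.decode(ids))  # decode back for comparison with the input
```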
@@ -153,7 +157,7 @@ The model has been modified from a standard transformer in the following ways:

### Streaming Datasets

- Data was formatted using the MosaicML [StreamingDataset](https://github.com/mosaicml/streaming) library to host our data in object storage and efficiently stream it to our compute cluster during training.
+ Data was formatted using the MosaicML [StreamingDataset](https://github.com/mosaicml/streaming) library to host our data in object storage and efficiently stream it to our compute cluster during training.
StreamingDataset obviates the need to download the whole dataset before starting training, and allows instant resumption of training from any point in the dataset.

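As a rough sketch of the pattern described above (the S3 path, cache directory, and batch size are placeholders, and this assumes the `mosaicml-streaming` package rather than MPT's exact training code):

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Stream pre-sharded data from object storage, caching shards locally as needed.
dataset = StreamingDataset(
    remote='s3://my-bucket/my-copy-of-c4/train',  # placeholder remote path
    local='/tmp/streaming_cache',                 # local shard cache
    shuffle=True,
    batch_size=8,
)

loader = DataLoader(dataset, batch_size=8)
for sample in loader:
    ...  # training step; resumption state is tracked by the dataset itself
    break
```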
@@ -178,24 +182,24 @@ The model was trained for 1T tokens (with batch size 1760 and sequence length 20
Samples for each batch were selected from one of the datasets with the probability specified above.
The examples were shuffled within each dataset, and each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.

- The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer. This BPE tokenizer has a number of desirable characteristics,
- most of which are relevant for tokenizing code:
- (1) It was trained on a diverse mix of data that includes code (The Pile)
- (2) It applies consistent space delimitation, unlike the GPT2 tokenizer which tokenizes inconsistently depending on the presence of prefix spaces
- (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.
+ The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer. This BPE tokenizer has a number of desirable characteristics,
+ most of which are relevant for tokenizing code:
+ (1) It was trained on a diverse mix of data that includes code (The Pile)
+ (2) It applies consistent space delimitation, unlike the GPT2 tokenizer which tokenizes inconsistently depending on the presence of prefix spaces
+ (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.

The model vocabulary size of 50432 was set to be a multiple of 128 (as in [MEGATRON-LM](https://arxiv.org/abs/1909.08053)); this increased model flop utilization (MFU) by up to four percentage points.

### Training Configuration

- This model was trained on 440 A100-40GBs for about 9.5 days using the [MosaicML Platform](https://www.mosaicml.com/platform).
- The model was trained with sharded data parallelism using [FSDP](https://pytorch.org/docs/stable/fsdp.html) and used the [LION](https://arxiv.org/abs/2302.06675) optimizer.
+ This model was trained on 440 A100-40GBs for about 9.5 days using the [MosaicML Platform](https://www.mosaicml.com/platform).
+ The model was trained with sharded data parallelism using [FSDP](https://pytorch.org/docs/stable/fsdp.html) and used the [LION](https://arxiv.org/abs/2302.06675) optimizer.

## Limitations and Biases

_The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_

- MPT-7B (Base) is **not** intended for deployment without finetuning.
+ MPT-7B (Base) is **not** intended for deployment without finetuning.
It should not be used for human-facing interactions without further guardrails and user consent.

MPT-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information.
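The vocabulary-size remark above is just a rounding rule; a minimal check (the helper name is ours, not from the README):

```python
def pad_to_multiple(n: int, multiple: int = 128) -> int:
    # Round up to the nearest multiple, as in MEGATRON-LM's padded-vocab trick.
    return -(-n // multiple) * multiple

# MPT-7B's vocabulary size is already a multiple of 128: 50432 = 394 * 128.
assert pad_to_multiple(50432) == 50432 == 394 * 128
```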
@@ -218,11 +222,11 @@ Please cite this model using the following format:
```
@online{MosaicML2023Introducing,
    author = {MosaicML NLP Team},
-    title = {Introducing MPT-7B: A New Standard for Open-Source,
+    title = {Introducing MPT-7B: A New Standard for Open-Source,
    Commercially Usable LLMs},
    year = {2023},
    url = {www.mosaicml.com/blog/mpt-7b},
    note = {Accessed: 2023-03-28}, % change this date
    urldate = {2023-03-28} % change this date
}
- ```
+ ```