Text Generation
Transformers
PyTorch
mosaic_gpt
custom_code
abhi-mosaic committed on
Commit 474a917
1 Parent(s): ee5674f

Update README.md

Files changed (1)
  1. README.md +11 -10
README.md CHANGED
@@ -4,29 +4,30 @@ datasets:
  - togethercomputer/RedPajama-Data-1T
  ---

- # Mosaic-1b-RedPajama-200b
+ # MPT-1b-RedPajama-200b

- Mosaic-1b-RedPajama-200b is a 1.4 billion parameter decoder-only transformer trained on the [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
+ MPT-1b-RedPajama-200b is a 1.3 billion parameter decoder-only transformer trained on the [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
  The model was trained for 200B tokens by sampling from the subsets of the RedPajama dataset in the same proportions as were used by the [Llama series of models](https://arxiv.org/abs/2302.13971).
  This model was trained by [MosaicML](https://www.mosaicml.com) and follows a modified decoder-only transformer architecture.

  ## Model Date

- April 19, 2023
+ April 20, 2023

  ## How to Use

- Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
- This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code.
+ Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
+ This is because we use a custom model architecture `MosaicGPT` that is not yet part of the `transformers` package.
+ `MosaicGPT` includes options for many training efficiency features such as [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), QK LayerNorm, and more.

  ```python
  import transformers
- model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mosaic-llama-redpajama-final-candidate', trust_remote_code=True)
+ model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True)
  ```

  To use the optimized triton implementation of FlashAttention, you can load with `attn_impl='triton'` and move the model to `bfloat16` like so:
  ```python
- model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mosaic-1b-redpajama-200b', trust_remote_code=True, attn_impl='triton')
+ model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True, attn_impl='triton')
  model.to(device='cuda:0', dtype=torch.bfloat16)
  ```

@@ -36,8 +37,8 @@ This model uses the MosaicML LLM codebase, which can be found in the [MosaicML E
  The architecture is a modification of a standard decoder-only transformer.
  The transformer has 24 layers, 16 attention heads, and width 2048.
  The model has been modified from a standard transformer in the following ways:
- * It uses FlashAttention.
- * It uses ALiBi position encodings.
+ * It uses ALiBi and does not use positional embeddings.
+ * It uses QK LayerNorm.
  * It does not use biases.

  ## Training Data
@@ -61,7 +62,7 @@ The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.c

  ## Training Configuration

- This model was trained on 440 A100-40GBs for about half a day using the [MosaicML Platform](https://www.mosaicml.com/platform). The model was trained in a data parallel manner using FSDP.
+ This model was trained on 440 A100-40GBs for about half a day using the [MosaicML Platform](https://www.mosaicml.com/platform). The model was trained with sharded data parallelism using FSDP.

  ## Acknowledgements

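
The README's loading snippets stop at `from_pretrained`. As a quick end-to-end illustration, here is a minimal generation sketch; it is not part of the commit. The tokenizer choice (EleutherAI/gpt-neox-20b, the tokenizer the README says the training data used) and the prompt are assumptions, and note that the README's triton snippet additionally relies on `import torch` for `torch.bfloat16`.

```python
import torch
import transformers

# Load the model as shown in the README; trust_remote_code=True is required
# because the MosaicGPT architecture ships with the repo, not with transformers.
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True
)

# Assumption: use the EleutherAI/gpt-neox-20b tokenizer mentioned in the README.
tokenizer = transformers.AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

inputs = tokenizer('The RedPajama dataset is', return_tensors='pt')
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```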
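
The stated dimensions (24 layers, 16 attention heads, width 2048) line up with the new "1.3 billion parameter" figure. A rough sanity check, assuming a standard 4x MLP expansion and a vocabulary of roughly 50k tokens (both assumptions; neither is stated in the diff):

```python
# Rough parameter estimate for a bias-free decoder-only transformer with the
# dimensions stated in the README: 24 layers, width 2048.
d_model = 2048
n_layers = 24
vocab_size = 50_000   # assumption: approximate gpt-neox-20b vocabulary size
mlp_ratio = 4         # assumption: standard 4x feed-forward expansion

attn_params_per_layer = 4 * d_model * d_model              # Wq, Wk, Wv, Wo
mlp_params_per_layer = 2 * mlp_ratio * d_model * d_model   # up- and down-projection
embedding_params = vocab_size * d_model                    # no positional embeddings (ALiBi)

total = n_layers * (attn_params_per_layer + mlp_params_per_layer) + embedding_params
print(f"~{total / 1e9:.2f}B parameters")  # ~1.31B, consistent with the README's 1.3B claim
```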
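
The training-configuration line mentions sharded data parallelism with FSDP, but no training code is part of this commit. The following is only a generic PyTorch FSDP sketch of what that sharding looks like, not MosaicML's actual training stack; the optimizer and learning rate are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
import transformers

# Each rank loads the model, then FSDP shards parameters, gradients, and
# optimizer state across ranks (sharded data parallelism).
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True
)
model = FSDP(model, device_id=local_rank)

# Placeholder optimizer; the README does not specify training hyperparameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```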