Text Generation
Transformers
PyTorch
mosaic_gpt
custom_code
abhi-mosaic committed on
Commit 474a917
1 Parent(s): ee5674f

Update README.md

Files changed (1)
  1. README.md +11 -10
README.md CHANGED
@@ -4,29 +4,30 @@ datasets:
  - togethercomputer/RedPajama-Data-1T
  ---

- # Mosaic-1b-RedPajama-200b
+ # MPT-1b-RedPajama-200b

- Mosaic-1b-RedPajama-200b is a 1.4 billion parameter decoder-only transformer trained on the [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
+ MPT-1b-RedPajama-200b is a 1.3 billion parameter decoder-only transformer trained on the [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
  The model was trained for 200B tokens by sampling from the subsets of the RedPajama dataset in the same proportions as were used by the [Llama series of models](https://arxiv.org/abs/2302.13971).
  This model was trained by [MosaicML](https://www.mosaicml.com) and follows a modified decoder-only transformer architecture.

  ## Model Date

- April 19, 2023
+ April 20, 2023

  ## How to Use

- Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
- This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code.
+ Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
+ This is because we use a custom model architecture `MosaicGPT` that is not yet part of the `transformers` package.
+ `MosaicGPT` includes options for many training efficiency features such as [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), QK LayerNorm, and more.

  ```python
  import transformers
- model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mosaic-llama-redpajama-final-candidate', trust_remote_code=True)
+ model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True)
  ```

  To use the optimized triton implementation of FlashAttention, you can load with `attn_impl='triton'` and move the model to `bfloat16` like so:
  ```python
- model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mosaic-1b-redpajama-200b', trust_remote_code=True, attn_impl='triton')
+ model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True, attn_impl='triton')
  model.to(device='cuda:0', dtype=torch.bfloat16)
  ```

@@ -36,8 +37,8 @@ This model uses the MosaicML LLM codebase, which can be found in the [MosaicML E
  The architecture is a modification of a standard decoder-only transformer.
  The transformer has 24 layers, 16 attention heads, and width 2048.
  The model has been modified from a standard transformer in the following ways:
- * It uses FlashAttention.
- * It uses ALiBi position encodings.
+ * It uses ALiBi and does not use positional embeddings.
+ * It uses QK LayerNorm.
  * It does not use biases.

  ## Training Data
@@ -61,7 +62,7 @@ The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.c

  ## Training Configuration

- This model was trained on 440 A100-40GBs for about half a day using the [MosaicML Platform](https://www.mosaicml.com/platform). The model was trained in a data parallel manner using FSDP.
+ This model was trained on 440 A100-40GBs for about half a day using the [MosaicML Platform](https://www.mosaicml.com/platform). The model was trained with sharded data parallelism using FSDP.

  ## Acknowledgements

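
The README's loading snippets stop at `from_pretrained`. As a quick end-to-end illustration, here is a minimal generation sketch; it is not part of the commit. The tokenizer choice (EleutherAI/gpt-neox-20b, the tokenizer the README says the training data used) and the prompt are assumptions, and note that the README's triton snippet additionally relies on `import torch` for `torch.bfloat16`.

```python
import torch
import transformers

# Load the model as shown in the README; trust_remote_code=True is required
# because the MosaicGPT architecture ships with the repo, not with transformers.
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True
)

# Assumption: use the EleutherAI/gpt-neox-20b tokenizer mentioned in the README.
tokenizer = transformers.AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

inputs = tokenizer('The RedPajama dataset is', return_tensors='pt')
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```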
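
The stated dimensions (24 layers, 16 attention heads, width 2048) line up with the new "1.3 billion parameter" figure. A rough sanity check, assuming a standard 4x MLP expansion and a vocabulary of roughly 50k tokens (both assumptions; neither is stated in the diff):

```python
# Rough parameter estimate for a bias-free decoder-only transformer with the
# dimensions stated in the README: 24 layers, width 2048.
d_model = 2048
n_layers = 24
vocab_size = 50_000   # assumption: approximate gpt-neox-20b vocabulary size
mlp_ratio = 4         # assumption: standard 4x feed-forward expansion

attn_params_per_layer = 4 * d_model * d_model              # Wq, Wk, Wv, Wo
mlp_params_per_layer = 2 * mlp_ratio * d_model * d_model   # up- and down-projection
embedding_params = vocab_size * d_model                    # no positional embeddings (ALiBi)

total = n_layers * (attn_params_per_layer + mlp_params_per_layer) + embedding_params
print(f"~{total / 1e9:.2f}B parameters")  # ~1.31B, consistent with the README's 1.3B claim
```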
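
The training-configuration line mentions sharded data parallelism with FSDP, but no training code is part of this commit. The following is only a generic PyTorch FSDP sketch of what that sharding looks like, not MosaicML's actual training stack; the optimizer and learning rate are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
import transformers

# Each rank loads the model, then FSDP shards parameters, gradients, and
# optimizer state across ranks (sharded data parallelism).
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True
)
model = FSDP(model, device_id=local_rank)

# Placeholder optimizer; the README does not specify training hyperparameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```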