Text Generation
Transformers
PyTorch
mosaic_gpt
custom_code
jfrankle committed
Commit e4e26cc
1 Parent(s): bf8aeda

Update README.md

Files changed (1)
  1. README.md +18 -8
README.md CHANGED
@@ -1,15 +1,17 @@
 ---
-license: apache-2.0
+license: cc-by-sa-3.0
 datasets:
 - togethercomputer/RedPajama-Data-1T
 ---
 
 # MPT-1b-RedPajama-200b
 
-MPT-1b-RedPajama-200b is a 1.3 billion parameter decoder-only transformer trained on the [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
-The model was trained for 200B tokens by sampling from the subsets of the RedPajama dataset in the same proportions as were used by the [Llama series of models](https://arxiv.org/abs/2302.13971).
+MPT-1b-RedPajama-200b is a 1.3 billion parameter decoder-only transformer pre-trained on the [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and subsequently fine-tuned on the [Databricks Dolly](https://github.com/databrickslabs/dolly/tree/master/data) instruction dataset.
+The model was pre-trained for 200B tokens by sampling from the subsets of the RedPajama dataset in the same proportions as were used by the [Llama series of models](https://arxiv.org/abs/2302.13971).
 This model was trained by [MosaicML](https://www.mosaicml.com) and follows a modified decoder-only transformer architecture.
 
+This model is an instruction fine-tuned version of [mpt-1b-redpajama-200b](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b). In other words, the pre-trained version of this model is [mpt-1b-redpajama-200b](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b).
+
 ## Model Date
 
 April 20, 2023
@@ -22,12 +24,12 @@ This is because we use a custom model architecture `MosaicGPT` that is not yet p
 
 ```python
 import transformers
-model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True)```
+model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-1b-redpajama-200b-dolly', trust_remote_code=True)
 ```
 
 To use the optimized triton implementation of FlashAttention, you can load with `attn_impl='triton'` and move the model to `bfloat16` like so:
 ```python
-model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-1b-redpajama-200b', trust_remote_code=True, attn_impl='triton')
+model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-1b-redpajama-200b-dolly', trust_remote_code=True, attn_impl='triton')
 model.to(device='cuda:0', dtype=torch.bfloat16)
 ```
 
@@ -43,7 +45,9 @@ The model has been modified from a standard transformer in the following ways:
 
 ## Training Data
 
-The model was trained for 200B tokens (batch size 2200, sequence length 2048). It was trained on the following data mix:
+### Pre-Training
+
+The model was pre-trained for 200B tokens (batch size 2200, sequence length 2048). It was trained on the following data mix:
 * 67% RedPajama Common Crawl
 * 15% [C4](https://huggingface.co/datasets/c4)
 * 4.5% RedPajama GitHub
@@ -60,13 +64,19 @@ Each example was constructed from as many sequences from that dataset as were ne
 
 The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.
 
+### Fine-Tuning
+
+TODO
+
 ## Training Configuration
 
-This model was trained on 440 A100-40GBs for about half a day using the [MosaicML Platform](https://www.mosaicml.com/platform). The model was trained with sharded data parallelism using FSDP.
+This model was pre-trained on 440 A100-40GBs for about half a day using the [MosaicML Platform](https://www.mosaicml.com/platform). The model was pre-trained with sharded data parallelism using FSDP.
 
 ## Acknowledgements
 
 This model builds on the work of [Together](https://www.together.xyz), which created the RedPajama dataset with the goal of mimicking the training data used to create the Llama series of models.
 We gratefully acknowledge the hard work of the team that put together this dataset, and we hope this model serves as a useful companion to that work.
 
-We also gratefully acknowledge the work of the researchers who created the Llama series of models, which was the impetus for our efforts and those who worked on the RedPajama project.
+This model also builds on the work of [Databricks](https://www.databricks.com/), which created the Dolly instruction fine-tuning dataset.
+
+We also gratefully acknowledge the work of the researchers who created the Llama series of models, which was the impetus for our efforts and those who worked on the RedPajama project.
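
For reference, the two loading snippets in the diff above can be combined into one self-contained example. This is a minimal sketch rather than part of the card: the `mosaicml/mpt-1b-redpajama-200b-dolly` repo name and the `EleutherAI/gpt-neox-20b` tokenizer come from the card, while the prompt, sampling settings, and use of `generate` on the custom `MosaicGPT` class are illustrative assumptions.

```python
import torch
import transformers

# Load the instruction-tuned checkpoint. trust_remote_code=True is required because
# the custom MosaicGPT architecture is not part of the transformers package.
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-1b-redpajama-200b-dolly',
    trust_remote_code=True,
)
model.to(device='cuda:0', dtype=torch.bfloat16)

# The card states the training data was tokenized with the EleutherAI/gpt-neox-20b tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

# Illustrative prompt and sampling settings (not taken from the card).
prompt = 'Write a short note thanking a colleague for their help:\n'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda:0')
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```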
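The Training Configuration hunk says pre-training used sharded data parallelism via FSDP. The sketch below only illustrates what an FSDP training step looks like in PyTorch; the module, optimizer, and data are stand-ins, and none of the actual MosaicML setup for the 440 A100-40GB run is reproduced here.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main() -> None:
    # One process per GPU; launch with: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(int(os.environ['LOCAL_RANK']))

    # Stand-in module; the real run sharded the MosaicGPT model instead.
    model = nn.Sequential(nn.Linear(2048, 2048), nn.GELU(), nn.Linear(2048, 2048)).cuda()
    model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks

    # Optimizer must be created after wrapping so it sees the sharded parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    batch = torch.randn(8, 2048, device='cuda')
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    main()
```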