---
license: apache-2.0
tags:
- Composer
- MosaicML
- llm-foundry
---

# MPT-7B-StoryWriter-65k+

MPT-7B-StoryWriter-65k+ is a model designed to read and write fictional stories with super long context lengths.
It was built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the [books3 dataset](https://huggingface.co/datasets/the_pile_books3).
At inference time, thanks to [ALiBi](https://arxiv.org/abs/2108.12409), MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens.
We demonstrate generations as long as 84k tokens on a single A100-80GB GPU in our [blog post](https://www.mosaicml.com/blog/mpt-7b).
* License: _Apache-2.0_ (commercial use permitted)

This model was trained by [MosaicML](https://www.mosaicml.com) and follows a modified decoder-only transformer architecture.

## Model Date

May 5, 2023

## Model License

Apache-2.0 (commercial use permitted)

## Documentation

* [Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
* [Codebase (mosaicml/llm-foundry repo)](https://github.com/mosaicml/llm-foundry/)
* Questions: Feel free to contact us via the [MosaicML Community Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)!

## How to Use

Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we use a custom model architecture that is not yet part of the `transformers` package.

The MPT architecture includes options for many training-efficiency features such as [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), QK LayerNorm, and more.

```python
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-storywriter', trust_remote_code=True, torch_dtype=torch.bfloat16)
```

To use the optimized triton implementation of FlashAttention, you can set `attn_impl='triton'` in the model config and move the model to `bfloat16` like so:

```python
config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b-storywriter', trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-storywriter', config=config, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.to(device='cuda:0')
```

Although the model was trained with a sequence length of 2048 and finetuned with a sequence length of 65536, ALiBi enables users to increase the maximum sequence length even further during finetuning and/or inference. For example:

```python
config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-7b-storywriter', trust_remote_code=True)
config.update({"max_seq_len": 83968})  # (input + output) tokens can now be up to 83968
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-storywriter', config=config, trust_remote_code=True)
```

This model was trained with the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```
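
With the model and tokenizer loaded as above, text can be generated through the standard Hugging Face `generate` API. The snippet below is a minimal illustrative sketch; the prompt and sampling settings are placeholders, not recommendations from this card:

```python
import torch

# Illustrative prompt; not taken from the original model card.
prompt = "Once upon a time, in a quiet village by the sea,"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
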
## Model Description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways:
* It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf)
* It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings (see the sketch below)
* It does not use biases

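As a rough illustration of how ALiBi replaces positional embeddings, here is a minimal sketch of the linear attention biases it adds, following the formulation in the ALiBi paper (an illustrative sketch, not code taken from llm-foundry):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Illustrative ALiBi bias of shape (n_heads, seq_len, seq_len).

    Head h gets slope m_h = 2 ** (-8 * h / n_heads); the bias added to the
    attention logits is -m_h * |i - j| for query position i and key position j.
    """
    slopes = torch.tensor([2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)])
    positions = torch.arange(seq_len)
    distances = (positions[None, :] - positions[:, None]).abs()  # (seq_len, seq_len)
    return -slopes[:, None, None] * distances[None, :, :]

# Because the bias depends only on relative distance and has no learned
# parameters, the maximum sequence length can be raised at inference time,
# which is what enables extrapolation beyond the 65k finetuning context.
bias = alibi_bias(n_heads=32, seq_len=8)
```
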
| Hyperparameter | Value |
|----------------|-------|
| n_parameters | 6.7B |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 50432 |
| sequence length | **65536** |

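As a back-of-the-envelope check on the table, the 6.7B figure can be roughly reproduced from `n_layers`, `d_model`, and the vocabulary size, assuming a standard 4x MLP expansion, no biases, tied input/output embeddings, and ignoring norm parameters (these assumptions are ours, for illustration only):

```python
n_layers, d_model, vocab_size = 32, 4096, 50432

# Per transformer block (no biases): ~4*d^2 for the attention projections
# plus ~8*d^2 for a 4x-expansion MLP, i.e. ~12*d^2 parameters.
block_params = 12 * d_model ** 2
embedding_params = vocab_size * d_model  # tied with the output projection

total = n_layers * block_params + embedding_params
print(f"{total / 1e9:.2f}B parameters")  # ~6.65B, consistent with the 6.7B above
```
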
## PreTraining Data

For more details on the pretraining process, see [MPT-7B](https://huggingface.co/mosaicml/mpt-7b).

The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

## Limitations and Biases

_The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_

MPT-7B-StoryWriter-65k+ can produce factually incorrect output, and should not be relied on to produce factually accurate information.
MPT-7B-StoryWriter-65k+ was trained on various public datasets.
While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

## Acknowledgements

This model was finetuned by Alex Trott and the MosaicML NLP team.

## Citation

Please cite this model using the following format:

```
@online{MosaicML2023Introducing,
    author    = {MosaicML NLP Team},
    title     = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
    year      = {2023},
    url       = {https://www.mosaicml.com/blog/mpt-7b},
    note      = {Accessed: 2023-03-28}, % change this date
    urldate   = {2023-03-28} % change this date
}
```