KnutJaegersberg committed on
Commit
084fe05
1 Parent(s): 79f73c7

Update README.md

  1. README.md +239 -0
README.md CHANGED
@@ -1,3 +1,242 @@
1
  ---
2
+ language:
3
+ - en
4
+ inference: false
5
+ tags:
6
+ - pytorch
7
+ - causal-lm
8
+ - Cerebras
9
+ - BTLM
10
+ datasets:
11
+ - cerebras/SlimPajama-627B
12
+ pipeline_tag: text-generation
13
  license: apache-2.0
14
  ---
15
+
16
+ # BTLM-3B-8k-base
17
+
18
+ [Bittensor Language Model (BTLM-3B-8k-base)](https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/) is a 3 billion parameter language model with an 8k context length trained on 627B tokens of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B). BTLM-3B-8k-base sets a new standard for 3B parameter models, outperforming models trained on hundreds of billions more tokens and achieving comparable performance to open 7B parameter models. BTLM-3B-8k-base can also be quantized to 4-bit to fit in devices with as little as 3GB of memory. The model is made available with an Apache 2.0 license for commercial use.
19
+
20
+ BTLM was trained by [Cerebras](https://www.cerebras.net/) in partnership with [Opentensor](https://opentensor.ai/) on the newly unveiled [Condor Galaxy 1 (CG-1) supercomputer](https://www.cerebras.net/blog/introducing-condor-galaxy-1-a-4-exaflop-supercomputer-for-generative-ai/), the first public deliverable of the G42-Cerebras strategic partnership.
21
+
22
+ BTLM-3B-8k was trained with a similar architecture to [CerebrasGPT](https://arxiv.org/abs/2304.03208) with the addition of [SwiGLU](https://arxiv.org/abs/2002.05202) nonlinearity, [ALiBi](https://arxiv.org/abs/2108.12409) position embeddings, and [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466). The model was trained for 1 epoch of SlimPajama-627B. The first 75% of training was performed with a 2k sequence length. The final 25% of training was performed at an 8k sequence length to enable long sequence applications.
23
+
24
+ Read [our paper](https://arxiv.org/abs/2309.11568) for more details!
25
+
26
+ ## BTLM-3B-8k Highlights
27
+
28
+ BTLM-3B-8k-base:
29
+ - **Licensed for commercial use** (Apache 2.0).
30
+ - **[State of the art 3B parameter model](#performance-vs-3b-models)**.
31
+ - **Provides 7B model performance in a 3B model** via performance enhancements from [ALiBi](https://arxiv.org/abs/2108.12409), [SwiGLU](https://arxiv.org/abs/2002.05202), [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466) and the extensively deduplicated and cleaned [SlimPajama-627B dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
32
+ - **[Fits in devices with as little as 3GB of memory](#memory-requirements) when quantized to 4-bit**.
33
+ - **One of few 3B models that supports 8k sequence length** thanks to ALiBi.
34
+ - **Requires 71% fewer training FLOPs and has a 58% smaller memory footprint** for inference than comparable 7B models.
35
+
36
+ ## Usage
37
+ *Note: Transformers does not support muP for all models, so BTLM-3B-8k-base requires a custom model class. As a result, users must either (1) pass `trust_remote_code=True` when loading the model or (2) acknowledge the warning about code execution when the model loads.*
38
+
39
+ #### With generate():
40
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load the tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained("cerebras/btlm-3b-8k-base")
+ model = AutoModelForCausalLM.from_pretrained("cerebras/btlm-3b-8k-base", trust_remote_code=True, torch_dtype="auto")
+
+ # Set the prompt for generating text
+ prompt = "Albert Einstein was known for "
+
+ # Tokenize the prompt and convert to PyTorch tensors
+ inputs = tokenizer(prompt, return_tensors="pt")
+
+ # Generate text using beam search
+ outputs = model.generate(
+     **inputs,
+     num_beams=5,
+     max_new_tokens=50,
+     early_stopping=True,
+     no_repeat_ngram_size=2,
+ )
+
+ # Convert the generated token IDs back to text
+ generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+
+ # Print the generated text
+ print(generated_text[0])
+ ```
68
+
69
+ #### With pipeline:
70
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from transformers import pipeline
+
+ # Load the tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained("cerebras/btlm-3b-8k-base")
+ model = AutoModelForCausalLM.from_pretrained("cerebras/btlm-3b-8k-base", trust_remote_code=True, torch_dtype="auto")
+
+ # Set the prompt for text generation
+ prompt = """Isaac Newton was a """
+
+ # Create a text generation pipeline
+ pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
+
+ # Generate text using the pipeline (greedy decoding, prompt plus output capped at 50 tokens)
+ generated_text = pipe(
+     prompt,
+     max_length=50,
+     do_sample=False,
+     no_repeat_ngram_size=2,
+ )[0]
+
+ # Print the generated text
+ print(generated_text['generated_text'])
+ ```
94
+
95
+ ## Evaluations and Comparisons to Other Models
96
+
97
+ ### Memory Requirements
98
+ ![figure_1_image](./figure_1_memory_footprint.png)
99
+ Figure 1: Memory requirements of different model sizes and quantization schemes.
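As a rough back-of-the-envelope check on the 4-bit claim, weight storage scales with parameter count times bits per parameter. The sketch below assumes a nominal 3 billion parameters and counts weights only; activations, the KV cache, and quantization metadata such as scales are ignored, so real footprints will be higher. `weight_memory_gb` is a hypothetical helper written for illustration, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: the nominal "3B" parameter count, not an exact figure.
NOMINAL_PARAMS = 3e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take roughly half of a 3GB budget, which leaves headroom for runtime buffers.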
100
+
101
+ ### Quality, Training Cost, Memory Footprint, Inference Speed
102
+ ![figure_2_image](./figure_2_half_the_size_twice_the_speed.png)
103
+ Figure 2: Comparisons of quality, memory footprint & inference cost between BTLM-3B-8K and 7B model families.
104
+
105
+ ### Performance vs 3B models
106
+ ![table_1_image](./table_1_downstream_performance_3b.png)
107
+ Table 1: Performance at 3B model size. Detailed downstream task comparisons. MMLU task performance is reported using 5-shot; all other tasks are 0-shot.
108
+
109
+ ![figure_3_image](./figure_3_performance_vs_3b_models.png)
110
+ Figure 3: Performance at 3B model size
111
+
112
+ ### Performance vs 7B models
113
+ ![table_2_image](./table_2_downstream_performance_7b.png)
114
+ Table 2: Performance at 7B model size. Detailed downstream task comparisons. MMLU task performance is reported using 5-shot; all other tasks are 0-shot.
115
+
116
+ ![figure_4_image](./figure_4_performance_vs_7b_models.jpg)
117
+ Figure 4: Performance at 7B model size
118
+
119
+ ## Long Sequence Lengths
120
+ To enable long sequence applications, we used ALiBi position embeddings and trained on 470B tokens at a context length of 2,048, followed by 157B tokens at a context length of 8,192. To assess BTLM’s long sequence capability, we evaluated it on the SlimPajama test set with a context length of 32,768 and plotted the loss at each token position. Although ALiBi allows extrapolation in theory, training at a context length of 2,048 alone does not extrapolate well in practice. Variable sequence length training substantially improves extrapolation: BTLM-3B extrapolates well up to 10k context length, but performance degrades slightly beyond this.
121
+
122
+ ![figure_5_image](./figure_5_xentropy_with_sequence_lengths.svg)
123
+ Figure 5: BTLM-3B model's cross-entropy evaluation on the SlimPajama test set. Inference was performed on the extrapolated sequence length of 32,768 tokens.
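The token counts quoted in this section follow from the 75%/25% split of the 627B-token SlimPajama epoch. A minimal sketch of that arithmetic is shown below; the per-phase sequence counts additionally assume that every training sequence is packed to the full context length, which is an assumption made here for illustration.

```python
TOTAL_TOKENS = 627e9  # one epoch of SlimPajama-627B

# (fraction of training, context length) for the two training phases
PHASES = [(0.75, 2048), (0.25, 8192)]


def phase_tokens(fraction: float, total: float = TOTAL_TOKENS) -> float:
    """Number of tokens consumed by a phase covering `fraction` of training."""
    return fraction * total


for fraction, context_len in PHASES:
    tokens = phase_tokens(fraction)
    # Assumption: sequences are fully packed, so count = tokens / context length.
    sequences = tokens / context_len
    print(f"context {context_len}: {tokens / 1e9:.2f}B tokens, "
          f"about {sequences / 1e6:.1f}M sequences")
```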
124
+
125
+ ## Model Details
126
+ - Developed by: [Cerebras Systems](https://www.cerebras.net/) and [Opentensor](https://opentensor.ai/) with generous support from [G42 Cloud](https://www.g42cloud.com/) and [IIAI](https://www.inceptioniai.org/en/)
127
+ - License: Apache 2.0
128
+ - Model type: Decoder-only Language Model
129
+ - Architecture: GPT-2 style architecture with SwiGLU, ALiBi, and muP
130
+ - Data set: SlimPajama-627B
131
+ - Tokenizer: Byte Pair Encoding
132
+ - Vocabulary Size: 50257
133
+ - Sequence Length: 8192
134
+ - Optimizer: AdamW
135
+ - Positional Encoding: ALiBi
136
+ - Language: English
137
+ - Learn more: [BTLM-3B-8k blog](https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/)
138
+ - Paper: [BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model](https://arxiv.org/abs/2309.11568)
139
+
140
+ ## To continue training with PyTorch and Maximal Update Parameterization
141
+
142
+ ```python
+ from transformers import AutoModelForCausalLM
+ import torch
+
+ # Load the model with its custom (muP-aware) model class
+ model = AutoModelForCausalLM.from_pretrained("cerebras/btlm-3b-8k-base", trust_remote_code=True)
+
+ # Get the parameter groups for the muP optimizer
+ param_groups = model.get_mup_param_groups(lr=1e-3, weight_decay=0.1)
+
+ # Set up the optimizer using AdamW with muP parameters
+ optimizer = torch.optim.AdamW(
+     param_groups,
+     betas=(0.9, 0.95),
+     eps=1e-8,
+ )
+ ```
158
+
159
+ Ensure the following muP parameters are passed in your config; otherwise, your model will default to standard parameterization:
160
+ - `mup_width_scale: <float>`
161
+ - `mup_embeddings_scale: <float>`
162
+ - `mup_output_alpha: <float>`
163
+ - `mup_scale_qk_dot_by_d: true`
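Because a missing key silently falls back to standard parameterization, it can help to check a config before training. The sketch below is one way to do that on a plain dictionary; `missing_mup_keys` is a hypothetical helper, and the values in the example config are placeholders, not the model's real settings.

```python
from typing import List

# The four muP keys listed above.
REQUIRED_MUP_KEYS = (
    "mup_width_scale",
    "mup_embeddings_scale",
    "mup_output_alpha",
    "mup_scale_qk_dot_by_d",
)


def missing_mup_keys(config: dict) -> List[str]:
    """Return the required muP keys that are absent from `config`."""
    return [key for key in REQUIRED_MUP_KEYS if key not in config]


# Placeholder values for illustration only.
example_config = {"mup_width_scale": 0.1, "mup_scale_qk_dot_by_d": True}
print(missing_mup_keys(example_config))
```

An empty list means all four keys are present; it does not validate their values.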
164
+
165
+ ## To extend the context length with Position Interpolation
166
+
167
+ ### During inference (without fine-tuning):
168
+ It's possible to extend the context length to 2x the training context length without degradation in performance using dynamic linear scaling. Dynamic linear scaling adjusts the slopes of ALiBi with a factor of `input_seq_len/train_seq_len` when `input_seq_len` is larger than `train_seq_len`. Check the details in our paper [Position Interpolation Improves ALiBi Extrapolation](https://arxiv.org/abs/2310.13017). To enable dynamic linear scaling, update `config.json` as follows:
169
+ ```json
+ # update `n_positions` with the maximum context length that will be
+ # encountered during inference (e.g. 16384 tokens)
+ "n_positions": 16384,
+
+ # specify `train_seq_len` in the `alibi_scaling` parameter
+ "alibi_scaling": {
+     "type": "linear",
+     "train_seq_len": 8192
+ }
+ ```
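To make the dynamic scaling rule concrete, the sketch below computes standard ALiBi slopes and rescales them when the input is longer than the training length. It assumes the slopes are divided by the factor `input_seq_len/train_seq_len`, as in position interpolation, so that the bias at the longest distance stays within the range seen during training. The function names and the head count of 8 are illustrative assumptions, not taken from the model's code.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Standard ALiBi slopes for a power-of-two head count: 2^(-8*(i+1)/n)."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]


def dynamic_scale(input_seq_len: int, train_seq_len: int) -> float:
    """Scaling factor: input/train length, applied only when the input is longer."""
    return max(input_seq_len / train_seq_len, 1.0)


def scaled_slopes(n_heads: int, input_seq_len: int, train_seq_len: int) -> List[float]:
    """Assumption: slopes are divided by the dynamic scaling factor."""
    scale = dynamic_scale(input_seq_len, train_seq_len)
    return [slope / scale for slope in alibi_slopes(n_heads)]


# Illustrative head count; inputs at or below the training length are unchanged.
print(scaled_slopes(8, 4096, 8192)[:2])
print(scaled_slopes(8, 16384, 8192)[:2])
```

With a 16,384-token input and a training length of 8,192 the factor is 2, so each slope is halved and the largest bias magnitude matches what the model saw at 8,192 tokens.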
180
+
181
+ ### Using fine-tuning + position interpolation:
182
+ Performing fine-tuning with position interpolation can help achieve greater extrapolation lengths. The scaling factor should be fixed to `finetuning_seq_len/train_seq_len`. To enable fixed linear scaling, update `config.json` as follows:
183
+ ```json
+ # update `n_positions` with the fine-tuning context length (e.g. 32768 tokens)
+ "n_positions": 32768,
+
+ # specify the scaling `factor` in `alibi_scaling` parameter
+ "alibi_scaling": {
+     "type": "linear",
+     "factor": 4.0
+ }
+ ```
192
+ ```
193
+
194
+ ## Uses and Limitations
195
+
196
+ ### Intended Use
197
+ The primary intended use is to further research into large language models. BTLM-3B-8k-base can be used as a foundation model for NLP, applications, ethics, and alignment research. We release these models with a fully permissive Apache license for the community to use freely.
198
+
199
+ You may fine-tune and adapt the BTLM-3B-8k-base model via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or third-party libraries. Further safety-related testing and mitigations should be applied before using BTLM-3B-8k-base in production downstream applications.
200
+
201
+ ### Out of Scope Use
202
+ BTLM-3B-8k-base was trained on SlimPajama, which is primarily English, and is not recommended for machine translation tasks. BTLM-3B-8k-base has not been tuned for instruction-following or chat-based use cases.
203
+
204
+ ### Generation Samples
205
+ ```
206
+ Prompt: "What is a manometer?"
207
+ Generation: "A manometric device is an instrument that measures the pressure of a fluid."
208
+ ```
209
+
210
+ ```
211
+ Prompt: "What Robert Oppenheimer is famous for?"
212
+ Generation: “Robert Opperman was a physicist who was the director of the Manhattan Project, which was responsible for the development of nuclear weapons.”
213
+ ```
214
+
215
+ ```
216
+ Prompt: "What was Albert Einstein known for?"
217
+ Generation: “Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics).”
218
+ ```
219
+
220
+ ```
221
+ Prompt: “Who is Isaac Newton?”
222
+ Generation: “Isaac Newton was an English physicist, mathematician, astronomer, alchemist, and theologian. He is widely recognized as one of the most influential scientists of all time and as a key figure in the scientific revolution.”
223
+ ```
224
+
225
+ ```
226
+ Prompt: "Where does the blood get pumped to after it leaves the right ventricle?"
227
+ Generation: “The right atrium receives the deoxygenated blood from the pulmonary artery. The right side of the heart is the only side that receives de-oxygened...”
228
+ ```
229
+
230
+ ```
231
+ Prompt: "Peter lost his wallet after attending a local concert. To help him, Bob gave David 10 dollars. David then passed it on to Peter. Now what is the total amount of money Peter has?"
232
+ Generation: “A. $10”
233
+ ```
234
+
235
+ ## Risk, Bias, Ethical Considerations
236
+ - **Human life:** The outputs from this model may or may not align with human values. The risk needs to be thoroughly investigated before deploying this model in a production environment where it can directly impact human life.
237
+ - **Risks and harms:** There may be distributional bias in the [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) that can manifest in various forms in the downstream model deployment. There are other risks associated with large language models such as amplifying stereotypes, memorizing training data, or revealing private or secure information.
238
+
239
+ ## Acknowledgements
240
+ We are thankful to all Cerebras engineers who made this work possible.
241
+
242
+ We would like to acknowledge the generous support of G42 Cloud and the Inception Institute of Artificial Intelligence for providing compute time on Condor Galaxy 1.