slippylolo committed
Commit cca0510
1 Parent(s): 831952d

Update model architecture

Files changed (1)
  1. README.md +20 -7
README.md CHANGED
@@ -65,7 +65,7 @@ for seq in sequences:
 
 ### Direct Use
 
-Research on large language models, and the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.).
+Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.).
 
 ### Out-of-Scope Use
 
@@ -127,13 +127,16 @@ Falcon-RW-1B was trained on 32 A100 40GB GPUs, using only data parallelism with
 
 Hyperparameters were adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)).
 
-- **Precision:** bf16;
-- **Optimizer:** Adam;
-- **Learning rate:** 2e-4 (500M tokens warm-up, followed by cosine decay to 2e-5);
-- **Weight decay:** 0.1;
-- **Batch size:** 512 (with a 4B tokens ramp-up).
+| **Hyperparameter** | **Value**  | **Comment**                               |
+|--------------------|------------|-------------------------------------------|
+| Precision          | `bfloat16` |                                           |
+| Optimizer          | AdamW      |                                           |
+| Learning rate      | 2e-4       | 500M tokens warm-up, cosine decay to 2e-5 |
+| Weight decay       | 1e-1       |                                           |
+| Batch size         | 512        | 4B tokens ramp-up                         |
 
-#### Speeds, Sizes, Times [optional]
+
+#### Speeds, Sizes, Times
 
 Training happened in early December 2022 and took about six days.
 
@@ -149,6 +152,16 @@ Training happened in early December 2022 and took about six days.
 
 Falcon-RW-1B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
 
+The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)) and FlashAttention ([Dao et al., 2022](https://arxiv.org/abs/2205.14135)).
+
+| **Hyperparameter** | **Value** | **Comment**                            |
+|--------------------|-----------|----------------------------------------|
+| Layers             | 24        |                                        |
+| `d_model`          | 2048      |                                        |
+| `head_dim`         | 64        | Reduced to optimise for FlashAttention |
+| Vocabulary         | 50304     |                                        |
+| Sequence length    | 2048      |                                        |
+
 ### Compute Infrastructure
 
 #### Hardware
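
The training hyperparameter table above maps onto a standard optimizer setup. Below is a minimal PyTorch sketch of AdamW with the stated weight decay and a linear warm-up over roughly 500M tokens followed by cosine decay from 2e-4 to 2e-5; the placeholder model, the tokens-per-step accounting, and the total token budget are illustrative assumptions, not the actual Falcon-RW-1B training code.

```python
# Minimal sketch of the optimizer and LR schedule from the training table.
# The placeholder model, the total token budget, and the step accounting are
# illustrative assumptions, not the real training code.
import math

import torch

model = torch.nn.Linear(2048, 2048)  # stand-in module for the 1B-parameter model

# AdamW with peak learning rate 2e-4 and weight decay 1e-1, as in the table.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-1)

TOKENS_PER_STEP = 512 * 2048                       # batch of 512 sequences x 2048 tokens
WARMUP_STEPS = 500_000_000 // TOKENS_PER_STEP      # ~500M-token warm-up
TOTAL_STEPS = 350_000_000_000 // TOKENS_PER_STEP   # assumed total budget (placeholder)
MIN_LR_RATIO = 2e-5 / 2e-4                         # decay floor of 2e-5 relative to peak

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak LR, then cosine decay down to the 2e-5 floor."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```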
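
The architecture table reads as a GPT-3-style decoder configuration. The sketch below shows how the listed values relate to one another; the variable names, the 4x MLP width, and the tied-embedding assumption are illustrative, not taken from the model card.

```python
# Back-of-the-envelope reading of the architecture table. Variable names and the
# 4x MLP width are assumptions; the count ignores biases, layer norms, and ALiBi
# (which adds no learned weights).
n_layers = 24
d_model = 2048
head_dim = 64
n_heads = d_model // head_dim   # 2048 / 64 = 32 attention heads
vocab_size = 50304
seq_length = 2048

# Per decoder block: 4 * d_model^2 for the attention projections (Q, K, V, output)
# plus 8 * d_model^2 for a 4x-wide MLP (up and down projections).
per_block = 4 * d_model**2 + 8 * d_model**2
embeddings = vocab_size * d_model   # counted once, assuming tied input/output embeddings

total = n_layers * per_block + embeddings
print(f"{n_heads} heads, ~{total / 1e9:.2f}B parameters")  # 32 heads, ~1.31B
```

A rough count on these numbers lands near 1.3B parameters, consistent with the "1B" in the model's name.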