Ilyas Chahed committed
Commit: d3fe549
Parent(s): adf712d
Update README.md

README.md CHANGED
@@ -179,9 +179,9 @@ Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the traini
|--------------------|------------|-------------------------------------------|
| Precision | `bfloat16` | |
| Optimizer | AdamW | |
-| Max learning rate
+| Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
| Weight decay | 1e-1 | |
-| Batch size | 2048
+| Batch size | 2048 | |

The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. We also applied *BatchScaling* during the rampup, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
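To make the schedule above concrete, here is a minimal sketch of the rampup-plus-decay logic. It is not the training code: the function names, the linear rampup shape, the omission of the short LR warmup, and the decay start point are all illustrative assumptions; only the constants come from the table and paragraph above. BatchScaling keeps \\(T_{\mathrm{noise}}=\eta/\sqrt{b}\\) fixed, so during the rampup \\(\eta(b)=\eta_{\mathrm{max}}\sqrt{b/b_{\mathrm{max}}}\\).

```python
import math

# Constants from the table and paragraph above.
ETA_MAX = 6.4e-4          # maximal learning rate in the stable phase
B_MIN, B_MAX = 128, 2048  # batch-size rampup endpoints
RAMPUP_GT = 50            # rampup length in GT (gigatokens)
DECAY_GT = 500            # length of the exponential decay phase in GT
ETA_MIN = ETA_MAX / 256   # minimal learning rate at the end of decay

# Constant Adam noise temperature, fixed by the stable-phase settings.
T_NOISE = ETA_MAX / math.sqrt(B_MAX)

def batch_size(tokens_gt: float) -> int:
    """Batch-size rampup from B_MIN to B_MAX over the first RAMPUP_GT.
    A linear ramp is assumed here; the text only states the endpoints."""
    if tokens_gt >= RAMPUP_GT:
        return B_MAX
    return int(B_MIN + (tokens_gt / RAMPUP_GT) * (B_MAX - B_MIN))

def learning_rate(tokens_gt: float, decay_start_gt: float) -> float:
    """WSD-style schedule with BatchScaling during the rampup."""
    if tokens_gt < decay_start_gt:
        # Rampup + stable phase: keep T_NOISE = eta / sqrt(b) constant,
        # so eta scales with the square root of the current batch size.
        return T_NOISE * math.sqrt(batch_size(tokens_gt))
    # Decay phase: exponential decay from ETA_MAX down to ETA_MIN over DECAY_GT.
    progress = min((tokens_gt - decay_start_gt) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** progress

# Example: the decay is assumed to start at 5000 GT (a placeholder value;
# the excerpt above does not state the length of the stable phase).
for gt in (0, 25, 50, 5000, 5250, 5500):
    print(f"{gt:>5} GT  b={batch_size(gt):4d}  eta={learning_rate(gt, 5000):.2e}")
```

At 0 GT this gives \\(\eta=1.6\times 10^{-4}\\) with \\(b=128\\), rising to \\(6.4\times 10^{-4}\\) at the end of the rampup and falling to \\(2.5\times 10^{-6}\\) at the end of the 500 GT decay, consistent with \\(\eta_{\mathrm{min}}=\eta_{\mathrm{max}}/256\\).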
@@ -229,9 +229,9 @@ The model is based on the Mamba architecture ([Gu et al., 2023](https://arxiv.or
|--------------------|-----------|----------------------------------------|
| Layers | 64 | |
| `d_model` | 4096 | |
-| `d_state`
+| `d_state` | 16 | The SSM state dimension |
| Vocabulary | 65024 | |
-| Sequence length | 8192 | During stages 4 and LR Decay stage
+| Sequence length | 8192 | During stages 4 and LR Decay stage |

## Compute Infrastructure

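The architecture values in this table can be cross-checked against the published checkpoint configuration. A small sketch using `transformers` (it assumes the model id is the released `tiiuae/falcon-mamba-7b` repository that this README belongs to; the full config is printed rather than asserting specific attribute names):

```python
from transformers import AutoConfig

# Fetch the released configuration and inspect it against the table above:
# 64 layers, d_model 4096, d_state 16, vocabulary 65024.
config = AutoConfig.from_pretrained("tiiuae/falcon-mamba-7b")
print(config)
```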