Ilyas Chahed committed
Commit: d3fe549
Parent(s): adf712d
Update README.md

README.md CHANGED
@@ -179,9 +179,9 @@ Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the traini
|--------------------|------------|-------------------------------------------|
| Precision | `bfloat16` | |
| Optimizer | AdamW | |
-| Max learning rate
+| Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
| Weight decay | 1e-1 | |
-| Batch size | 2048
+| Batch size | 2048 | |

The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. We also applied *BatchScaling* during the rampup, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
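To make the schedule above concrete, here is a minimal sketch of the rampup-plus-decay logic. It is not the training code: the function names, the linear rampup shape, the omission of the short LR warmup, and the decay start point are all illustrative assumptions; only the constants come from the table and paragraph above. BatchScaling keeps \\(T_{\mathrm{noise}}=\eta/\sqrt{b}\\) fixed, so during the rampup \\(\eta(b)=\eta_{\mathrm{max}}\sqrt{b/b_{\mathrm{max}}}\\).

```python
import math

# Constants from the table and paragraph above.
ETA_MAX = 6.4e-4          # maximal learning rate in the stable phase
B_MIN, B_MAX = 128, 2048  # batch-size rampup endpoints
RAMPUP_GT = 50            # rampup length in GT (gigatokens)
DECAY_GT = 500            # length of the exponential decay phase in GT
ETA_MIN = ETA_MAX / 256   # minimal learning rate at the end of decay

# Constant Adam noise temperature, fixed by the stable-phase settings.
T_NOISE = ETA_MAX / math.sqrt(B_MAX)

def batch_size(tokens_gt: float) -> int:
    """Batch-size rampup from B_MIN to B_MAX over the first RAMPUP_GT.
    A linear ramp is assumed here; the text only states the endpoints."""
    if tokens_gt >= RAMPUP_GT:
        return B_MAX
    return int(B_MIN + (tokens_gt / RAMPUP_GT) * (B_MAX - B_MIN))

def learning_rate(tokens_gt: float, decay_start_gt: float) -> float:
    """WSD-style schedule with BatchScaling during the rampup."""
    if tokens_gt < decay_start_gt:
        # Rampup + stable phase: keep T_NOISE = eta / sqrt(b) constant,
        # so eta scales with the square root of the current batch size.
        return T_NOISE * math.sqrt(batch_size(tokens_gt))
    # Decay phase: exponential decay from ETA_MAX down to ETA_MIN over DECAY_GT.
    progress = min((tokens_gt - decay_start_gt) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** progress

# Example: the decay is assumed to start at 5000 GT (a placeholder value;
# the excerpt above does not state the length of the stable phase).
for gt in (0, 25, 50, 5000, 5250, 5500):
    print(f"{gt:>5} GT  b={batch_size(gt):4d}  eta={learning_rate(gt, 5000):.2e}")
```

At 0 GT this gives \\(\eta=1.6\times 10^{-4}\\) with \\(b=128\\), rising to \\(6.4\times 10^{-4}\\) at the end of the rampup and falling to \\(2.5\times 10^{-6}\\) at the end of the 500 GT decay, consistent with \\(\eta_{\mathrm{min}}=\eta_{\mathrm{max}}/256\\).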
@@ -229,9 +229,9 @@ The model is based on the Mamba architecture ([Gu et al., 2023](https://arxiv.or
|--------------------|-----------|----------------------------------------|
| Layers | 64 | |
| `d_model` | 4096 | |
-| `d_state`
+| `d_state` | 16 | The SSM state dimension |
| Vocabulary | 65024 | |
-| Sequence length | 8192 | During stages 4 and LR Decay stage
+| Sequence length | 8192 | During stages 4 and LR Decay stage |

## Compute Infrastructure

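The architecture values in this table can be cross-checked against the published checkpoint configuration. A small sketch using `transformers` (it assumes the model id is the released `tiiuae/falcon-mamba-7b` repository that this README belongs to; the full config is printed rather than asserting specific attribute names):

```python
from transformers import AutoConfig

# Fetch the released configuration and inspect it against the table above:
# 64 layers, d_model 4096, d_state 16, vocabulary 65024.
config = AutoConfig.from_pretrained("tiiuae/falcon-mamba-7b")
print(config)
```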