Ilyas Chahed committed on
Commit d3fe549
1 Parent(s): adf712d

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -179,9 +179,9 @@ Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the traini
  |--------------------|------------|-------------------------------------------|
  | Precision | `bfloat16` | |
  | Optimizer | AdamW | |
- | Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
+ | Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
  | Weight decay | 1e-1 | |
- | Batch size | 2048 | |
+ | Batch size | 2048 | |


  The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size ramp-up from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\) and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. We also applied *BatchScaling* during the ramp-up, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
@@ -229,9 +229,9 @@ The model is based on the Mamba architecture ([Gu et al., 2023](https://arxiv.or
  |--------------------|-----------|----------------------------------------|
  | Layers | 64 | |
  | `d_model` | 4096 | |
- | `d_state` | 16 | The SSM state dimension |
+ | `d_state` | 16 | The SSM state dimension |
  | Vocabulary | 65024 | |
- | Sequence length | 8192 | During stage 4 and the LR decay stage |
+ | Sequence length | 8192 | During stage 4 and the LR decay stage |

  ## Compute Infrastructure

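To make the *BatchScaling* rule described in the first hunk concrete: the Adam noise temperature T_noise = eta / sqrt(b) is held fixed while the batch size ramps from 128 to 2048. Below is a minimal illustrative sketch (not part of the model card); the helper name and the printed checkpoints are assumptions, only the constants come from the hyperparameter table.

```python
import math

# Constants taken from the hyperparameter table above.
ETA_MAX = 6.4e-4          # maximal learning rate in the stable phase
B_MIN, B_MAX = 128, 2048  # batch size ramp-up endpoints

# BatchScaling: keep the Adam noise temperature T_noise = eta / sqrt(b) constant,
# anchored so that the full batch size B_MAX runs at ETA_MAX.
T_NOISE = ETA_MAX / math.sqrt(B_MAX)

def scaled_lr(batch_size: int) -> float:
    """Learning rate that keeps eta / sqrt(batch_size) equal to T_NOISE."""
    return T_NOISE * math.sqrt(batch_size)

for b in (B_MIN, 256, 512, 1024, B_MAX):
    print(f"batch size {b:>4} -> lr {scaled_lr(b):.2e}")
# batch size  128 -> lr 1.60e-04
# ...
# batch size 2048 -> lr 6.40e-04
```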
 
 
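The architecture table in the second hunk can be cross-checked against the released checkpoint's config. A rough sketch, assuming the `tiiuae/falcon-mamba-7b` repository id and the usual Mamba-style config field names in `transformers` (both are assumptions, not stated in this commit):

```python
# Rough sketch: read the architecture hyperparameters back from the published
# config. Repo id and attribute names are assumptions (standard Mamba-style
# config fields in transformers); adjust if the actual config differs.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/falcon-mamba-7b")

print("Layers     :", config.num_hidden_layers)  # table says 64
print("d_model    :", config.hidden_size)        # table says 4096
print("d_state    :", config.state_size)         # table says 16
print("Vocabulary :", config.vocab_size)         # table says 65024
```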