Commit 9f26abf by JingweiZuo (1 parent: e3d99da): chore: update readme.md

README.md CHANGED
## Model Description

- **Developed by:** [https://www.tii.ae](https://www.tii.ae)
- **Model type:** Causal decoder-only
- **Architecture:** Mamba
- **Language(s) (NLP):** Mainly English
- **License:** TII Sindibad License 2.0

### Model Source

- **Paper:** *coming soon*.

# Usage
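
As a quick reference, here is a minimal, hedged sketch of loading the model with 🤗 `transformers` and generating text. The repository id and prompt below are placeholders (assumptions), not the card's official example; the card elsewhere recommends installing `"causal-conv1d>=1.4.0"` and `mamba-ssm` for the optimized Mamba kernels.

```python
# Minimal sketch, not the official example from this card.
# The repository id "tiiuae/Sindibad-7B" is an assumption; replace it with the actual model id.
# For the optimized Mamba kernels the card recommends:
#   pip install "causal-conv1d>=1.4.0" mamba-ssm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Sindibad-7B"  # assumed id, adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bfloat16
    device_map="auto",
)

inputs = tokenizer("Question: How many hours are in one day?\nAnswer:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```
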
# Training Details

## Training Data

Guillaume

## Training Procedure

Sindibad-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.

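Gigatron, the training stack, is not public. Purely as an illustration of the layout above (TP=1, PP=1, DP=256 with ZeRO-style optimizer-state sharding), the same arrangement can be sketched in plain PyTorch; the module below is a stand-in, not the actual model or training code.

```python
# Illustration only: NOT the Gigatron code. TP=1, PP=1, DP=256 amounts to pure
# data parallelism (a full model replica per rank), with ZeRO-style sharding of
# the AdamW optimizer state across the data-parallel ranks.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group("nccl")          # launched with one process per GPU (world size 256)
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()    # stand-in for the real 7B Mamba model
model = DDP(model, device_ids=[local_rank])   # TP=1, PP=1: full replica on every rank

# ZeRO(-1)-style sharding: each rank keeps only its slice of the optimizer state.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,
    lr=6.4e-4,
    weight_decay=0.1,
)
```
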
#### Training Hyperparameters

| **Hyperparameter** | **Value**  | **Comment**                                                   |
|--------------------|------------|---------------------------------------------------------------|
| Precision          | `bfloat16` |                                                               |
| Optimizer          | AdamW      |                                                               |
| Max learning rate  | 6.4e-4     | Following a WSD (warmup-stable-decay) learning rate schedule  |
| Weight decay       | 1e-1       |                                                               |
| Z-loss             | 1e-4       |                                                               |
| Batch size         | 2048-4096  |                                                               |

The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), then decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. We also applied *BatchScaling* during the rampup, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.

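To make the schedule concrete, here is a small, hedged sketch of the learning rate as a function of tokens seen. The numbers come from the description above; the linear rampup shape and the total token budget `TOTAL_GT` are assumptions for illustration only.

```python
import math

# Schedule constants from the card; "GT" = gigatokens.
B_MIN, B_MAX = 128, 2048     # batch size rampup bounds
RAMPUP_GT = 50.0             # batch size (and BatchScaling) rampup length
DECAY_GT = 500.0             # length of the final LR decay phase
ETA_MAX = 6.4e-4
ETA_MIN = ETA_MAX / 256
TOTAL_GT = 5000.0            # assumed total training budget, NOT stated in the card

def batch_size(t_gt: float) -> int:
    """Batch size rampup from B_MIN to B_MAX over the first RAMPUP_GT gigatokens (assumed linear)."""
    if t_gt >= RAMPUP_GT:
        return B_MAX
    return int(B_MIN + (B_MAX - B_MIN) * t_gt / RAMPUP_GT)

def learning_rate(t_gt: float) -> float:
    """WSD-style schedule with BatchScaling during the rampup."""
    b = batch_size(t_gt)
    if t_gt < RAMPUP_GT:
        # BatchScaling: keep the Adam noise temperature eta / sqrt(b) constant,
        # so eta scales with sqrt(b) relative to its value at the full batch size.
        return ETA_MAX * math.sqrt(b / B_MAX)
    if t_gt < TOTAL_GT - DECAY_GT:
        return ETA_MAX  # stable phase
    # Exponential decay from ETA_MAX down to ETA_MIN over the last DECAY_GT gigatokens.
    frac = min((t_gt - (TOTAL_GT - DECAY_GT)) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** frac
```
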
#### Speeds, Sizes, Times

The model training took roughly two months.

# Evaluation

Refer to our technical report for more details about performance evaluation.

# Technical Specifications

## Model Architecture and Objective

Sindibad-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

The model is based on the Mamba architecture ([Gu et al., 2023](https://arxiv.org/abs/2312.00752)).

| **Hyperparameter** | **Value** | **Comment**                            |
|--------------------|-----------|----------------------------------------|
| Layers             | 64        |                                        |
| `d_model`          | 4096      |                                        |
| `d_state`          | 16        | The SSM state dimension                |
| Vocabulary         | 65024     |                                        |
| Sequence length    | 8192      | During stage 4 and the LR decay stage  |

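The table maps fairly directly onto the Mamba configuration fields exposed by 🤗 `transformers`. The sketch below is a hedged illustration only: the field names are those of `transformers.MambaConfig`, which may not be how the released checkpoint is actually packaged, and unspecified fields keep the library defaults.

```python
# Hedged sketch: the architecture hyperparameters above expressed as a
# transformers MambaConfig. This is an assumption for illustration, NOT the
# checkpoint's official config.
from transformers import MambaConfig

config = MambaConfig(
    vocab_size=65024,       # Vocabulary
    hidden_size=4096,       # d_model
    state_size=16,          # d_state, the SSM state dimension
    num_hidden_layers=64,   # Layers
)
print(config)
```
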
## Compute Infrastructure

### Hardware

Sindibad-7B was trained on AWS SageMaker, using on average 256 H100 80GB GPUs in 32 p5 instances.

### Software

Sindibad-7B was trained using Gigatron, an internal distributed training codebase. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels.

# Citation

*Paper coming soon* 😊.
|