JingweiZuo committed • Commit 9f26abf • 1 parent: e3d99da

chore: update readme.md

Files changed (1): README.md (+58, -6)
README.md CHANGED
@@ -23,10 +23,15 @@ license: apache-2.0

## Model Description

- - **Model type:** Language model
- - **Language(s) (NLP):** English
- - **License:** Apache 2.0
+ - **Developed by:** [https://www.tii.ae](https://www.tii.ae)
+ - **Model type:** Causal decoder-only
+ - **Architecture:** Mamba
+ - **Language(s) (NLP):** Mainly English
+ - **License:** TII Sindibad License 2.0
+
+ ### Model Source
+
+ - **Paper:** *coming soon*.

# Usage

@@ -121,17 +126,33 @@ print(tokenizer.decode(outputs[0]))

</details>

- # Training Details
- Jingwei
+ # Training Details

## Training Data

Guillaume

## Training Procedure

- The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from \\(b_{\mathrm{min}}=128\times2048\\) to \\(b_{\mathrm{max}}=2048\times2048\\) tokens during the first 50 GT of training. In the stable phase, we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\) and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\eta_{\mathrm{max}} / 256\\) with an exponential schedule over 500 GT. We also applied *BatchScaling* during the rampup, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\eta / \sqrt{b}\\) is kept constant.

+ Sindibad-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
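Gigatron itself is not public, so purely as an illustration of the data-parallel + ZeRO combination mentioned in the line above, here is a minimal PyTorch sketch that wraps a placeholder model in `DistributedDataParallel` and shards the AdamW states across ranks with `ZeroRedundancyOptimizer`. The model, batch, and launch command are placeholders, not the actual training setup.

```python
# Illustrative only: data parallelism (DDP) with a ZeRO-style sharded optimizer.
# This is NOT the internal Gigatron codebase; model and batch are tiny placeholders.
# Launch with, e.g.: torchrun --nproc_per_node=8 ddp_zero_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU = one DP rank
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()     # placeholder for the 7B Mamba model
    model = DDP(model, device_ids=[local_rank])    # gradients all-reduced across DP ranks

    # ZeRO shards the AdamW optimizer states across the data-parallel group,
    # so each rank only stores 1/DP of the moments.
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.AdamW,
        lr=6.4e-4,
        weight_decay=0.1,
    )

    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).pow(2).mean()                  # dummy objective for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```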
+
+ #### Training Hyperparameters
+
+ | **Hyperparameter** | **Value**  | **Comment**                                                  |
+ |--------------------|------------|--------------------------------------------------------------|
+ | Precision          | `bfloat16` |                                                              |
+ | Optimizer          | AdamW      |                                                              |
+ | Max learning rate  | 6.4e-4     | Following a WSD (warmup-stable-decay) learning rate schedule |
+ | Weight decay       | 1e-1       |                                                              |
+ | Z-loss             | 1e-4       |                                                              |
+ | Batch size         | 2048-4096  |                                                              |
+
+ The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase, we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\) and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. We also applied *BatchScaling* during the rampup, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
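To make the schedule concrete, the sketch below computes batch size and learning rate as a function of tokens seen: a rampup from \\(b_{\mathrm{min}}\\) to \\(b_{\mathrm{max}}\\) over the first 50 GT with BatchScaling (\\(\eta \propto \sqrt{b}\\), so \\(T_{\mathrm{noise}}\\) stays constant), a stable phase at \\(\eta_{\mathrm{max}}\\), and an exponential decay to \\(\eta_{\mathrm{min}}\\) over the final 500 GT. The linear rampup shape and the total token budget are assumptions, not values stated in this card.

```python
# Illustrative sketch of the WSD schedule with batch-size rampup and BatchScaling.
# Only b_min, b_max, eta_max, eta_min = eta_max/256, the 50 GT rampup and the
# 500 GT decay come from the description above; the rest is assumed.
import math

B_MIN, B_MAX = 128, 2048          # batch size (sequences) at start / after rampup
ETA_MAX = 6.4e-4
ETA_MIN = ETA_MAX / 256
RAMPUP_GT = 50e9                  # batch-size rampup over the first 50 GT
DECAY_GT = 500e9                  # exponential LR decay over the final 500 GT
TOTAL_GT = 5500e9                 # assumed total token budget (placeholder)

def batch_size(tokens: float) -> int:
    """Assumed linear rampup from B_MIN to B_MAX over the first RAMPUP_GT tokens."""
    frac = min(tokens / RAMPUP_GT, 1.0)
    return round(B_MIN + frac * (B_MAX - B_MIN))

def learning_rate(tokens: float) -> float:
    if tokens < RAMPUP_GT:
        # BatchScaling: keep the Adam noise temperature T = eta / sqrt(b) constant,
        # i.e. eta(b) = ETA_MAX * sqrt(b / B_MAX).
        return ETA_MAX * math.sqrt(batch_size(tokens) / B_MAX)
    if tokens < TOTAL_GT - DECAY_GT:
        return ETA_MAX                                  # stable phase
    # Exponential decay from ETA_MAX down to ETA_MIN over the last DECAY_GT tokens.
    frac = (tokens - (TOTAL_GT - DECAY_GT)) / DECAY_GT
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** min(frac, 1.0)

if __name__ == "__main__":
    for gt in (0, 10, 50, 1000, 5200, 5500):
        t = gt * 1e9
        print(f"{gt:>5} GT  b={batch_size(t):>4}  eta={learning_rate(t):.2e}")
```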
+
+ #### Speeds, Sizes, Times
+
+ The model training took roughly two months.

# Evaluation

@@ -157,3 +178,34 @@ pip install "causal-conv1d>=1.4.0" mamba-ssm

Refer to our technical report for more details about performance evaluation.

+
+ # Technical Specifications
+
+ ## Model Architecture and Objective
+
+ Sindibad-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
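Concretely, the objective is the usual shifted cross-entropy: the logits produced at position \\(t\\) are scored against the token at position \\(t+1\\). The minimal sketch below uses a random tensor in place of real model logits.

```python
# Minimal illustration of the causal language modeling objective:
# logits at position t are scored against the token at position t + 1.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 65024, 16, 2
token_ids = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)   # stand-in for model(token_ids)

shift_logits = logits[:, :-1, :]                   # predictions for positions 1..T-1
shift_labels = token_ids[:, 1:]                    # the tokens that actually follow
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
print(loss.item())   # per-token negative log-likelihood of the next token
```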
+
+ The model is based on the Mamba architecture ([Gu et al., 2023](https://arxiv.org/abs/2312.00752)).
+
+ | **Hyperparameter** | **Value** | **Comment**                           |
+ |--------------------|-----------|---------------------------------------|
+ | Layers             | 64        |                                       |
+ | `d_model`          | 4096      |                                       |
+ | `d_state`          | 16        | The SSM state dimension               |
+ | Vocabulary         | 65024     |                                       |
+ | Sequence length    | 8192      | During stage 4 and the LR decay stage |
+
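As a rough sanity check on the table, one can estimate the parameter count from `d_model`, the layer count, and the vocabulary, assuming Mamba-1 defaults that this card does not state (expansion factor 2, conv kernel 4, `dt_rank = d_model/16`, untied embeddings); with those assumptions the estimate lands around 7B, consistent with the model name.

```python
# Rough parameter-count estimate for the architecture in the table above.
# Assumptions (NOT stated in the card): Mamba-1 defaults of expand=2, d_conv=4,
# dt_rank = d_model // 16, one RMSNorm per block, untied input/output embeddings.
d_model, n_layers, d_state, vocab = 4096, 64, 16, 65024
d_inner = 2 * d_model                     # expand = 2
dt_rank = d_model // 16
d_conv = 4

per_block = (
    d_model * 2 * d_inner                 # in_proj (x and gate branches)
    + d_inner * d_conv + d_inner          # depthwise conv1d weight + bias
    + d_inner * (dt_rank + 2 * d_state)   # x_proj -> (dt, B, C)
    + dt_rank * d_inner + d_inner         # dt_proj weight + bias
    + d_inner * d_state                   # A_log
    + d_inner                             # D
    + d_inner * d_model                   # out_proj
    + d_model                             # block RMSNorm
)
embeddings = 2 * vocab * d_model          # input embedding + LM head (assumed untied)
total = n_layers * per_block + embeddings + d_model   # + final norm
print(f"~{total / 1e9:.2f}B parameters")  # roughly 7B under these assumptions
```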
+ ## Compute Infrastructure
+
+ ### Hardware
+
+ Sindibad-7B was trained on AWS SageMaker, using on average 256 H100 80GB GPUs in 32 p5 instances.
+
+ ### Software
+
+ Sindibad-7B was trained with Gigatron, an internal distributed training codebase. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels.
+
+ # Citation
+
+ *Paper coming soon* 😊.