Commit 9f26abf by JingweiZuo (1 parent: e3d99da): chore: update readme.md

README.md CHANGED
## Model Description

- **Developed by:** [https://www.tii.ae](https://www.tii.ae)
- **Model type:** Causal decoder-only
- **Architecture:** Mamba
- **Language(s) (NLP):** Mainly English
- **License:** TII Sindibad License 2.0

### Model Source

- **Paper:** *coming soon*.

# Usage
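
As a quick reference, here is a minimal, hedged sketch of loading the model with 🤗 `transformers` and generating text. The repository id and prompt below are placeholders (assumptions), not the card's official example; the card elsewhere recommends installing `"causal-conv1d>=1.4.0"` and `mamba-ssm` for the optimized Mamba kernels.

```python
# Minimal sketch, not the official example from this card.
# The repository id "tiiuae/Sindibad-7B" is an assumption; replace it with the actual model id.
# For the optimized Mamba kernels the card recommends:
#   pip install "causal-conv1d>=1.4.0" mamba-ssm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Sindibad-7B"  # assumed id, adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bfloat16
    device_map="auto",
)

inputs = tokenizer("Question: How many hours are in one day?\nAnswer:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```
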
# Training Details

## Training Data

Guillaume

## Training Procedure

Sindibad-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.

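Gigatron, the training stack, is not public. Purely as an illustration of the layout above (TP=1, PP=1, DP=256 with ZeRO-style optimizer-state sharding), the same arrangement can be sketched in plain PyTorch; the module below is a stand-in, not the actual model or training code.

```python
# Illustration only: NOT the Gigatron code. TP=1, PP=1, DP=256 amounts to pure
# data parallelism (a full model replica per rank), with ZeRO-style sharding of
# the AdamW optimizer state across the data-parallel ranks.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group("nccl")          # launched with one process per GPU (world size 256)
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()    # stand-in for the real 7B Mamba model
model = DDP(model, device_ids=[local_rank])   # TP=1, PP=1: full replica on every rank

# ZeRO(-1)-style sharding: each rank keeps only its slice of the optimizer state.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,
    lr=6.4e-4,
    weight_decay=0.1,
)
```
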
#### Training Hyperparameters

| **Hyperparameter** | **Value**  | **Comment**                                                   |
|--------------------|------------|---------------------------------------------------------------|
| Precision          | `bfloat16` |                                                               |
| Optimizer          | AdamW      |                                                               |
| Max learning rate  | 6.4e-4     | Following a WSD (warmup-stable-decay) learning rate schedule  |
| Weight decay       | 1e-1       |                                                               |
| Z-loss             | 1e-4       |                                                               |
| Batch size         | 2048-4096  |                                                               |

The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training. In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), then decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT. We also applied *BatchScaling* during the rampup, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.

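To make the schedule concrete, here is a small, hedged sketch of the learning rate as a function of tokens seen. The numbers come from the description above; the linear rampup shape and the total token budget `TOTAL_GT` are assumptions for illustration only.

```python
import math

# Schedule constants from the card; "GT" = gigatokens.
B_MIN, B_MAX = 128, 2048     # batch size rampup bounds
RAMPUP_GT = 50.0             # batch size (and BatchScaling) rampup length
DECAY_GT = 500.0             # length of the final LR decay phase
ETA_MAX = 6.4e-4
ETA_MIN = ETA_MAX / 256
TOTAL_GT = 5000.0            # assumed total training budget, NOT stated in the card

def batch_size(t_gt: float) -> int:
    """Batch size rampup from B_MIN to B_MAX over the first RAMPUP_GT gigatokens (assumed linear)."""
    if t_gt >= RAMPUP_GT:
        return B_MAX
    return int(B_MIN + (B_MAX - B_MIN) * t_gt / RAMPUP_GT)

def learning_rate(t_gt: float) -> float:
    """WSD-style schedule with BatchScaling during the rampup."""
    b = batch_size(t_gt)
    if t_gt < RAMPUP_GT:
        # BatchScaling: keep the Adam noise temperature eta / sqrt(b) constant,
        # so eta scales with sqrt(b) relative to its value at the full batch size.
        return ETA_MAX * math.sqrt(b / B_MAX)
    if t_gt < TOTAL_GT - DECAY_GT:
        return ETA_MAX  # stable phase
    # Exponential decay from ETA_MAX down to ETA_MIN over the last DECAY_GT gigatokens.
    frac = min((t_gt - (TOTAL_GT - DECAY_GT)) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** frac
```
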
#### Speeds, Sizes, Times

The model training took roughly two months.

# Evaluation

Refer to our technical report for more details about performance evaluation.

# Technical Specifications

## Model Architecture and Objective

Sindibad-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

The model is based on the Mamba architecture ([Gu et al., 2023](https://arxiv.org/abs/2312.00752)).

| **Hyperparameter** | **Value** | **Comment**                            |
|--------------------|-----------|----------------------------------------|
| Layers             | 64        |                                        |
| `d_model`          | 4096      |                                        |
| `d_state`          | 16        | The SSM state dimension                |
| Vocabulary         | 65024     |                                        |
| Sequence length    | 8192      | During stage 4 and the LR decay stage  |

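The table maps fairly directly onto the Mamba configuration fields exposed by 🤗 `transformers`. The sketch below is a hedged illustration only: the field names are those of `transformers.MambaConfig`, which may not be how the released checkpoint is actually packaged, and unspecified fields keep the library defaults.

```python
# Hedged sketch: the architecture hyperparameters above expressed as a
# transformers MambaConfig. This is an assumption for illustration, NOT the
# checkpoint's official config.
from transformers import MambaConfig

config = MambaConfig(
    vocab_size=65024,       # Vocabulary
    hidden_size=4096,       # d_model
    state_size=16,          # d_state, the SSM state dimension
    num_hidden_layers=64,   # Layers
)
print(config)
```
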
## Compute Infrastructure

### Hardware

Sindibad-7B was trained on AWS SageMaker, using on average 256 H100 80GB GPUs in 32 p5 instances.

### Software

Sindibad-7B was trained using Gigatron, an internal distributed training codebase. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels.

# Citation

*Paper coming soon* 😊.
|