meg (HF staff) committed
Commit 2e65771
1 parent: 5c72534

Attempts to fill out the 1B3 model details that diverge from the main one.


Here I am using https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/smaller_models/tr11b-1B3-ml.slurm to help flesh it out.
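
For context, the hardware bullets being added (nodes, GPUs per node, CPUs per task, tasks per node) and the model-shape bullets (layers, heads, hidden size, sequence length) map onto the launcher's `#SBATCH` directives and shell variables. Below is a minimal sketch of how one might skim those settings out of a local copy of the script; the local file name and the `NLAYERS`/`NHIDDEN`/`NHEADS`/`SEQ_LEN` variable names are assumptions based on the tr11 launcher convention, not quotes from the file.

```python
import re
from pathlib import Path

# Assumed local copy of the launcher script linked above.
SLURM_FILE = Path("tr11b-1B3-ml.slurm")

# Shell variables assumed to carry the model shape in the tr11 launchers.
MODEL_VARS = {"NLAYERS", "NHIDDEN", "NHEADS", "SEQ_LEN"}

def summarize(text: str) -> dict:
    """Collect #SBATCH directives and simple NAME=value assignments."""
    summary = {"sbatch": [], "model": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#SBATCH"):
            # e.g. --nodes, --ntasks-per-node, --cpus-per-task, --gres=gpu:N
            summary["sbatch"].append(line)
            continue
        match = re.match(r"^([A-Z_]+)=([^#\s]+)", line)
        if match and match.group(1) in MODEL_VARS:
            summary["model"][match.group(1)] = match.group(2)
    return summary

if __name__ == "__main__":
    info = summarize(SLURM_FILE.read_text())
    for directive in info["sbatch"]:
        print(directive)
    for name, value in info["model"].items():
        print(f"{name} = {value}")
```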

Files changed (1)
  1. README.md +16 -21
README.md CHANGED
@@ -120,11 +120,11 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi

 * ALiBI positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GeLU activation functions

-* 176 billion parameters:
+* 1.3 billion parameters:

-    * 70 layers, 112 attention heads
+    * 24 layers, 16 attention heads

-    * Hidden layers are 14336-dimensional
+    * Hidden layers are 2048-dimensional

 * Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))

@@ -132,12 +132,14 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi

 **Compute infrastructure:** Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).

-* Hardware: 384 A100 80GB GPUs (48 nodes):
-
-    * Additional 32 A100 80GB GPUs (4 nodes) in reserve
-
-    * 8 GPUs per node Using NVLink 4 inter-gpu connects, 4 OmniPath links
+* Hardware: 128 V100 80GB GPUs (16 nodes):

+    * 4 GPUs per node
+
+    * 40 CPUs per task
+
+    * 1 task per node
+
 * CPU: AMD

 * CPU memory: 512GB per node

@@ -163,28 +165,21 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi

 #### **Training**

-
-_In progress._
-
-Current training logs: [Tensorboard link](https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs/)
-
 - Checkpoint size:

-    - Bf16 weights: 329GB
+    - Fp16 weights: 2.6GB (# params * 2)

-    - Full checkpoint with optimizer states: 2.3TB
+    - Full checkpoint with optimizer states: --

-- Training throughput: About 150 TFLOP per GPU per second
+- Training throughput: --

-- Number of epochs: 1 (*current target*)
+- Number of epochs: 1

 - Dates:

-    - Started 11th March, 2022 11:42am PST
-
-    - Estimated end: 5th July, 2022
+    - Start: 11th March, 2022 11:42am PST

-    - Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments)
+    - End: 20 May, 2022

 - Server training location: Île-de-France, France
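
As a quick arithmetic check on the figures committed above, the sketch below re-derives the two numbers that follow from the others: the per-head dimension (not stated in the card, so an assumption that the 16 heads evenly split the 2048-dimensional hidden size) and the fp16 checkpoint size implied by the card's "(# params * 2)" note.

```python
# Values exactly as they appear in the committed card.
hidden_size = 2048
num_heads = 16
n_params = 1.3e9  # "1.3 billion parameters" (rounded headline figure)

# Assumption: attention heads evenly split the hidden dimension.
head_dim = hidden_size // num_heads
assert head_dim == 128  # 2048 / 16

# Fp16 stores 2 bytes per parameter, matching the "(# params * 2)" note.
fp16_gb = n_params * 2 / 1e9
print(f"head dim: {head_dim}")                # 128
print(f"fp16 checkpoint: ~{fp16_gb:.1f} GB")  # ~2.6 GB
```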