Attempts to fill out the 1B3 model details that diverge from the main one. (#1)

- Attempts to fill out the 1B3 model details that diverge from the main one. (2e65771c1edd81f6668220981f1264cd52e4b82e)
- Update hardware info (006a5e8a8f3f0152c00f40a630add316575fe6fc)
- Update hardware info (3be67070e2beeda3245efa61558c661eea2f52f1)

Files changed (1)

README.md +18 -23

@@ -120,11 +120,11 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi
 * ALiBI positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GeLU activation functions
-* 176 billion parameters:
-    * 70 layers, 112 attention heads
-    * Hidden layers are 14336-dimensional
     * Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))
@@ -132,17 +132,19 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi
 **Compute infrastructure:** Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).
-* Hardware: 384 A100 80GB GPUs (48 nodes):
-    * Additional 32 A100 80GB GPUs (4 nodes) in reserve
-    *  8 GPUs per node Using NVLink 4 inter-gpu connects, 4 OmniPath links
     *   CPU: AMD
-    *   CPU memory: 512GB per node
-    *   GPU memory: 640GB per node
     *   Inter-node connect: Omni-Path Architecture (OPA)
@@ -163,28 +165,21 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi
 #### **Training**
-_In progress._
-Current training logs: [Tensorboard link](https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs/)
 - Checkpoint size:
-    - Bf16 weights: 329GB
-    - Full checkpoint with optimizer states: 2.3TB
-- Training throughput: About 150 TFLOP per GPU per second
-- Number of epochs: 1 (*current target*)
 - Dates:
-    - Started 11th March, 2022 11:42am PST
-    - Estimated end: 5th July, 2022
-- Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments)
 - Server training location: Île-de-France, France

 * ALiBI positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GeLU activation functions
+* 1.3 billion parameters:
+    * 24 layers, 16 attention heads
+    * Hidden layers are 2048-dimensional
     * Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))
 **Compute infrastructure:** Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).
+* Hardware: 64 V100 16/32GB GPUs (16 nodes):
+    *   4 GPUs per node
+    *   40 CPUs per task
+    *   1 task per node
     *   CPU: AMD
+    *   CPU memory: 160GB per node
+    *   GPU memory: 64GB or 128GB (depending on node availability during training) per node
     *   Inter-node connect: Omni-Path Architecture (OPA)
 #### **Training**
 - Checkpoint size:
+    - Fp16 weights: 2.6GB (# params * 2)
+    - Full checkpoint with optimizer states: --
+- Training throughput: --
+- Number of epochs: 1
 - Dates:
+    - Start: 11th March, 2022 11:42am PST
+    - End: 20 May, 2022
 - Server training location: Île-de-France, France
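
The hardware and checkpoint figures added in this diff can be cross-checked with a little arithmetic. The sketch below is only an illustration and is not part of the README. The function names are hypothetical, and it assumes decimal gigabytes (1 GB = 10^9 bytes), 2 bytes per fp16 parameter, and that every GPU in a node has the same memory size (16GB or 32GB).

```python
def fp16_weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of the raw fp16 weights in decimal gigabytes (# params * 2 bytes)."""
    return n_params * bytes_per_param / 1e9


def total_gpus(nodes: int, gpus_per_node: int) -> int:
    """Total GPU count across all nodes."""
    return nodes * gpus_per_node


def gpu_memory_per_node_gb(gpus_per_node: int, gb_per_gpu: int) -> int:
    """Aggregate GPU memory on one node, assuming identical GPUs."""
    return gpus_per_node * gb_per_gpu


if __name__ == "__main__":
    # Values taken from the added README lines: 1.3 billion parameters,
    # 16 nodes with 4 V100 GPUs each, 16GB or 32GB per GPU.
    print("fp16 weights (GB):", fp16_weights_gb(1.3e9))
    print("total GPUs:", total_gpus(16, 4))
    for gb in (16, 32):
        print(f"GPU memory per node with {gb}GB cards (GB):",
              gpu_memory_per_node_gb(4, gb))
```

Under these assumptions the results agree with the values in the diff: 2.6GB of fp16 weights, 64 GPUs in total, and 64GB or 128GB of GPU memory per node.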