thomwolf HF staff committed on
Commit
0cddfe8
1 Parent(s): 85d3785

Update README of the model

  1. README.md +49 -1
README.md CHANGED
@@ -1 +1,49 @@
- To be written, for now please see: [README.md](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml)
+ # BigScience Large Language Model Training
+ Training a multilingual 176-billion-parameter model in the open
+ ![BigScience Logo](https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png)
+
+ [BigScience](https://bigscience.huggingface.co) is an open and collaborative workshop on the study and creation of very large language models, gathering more than 1000 researchers around the world. You can find more information on the main website at https://bigscience.huggingface.co.
+
+ The training of BigScience’s main model started on **March 11, 2022 11:42am PST** and will last 3-4 months on the 416 A100 GPUs of the Jean Zay public supercomputer.
+
+ You can follow the training at [https://twitter.com/BigScienceLLM](https://twitter.com/BigScienceLLM)
+
+ ## More information on the model, dataset, hardware, and environmental considerations
+
+ ### **The model**
+
+ - 176B-parameter decoder-only architecture (GPT-like)
+ - 70 layers, 112 attention heads per layer, hidden dimensionality of 14336, sequence length of 2048 tokens
+ - ALiBi positional embeddings, GeLU activation function
+ - **More information**:
+     - Blog post summarizing how the architecture, size, shape, and pre-training duration were selected: [https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours](https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours)
+     - More details on the architecture/optimizer: [https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml)
+
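The architecture figures listed for the model roughly account for the 176B headline number. The following is a back-of-the-envelope sketch, not the project's own accounting. It assumes the common approximation of about 12·h² weights per transformer layer (4h² for attention plus 8h² for an MLP four times wider than the hidden size, a width this README does not state), adds a vocabulary embedding of V·h using the tokenizer vocabulary size quoted in the dataset section, and ignores biases and layer norms. The function name `estimate_params` is hypothetical.

```python
# Rough parameter count for a GPT-like decoder, using figures from this README.
# Assumption (not stated in the README): each layer holds ~12 * h^2 weights,
# i.e. 4h^2 for attention and 8h^2 for an MLP that is 4x wider than h.
# Biases and layer-norm parameters are ignored.
N_LAYERS = 70
HIDDEN = 14336
N_HEADS = 112
VOCAB = 250_680  # tokenizer vocabulary size quoted in this README


def estimate_params(n_layers: int, hidden: int, vocab: int) -> int:
    """Approximate weight count: transformer layers plus token embeddings."""
    per_layer = 12 * hidden * hidden
    embeddings = vocab * hidden
    return n_layers * per_layer + embeddings


head_dim = HIDDEN // N_HEADS
total = estimate_params(N_LAYERS, HIDDEN, VOCAB)

print(f"head dimension: {head_dim}")
print(f"estimated parameters: {total / 1e9:.1f}B")
```

Under these assumptions the estimate comes out at about 176.2B, with 128 dimensions per attention head, which is consistent with the "176B" figure.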
+ ### **The dataset**
+
+ - Multilingual: 46 languages. The full list is here: [https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
+ - 341.6 billion tokens (1.5 TB of text data)
+ - Tokenizer vocabulary: 250,680 tokens
+ - **More information**:
+     - Blog post detailing the design choices made during dataset creation: [https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
+
+ ### **The engineering side**
+
+ - Number of GPUs used for the training: 384 A100 GPUs with 80 GB of memory each
+ - One copy of the model takes 48 GPUs (using 60 GB of memory on each GPU)
+ - Checkpoint size: the bf16 weights alone are 329 GB; the full checkpoint with optimizer states is 2.3 TB
+ - Training throughput: about 150 TFLOPs
+ - Estimated training time: 3-4 months, depending on throughput and unexpected events
+ - **More information**:
+     - Blog post on the hardware/engineering side: [https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model](https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model)
+     - Details on the distributed setup used for the training: [https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml)
+     - TensorBoard, updated during the training: [https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss](https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss)
+     - Details on the obstacles overcome during the preparation on the engineering side (instabilities, optimization of training throughput, and many other technical tricks and questions): [https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md)
+
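The engineering figures can be cross-checked with some simple arithmetic. This is a minimal sketch under stated assumptions, not a description of the actual distributed setup: it uses the rounded 176B parameter count, 2 bytes per bf16 weight, and decimal units (1 GB = 10^9 bytes), none of which the README spells out. The helper `data_parallel_replicas` is a hypothetical name.

```python
# Figures quoted in this README.
TOTAL_GPUS = 384
GPUS_PER_REPLICA = 48           # GPUs needed to hold one copy of the model
N_PARAMS = 176e9                # rounded headline parameter count (assumption)
FULL_CHECKPOINT_BYTES = 2.3e12  # "2.3TB", read as decimal terabytes (assumption)

BYTES_PER_BF16 = 2              # a bf16 value occupies 16 bits


def data_parallel_replicas(total_gpus: int, gpus_per_replica: int) -> int:
    """Number of full model copies that fit on the available GPUs."""
    if total_gpus % gpus_per_replica:
        raise ValueError("GPUs do not divide evenly into model replicas")
    return total_gpus // gpus_per_replica


replicas = data_parallel_replicas(TOTAL_GPUS, GPUS_PER_REPLICA)
bf16_weights_gb = N_PARAMS * BYTES_PER_BF16 / 1e9
checkpoint_bytes_per_param = FULL_CHECKPOINT_BYTES / N_PARAMS

print(f"model replicas: {replicas}")
print(f"bf16 weights: ~{bf16_weights_gb:.0f} GB")
print(f"full checkpoint: ~{checkpoint_bytes_per_param:.1f} bytes per parameter")
```

With these inputs, the 384 GPUs hold 8 copies of the model. The bf16 weight estimate of roughly 352 GB is in the same ballpark as the quoted 329 GB; the gap may come from rounding the parameter count or from binary versus decimal units, which the README does not specify.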
+ ### **Environmental considerations**
+
+ - [Jean Zay](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html), the supercomputer we are using for model training, is mostly powered by nuclear energy, which is a low-carbon energy source.
+ - Significant efforts were made to make the computing infrastructure as efficient as possible; the heat generated by the hardware is even used for heating buildings on campus!
+ - **More information**:
+     - We are currently working on a precise estimate of the carbon emitted during all of the steps of model training, including intermediate experiments as well as inference.
+     - More soon!