meg (HF staff) committed on
Commit
6cb7cc9
1 Parent(s): 826ae56

Updates the model card in light of more developed format. (#11)


- Updates the model card in light of more developed format. (e7e28b1ca071f697ad0cc98d30b32a0d3887768a)

Files changed (1)
  1. README.md +195 -187
README.md CHANGED
@@ -58,9 +58,9 @@ pipeline_tag: text-generation
58
  <img src="https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png" alt="BigScience Logo" width="200"/>
59
 
60
 
61
- Version 1.2 / 17.Jun.2022 - Current latest checkpoint: **Global step 80100**
62
 
63
- ## Table of Contents
64
  1. [Model Details](#model-details)
65
  2. [Uses](#uses)
66
  3. [Training Data](#training-data)
@@ -71,17 +71,22 @@ Version 1.2 / 17.Jun.2022 - Current latest checkpoint: **Global step 80100**
71
  8. [More Information](#more-information)
72
  9. [Model Card Authors](#model-card-authors)
73
 
74
- ## Model Details
 
 
75
 
76
- ### Basics
77
- *This section provides information for anyone who wants to know about the model.*
 
 
 
78
 
79
  <details>
80
- <summary>Click to expand</summary> <br/>
81
-
82
  **Developed by:** BigScience ([website](https://bigscience.huggingface.co))
83
 
84
- * All collaborators are either volunteers or have an agreement with their employer. *(Further breakdown of participants forthcoming.)*
85
 
86
  **Model Type:** Transformer-based Language Model
87
 
@@ -107,15 +112,18 @@ Version 1.2 / 17.Jun.2022 - Current latest checkpoint: **Global step 80100**
107
 
108
  </details>
109
 
110
- ### Technical Specifications
111
- *This section provides information for people who work on model development.*
 
112
 
113
  <details>
114
- <summary>Click to expand</summary><br/>
115
 
116
  Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details on replicating training.
117
 
118
- **Model Architecture:** Modified from Megatron-LM GPT2 (see [paper](https://arxiv.org/abs/1909.08053), [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):
 
 
119
 
120
  * Decoder-only architecture
121
 
@@ -133,67 +141,130 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi
133
 
134
  **Objective Function:** Cross Entropy with mean reduction (see [API documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)).
135
 
136
- **Compute infrastructure:** Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).
 
 
 
137
 
138
- * Hardware: 384 A100 80GB GPUs (48 nodes):
139
 
140
- * Additional 32 A100 80GB GPUs (4 nodes) in reserve
141
 
142
- * 8 GPUs per node Using NVLink 4 inter-gpu connects, 4 OmniPath links
143
 
144
- * CPU: AMD
145
 
146
- * CPU memory: 512GB per node
147
 
148
- * GPU memory: 640GB per node
149
 
150
- * Inter-node connect: Omni-Path Architecture (OPA)
151
 
152
- * NCCL-communications network: a fully dedicated subnet
153
 
154
- * Disc IO network: shared network with other types of nodes
155
 
156
- * Software:
157
-
158
- * Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
159
 
160
- * DeepSpeed ([Github link](https://github.com/microsoft/DeepSpeed))
161
 
162
- * PyTorch (pytorch-1.11 w/ CUDA-11.5; see [Github link](https://github.com/pytorch/pytorch))
163
 
164
- * apex ([Github link](https://github.com/NVIDIA/apex))
165
 
 
 
 
166
 
167
- #### **Training**
168
 
169
-
170
- _In progress._
 
171
 
172
- Current training logs: [Tensorboard link](https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs/)
 
173
 
174
- - Checkpoint size:
175
-
176
- - Bf16 weights: 329GB
177
-
178
- - Full checkpoint with optimizer states: 2.3TB
179
 
180
- - Training throughput: About 150 TFLOP per GPU per second
181
 
182
- - Number of epochs: 1 (*current target*)
183
 
184
- - Dates:
185
 
186
- - Started 11th March, 2022 11:42am PST
187
 
188
- - Estimated end: 5th July, 2022
189
 
190
- - Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments)
191
 
192
- - Server training location: Île-de-France, France
193
 
194
- #### **Tokenization**
195
 
196
- The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)) is a learned subword tokenizer trained using:
 
 
197
 
198
  - A byte-level Byte Pair Encoding (BPE) algorithm
199
 
@@ -201,40 +272,58 @@ The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)) is a l
201
 
202
  - A vocabulary size of 250,680
203
 
204
- It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.
205
 
206
- </details>
207
 
 
208
 
209
- ### Environmental Impact
210
 
211
- <details>
212
- <summary>Click to expand</summary><br/>
213
 
214
  The training supercomputer, Jean Zay ([website](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html)), uses mostly nuclear energy. The heat generated by it is reused for heating campus housing.
215
 
216
- **Estimated carbon emissions:** *(Forthcoming upon completion of training.)*
217
 
218
- **Estimated electricity usage:** *(Forthcoming upon completion of training.)*
219
-
220
 
221
  </details>
222
- <p>&nbsp;</p>
223
 
224
- ## Uses
225
 
226
- *This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
227
- It provides information for anyone considering using the model or who is affected by the model.*
228
 
 
 
229
 
230
  <details>
231
- <summary>Click to expand</summary><br/>
232
 
233
- ### Intended Use
234
 
235
  This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.
236
 
237
- #### **Direct Use**
238
 
239
  - Text generation
240
 
@@ -242,7 +331,7 @@ This model is being created in order to enable public research on large language
242
 
243
  - Examples: Cloze tests, counterfactuals, generations with reframings
244
 
245
- #### **Downstream Use**
246
 
247
  - Tasks that leverage language models include: Information Extraction, Question Answering, Summarization
248
 
@@ -251,7 +340,7 @@ This model is being created in order to enable public research on large language
251
 
252
  See the [BLOOM License](https://huggingface.co/spaces/bigscience/license), Attachment A, for detailed usage restrictions. The below list is non-exhaustive, but lists some easily foreseeable problematic use cases.
253
 
254
- #### **Out-of-scope Uses**
255
 
256
  Using the model in [high-stakes](#high-stakes) settings is out of scope for this model.  The model is not designed for [critical decisions](#critical-decisions) nor uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but is not correct.
257
 
@@ -263,7 +352,7 @@ Out-of-scope Uses Include:
263
 
264
  - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct
265
 
266
- #### **Misuse**
267
 
268
  Intentionally using the model for harm, violating [human rights](#human-rights), or other kinds of malicious activities, is a misuse of this model. This includes:
269
 
@@ -283,9 +372,9 @@ Intentionally using the model for harm, violating [human rights](#human-rights),
283
 
284
  - Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://huggingface.co/spaces/bigscience/license)
285
 
286
- ### Intended Users
287
 
288
- #### **Direct Users**
289
 
290
  - General Public
291
 
@@ -301,118 +390,30 @@ Intentionally using the model for harm, violating [human rights](#human-rights),
301
 
302
  - Community advocates, including human and civil rights groups
303
 
304
- #### Indirect Users
305
 
306
  - Users of derivatives created by Direct Users, such as those using software with an [intended use](#intended-use)
307
 
308
  - Users of [Derivatives of the Model, as described in the License](https://huggingface.co/spaces/bigscience/license)
309
 
310
- #### Others Affected (Parties Prenantes)
311
 
312
  - People and groups referred to by the LLM
313
 
314
  - People and groups exposed to outputs of, or decisions based on, the LLM
315
 
316
  - People and groups whose original work is included in the LLM
317
-
318
- </details>
319
- <p>&nbsp;</p>
320
-
321
- ## Training Data
322
- *This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*
323
-
324
-
325
- <details>
326
- <summary>Click to expand</summary><br/>
327
-
328
- Details for each dataset are provided in individual [Data Cards](https://huggingface.co/spaces/bigscience/BigScienceCorpus).
329
-
330
- Training data includes:
331
-
332
- - 45 natural languages
333
-
334
- - 12 programming languages
335
-
336
- - In 1.5TB of pre-processed text, converted into 350B unique tokens (see [the tokenizer section](#tokenization) for more.)
337
-
338
-
339
- #### **Languages**
340
-
341
- The pie chart shows the distribution of languages in training data.
342
-
343
- ![pie chart showing the distribution of languages in training data](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_chart.svg?raw=true)
344
 
345
-
346
- The following table shows the further distribution of Niger-Congo and Indic languages in the training data.
347
- <details>
348
- <summary>Click to expand</summary><br/>
349
-
350
- | Niger Congo | Percentage | | Indic | Percentage |
351
- |----------------|------------ |------ |-----------|------------|
352
- | Chi Tumbuka | 0.00002 | | Assamese | 0.01 |
353
- | Kikuyu | 0.00004 | | Odia | 0.04 |
354
- | Bambara | 0.00004 | | Gujarati | 0.04 |
355
- | Akan | 0.00007 | | Marathi | 0.05 |
356
- | Xitsonga | 0.00007 | | Punjabi | 0.05 |
357
- | Sesotho | 0.00007 | | Kannada | 0.06 |
358
- | Chi Chewa | 0.0001 | | Nepali | 0.07 |
359
- | Setswana | 0.0002 | | Telugu | 0.09 |
360
- | Northern Sotho | 0.0002 | | Malayalam | 0.10 |
361
- | Fon | 0.0002 | | Urdu | 0.10 |
362
- | Kirundi | 0.0003 | | Tamil | 0.20 |
363
- | Wolof | 0.0004 | | Bengali | 0.50 |
364
- | Kuganda | 0.0004 | | Hindi | 0.70 |
365
- | Chi Shona | 0.001 |
366
- | Isi Zulu | 0.001 |
367
- | Igbo | 0.001 |
368
- | Xhosa | 0.001 |
369
- | Kinyarwanda | 0.003 |
370
- | Yoruba | 0.006 |
371
- | Swahili | 0.02 |
372
  </details>
373
 
374
- The following table shows the distribution of programming languages.
375
- <details>
376
- <summary>Click to expand</summary><br/>
377
-
378
- | Extension | Language | Number of files |
379
- |----------------|------------|-----------------|
380
- | java | Java | 5,407,724 |
381
- | php | PHP | 4,942,186 |
382
- | cpp | C++ | 2,503,930 |
383
- | py | Python | 2,435,072 |
384
- | js | JavaScript | 1,905,518 |
385
- | cs | C# | 1,577,347 |
386
- | rb | Ruby | 6,78,413 |
387
- | cc | C++ | 443,054 |
388
- | hpp | C++ | 391,048 |
389
- | lua | Lua | 352,317 |
390
- | go | GO | 227,763 |
391
- | ts | TypeScript | 195,254 |
392
- | C | C | 134,537 |
393
- | scala | Scala | 92,052 |
394
- | hh | C++ | 67,161 |
395
- | H | C++ | 55,899 |
396
- | tsx | TypeScript | 33,107 |
397
- | rs | Rust | 29,693 |
398
- | phpt | PHP | 9,702 |
399
- | c++ | C++ | 1,342 |
400
- | h++ | C++ | 791 |
401
- | php3 | PHP | 540 |
402
- | phps | PHP | 270 |
403
- | php5 | PHP | 166 |
404
- | php4 | PHP | 29 |
405
-
406
- </details>
407
- </details>
408
- <p>&nbsp;</p>
409
 
410
- ## Risks and Limitations
411
  *This section identifies foreseeable harms and misunderstandings.*
412
-
413
- <details>
414
- <summary>Click to expand</summary><br/>
415
 
 
 
 
416
  Model may:
417
 
418
  - Overrepresent some viewpoints and underrepresent others
@@ -432,18 +433,22 @@ Model may:
432
  - Make errors, including producing incorrect information as if it were factual
433
 
434
  - Generate irrelevant or repetitive outputs
 
435
  </details>
436
- <p>&nbsp;</p>
437
 
438
- ## Evaluation
 
 
439
  *This section describes the evaluation protocols and provides the results.*
440
 
 
441
  <details>
442
- <summary>Click to expand</summary><br/>
443
 
444
- ### Metrics
445
  *This section describes the different ways performance is calculated and why.*
446
-
 
447
  Includes:
448
 
449
  | Metric | Why chosen |
@@ -453,7 +458,7 @@ Includes:
453
 
454
  And multiple different metrics for specific tasks. _(More evaluation metrics forthcoming upon completion of evaluation protocol.)_
455
 
456
- ### Factors
457
  *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
458
 
459
  - Language, such as English or Yoruba
@@ -462,7 +467,7 @@ And multiple different metrics for specific tasks. _(More evaluation metrics for
462
 
463
  - Demographic characteristics, such as gender or nationality
464
 
465
- ### Results
466
  *Results are based on the [Factors](#factors) and [Metrics](#metrics).*
467
 
468
  **Train-time Evaluation:**
@@ -475,18 +480,18 @@ As of 25.May.2022, 15:00 PST:
475
 
476
  - Perplexity: 8.9
477
 
478
- (More evaluation scores forthcoming at the end of model training.)
479
 
480
  </details>
481
- <p>&nbsp;</p>
482
 
483
- ## Recommendations
484
 
485
- *This section provides information on warnings and potential mitigations.*
486
 
 
487
 
488
  <details>
489
- <summary>Click to expand</summary><br/>
490
 
491
  - Indirect users should be made aware when the content they're working with is created by the LLM.
492
 
@@ -497,16 +502,14 @@ As of 25.May.2022, 15:00 PST:
497
  - Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.
498
 
499
  </details>
500
- <p>&nbsp;</p>
501
-
502
- ## Glossary and Calculations
503
-
504
- *This section defines common terms and how metrics are calculated.*
505
 
 
506
 
 
507
 
 
508
  <details>
509
- <summary>Click to expand</summary><br/>
510
 
511
  - <a name="loss">**Loss:**</a> A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.
512
 
@@ -525,18 +528,20 @@ As of 25.May.2022, 15:00 PST:
525
  - <a name="deception">**Deception:**</a> Doing something to intentionally mislead individuals to believe something that is false, such as by creating deadbots or chatbots on social media posing as real people, or generating text documents without making consumers aware that the text is machine generated.
526
 
527
  </details>
528
- <p>&nbsp;</p>
529
 
530
- ## More Information
 
 
 
531
 
532
  <details>
533
- <summary>Click to expand</summary><br/>
534
 
535
- ### Dataset Creation
536
 
537
  Blog post detailing the design choices during the dataset creation: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling
538
 
539
- ### Technical Specifications
540
 
541
  Blog post summarizing how the architecture, size, shape, and pre-training duration were selected: https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours
542
 
@@ -548,18 +553,21 @@ Details on the distributed setup used for the training: https://github.com/bigsc
548
 
549
  Tensorboard updated during the training: https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss
550
 
 
 
551
  Insights on how to approach training, negative results: https://github.com/bigscience-workshop/bigscience/blob/master/train/lessons-learned.md
552
 
553
  Details on the obstacles overcome during the preparation on the engineering side (instabilities, optimization of training throughput, so many technical tricks and questions): https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md
554
 
555
- ### Initial Results
556
 
557
  Initial prompting experiments using interim checkpoints: https://huggingface.co/spaces/bigscience/bloom-book
558
 
559
  </details>
560
- <p>&nbsp;</p>
 
561
 
562
- ## Model Card Authors
563
  *Ordered roughly chronologically and by amount of time spent.*
564
 
565
  Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay
58
  <img src="https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png" alt="BigScience Logo" width="200"/>
59
 
60
 
61
+ Version 1.3 / 3.July.2022 - Checkpoint: **Global step 80100**
62
 
63
+ # Table of Contents
64
  1. [Model Details](#model-details)
65
  2. [Uses](#uses)
66
  3. [Training Data](#training-data)
71
  8. [More Information](#more-information)
72
  9. [Model Card Authors](#model-card-authors)
73
 
74
+ ---
75
+
76
+ # Model Details
77
 
78
+ BLOOM is a type of language model, which is a probability distribution over sequences of words. Specifically, BLOOM is a Large Language Model (LLM), meaning that it is trained on vast amounts of text data using industrial-scale computational resources. As such, the model is able to capture the statistical tendencies of words, phrases, sentences, and larger spans of text that it is exposed to in the training data.
79
+
80
+ ## Basics
81
+ *This section provides information about the model type, version, license, funders, release date, developers, and contact information.*
82
+ *It is useful for anyone who wants to reference the model.*
83
 
84
  <details>
85
+ <summary>Click to expand</summary>
86
+
87
  **Developed by:** BigScience ([website](https://bigscience.huggingface.co))
88
 
89
+ *All collaborators are either volunteers or have an agreement with their employer. (Further breakdown of participants forthcoming.)*
90
 
91
  **Model Type:** Transformer-based Language Model
92
 
112
 
113
  </details>
114
 
115
+ ## Technical Specifications
116
+ *This section includes details about the model objective and architecture, and the compute infrastructure.*
117
+ *It is useful for people interested in model development.*
118
 
119
  <details>
120
+ <summary>Click to expand</summary>
121
 
122
  Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details on replicating training.
123
 
124
+ ### Model Architecture and Objective
125
+
126
+ * Modified from Megatron-LM GPT2 (see [paper](https://arxiv.org/abs/1909.08053), [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):
127
 
128
  * Decoder-only architecture
129
 
141
 
142
  **Objective Function:** Cross Entropy with mean reduction (see [API documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)).
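
As a minimal illustration, the objective above corresponds to PyTorch's standard cross-entropy loss with `reduction="mean"` applied to next-token logits; the shapes and values below are toy placeholders, not the actual training configuration.

```python
import torch
import torch.nn as nn

# Toy next-token prediction example: 2 sequences, 5 positions, vocabulary of 10.
# Training applies the same loss over the model's full vocabulary.
logits = torch.randn(2, 5, 10)           # (batch, sequence, vocab)
targets = torch.randint(0, 10, (2, 5))   # ground-truth next-token ids

loss_fn = nn.CrossEntropyLoss(reduction="mean")
loss = loss_fn(logits.view(-1, 10), targets.view(-1))
print(loss.item())
```
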
143
 
144
+ ### Compute infrastructure
145
+ Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).
146
+
147
+ #### Hardware
148
 
149
+ * 384 A100 80GB GPUs (48 nodes)
150
 
151
+ * Additional 32 A100 80GB GPUs (4 nodes) in reserve
152
 
153
+ * 8 GPUs per node using NVLink 4 inter-gpu connects, 4 OmniPath links
154
 
155
+ * CPU: AMD
156
 
157
+ * CPU memory: 512GB per node
158
 
159
+ * GPU memory: 640GB per node
160
 
161
+ * Inter-node connect: Omni-Path Architecture (OPA)
162
 
163
+ * NCCL-communications network: a fully dedicated subnet
164
 
165
+ * Disc IO network: shared network with other types of nodes
166
 
167
+ #### Software
 
 
168
 
169
+ * Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
170
 
171
+ * DeepSpeed ([Github link](https://github.com/microsoft/DeepSpeed))
172
 
173
+ * PyTorch (pytorch-1.11 w/ CUDA-11.5; see [Github link](https://github.com/pytorch/pytorch))
174
 
175
+ * apex ([Github link](https://github.com/NVIDIA/apex))
176
+
177
+ </details>
178
 
179
+ ---
180
 
181
+ # Training
182
+ *This section provides information about the training data, the speed and size of training elements, and the environmental impact of training.*
183
+ *It is useful for people who want to learn more about the model inputs and training footprint.*
184
 
185
+ <details>
186
+ <summary>Click to expand</summary>
187
 
188
+ ## Training Data
189
+ *This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*
 
 
 
190
 
191
+ Details for each dataset are provided in individual [Data Cards](https://huggingface.co/spaces/bigscience/BigScienceCorpus).
192
 
193
+ Training data includes:
194
 
195
+ - 45 natural languages
196
 
197
+ - 12 programming languages
198
 
199
+ - In 1.5TB of pre-processed text, converted into 350B unique tokens (see [the tokenizer section](#tokenization) for more.)
200
 
201
+ ### Languages
202
+
203
+ The pie chart shows the distribution of languages in training data.
204
+
205
+ ![pie chart showing the distribution of languages in training data](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_chart.svg?raw=true)
206
 
 
207
 
208
+ The following tables show the further distribution of Niger-Congo & Indic languages and programming languages in the training data.
209
+
210
+ Distribution of Niger Congo and Indic languages.
211
+
212
+ | Niger Congo | Percentage | | Indic | Percentage |
213
+ |----------------|------------ |------ |-----------|------------|
214
+ | Chi Tumbuka | 0.00002 | | Assamese | 0.01 |
215
+ | Kikuyu | 0.00004 | | Odia | 0.04 |
216
+ | Bambara | 0.00004 | | Gujarati | 0.04 |
217
+ | Akan | 0.00007 | | Marathi | 0.05 |
218
+ | Xitsonga | 0.00007 | | Punjabi | 0.05 |
219
+ | Sesotho | 0.00007 | | Kannada | 0.06 |
220
+ | Chi Chewa | 0.0001 | | Nepali | 0.07 |
221
+ | Setswana | 0.0002 | | Telugu | 0.09 |
222
+ | Northern Sotho | 0.0002 | | Malayalam | 0.10 |
223
+ | Fon | 0.0002 | | Urdu | 0.10 |
224
+ | Kirundi | 0.0003 | | Tamil | 0.20 |
225
+ | Wolof | 0.0004 | | Bengali | 0.50 |
226
+ | Kuganda | 0.0004 | | Hindi | 0.70 |
227
+ | Chi Shona | 0.001 |
228
+ | Isi Zulu | 0.001 |
229
+ | Igbo | 0.001 |
230
+ | Xhosa | 0.001 |
231
+ | Kinyarwanda | 0.003 |
232
+ | Yoruba | 0.006 |
233
+ | Swahili | 0.02 |
234
+
235
+ Distribution of programming languages.
236
+
237
+ | Extension | Language | Number of files |
238
+ |----------------|------------|-----------------|
239
+ | java | Java | 5,407,724 |
240
+ | php | PHP | 4,942,186 |
241
+ | cpp | C++ | 2,503,930 |
242
+ | py | Python | 2,435,072 |
243
+ | js | JavaScript | 1,905,518 |
244
+ | cs | C# | 1,577,347 |
245
+ | rb | Ruby | 678,413 |
246
+ | cc | C++ | 443,054 |
247
+ | hpp | C++ | 391,048 |
248
+ | lua | Lua | 352,317 |
249
+ | go | Go | 227,763 |
250
+ | ts | TypeScript | 195,254 |
251
+ | C | C | 134,537 |
252
+ | scala | Scala | 92,052 |
253
+ | hh | C++ | 67,161 |
254
+ | H | C++ | 55,899 |
255
+ | tsx | TypeScript | 33,107 |
256
+ | rs | Rust | 29,693 |
257
+ | phpt | PHP | 9,702 |
258
+ | c++ | C++ | 1,342 |
259
+ | h++ | C++ | 791 |
260
+ | php3 | PHP | 540 |
261
+ | phps | PHP | 270 |
262
+ | php5 | PHP | 166 |
263
+ | php4 | PHP | 29 |
264
 
265
+ ### Preprocessing
266
+
267
+ **Tokenization:** The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)), a learned subword tokenizer trained using:
268
 
269
  - A byte-level Byte Pair Encoding (BPE) algorithm
270
 
272
 
273
  - A vocabulary size of 250,680
274
 
275
+ It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.
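
A minimal sketch of inspecting the tokenizer described above, assuming it loads from the linked `bigscience/tokenizer` repository via the `transformers` `AutoTokenizer` API (network access to the Hub required); the sample sentence is arbitrary.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/tokenizer")

print(len(tok))                          # vocabulary size, expected 250,680
enc = tok("BigScience est un atelier de recherche ouvert.")
print(enc["input_ids"])                  # byte-level BPE token ids
print(tok.decode(enc["input_ids"]))      # decodes back to the original text
```
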
276
+
277
+ ## Speeds, Sizes, Times
278
+
279
+ Training logs: [Tensorboard link](https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs/)
280
+
281
+ - Dates:
282
 
283
+ - Started 11th March, 2022 11:42am PST
284
 
285
+ - Estimated end: 5th July, 2022
286
 
287
+ - Checkpoint size:
288
+
289
+ - Bf16 weights: 329GB
290
+
291
+ - Full checkpoint with optimizer states: 2.3TB
292
+
293
+ - Training throughput: About 150 TFLOP per GPU per second
294
+
295
+ - Number of epochs: 1
296
+
297
+ - Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments)
298
+
299
+ - Server training location: Île-de-France, France
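
As a rough sanity check on the figures above, the bf16 checkpoint size and the aggregate cluster throughput can be estimated from the ~176B parameter count implied by the tr11-176B-ml run name (an assumption; the parameter count is not stated in this section):

```python
# Back-of-the-envelope checks; the 176e9 parameter count is assumed from the
# tr11-176B-ml run name referenced elsewhere in this card.
params = 176e9
print(f"bf16 weights ~ {params * 2 / 2**30:.0f} GiB")            # 2 bytes/param, ~328 GiB (vs. ~329GB above)

gpus, per_gpu_tflops = 384, 150                                  # "About 150 TFLOP per GPU per second"
print(f"aggregate ~ {gpus * per_gpu_tflops / 1000:.1f} PFLOPs")  # ~57.6 PFLOPs across the 48 nodes
```
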
300
 
301
+
302
+ ## Environmental Impact
303
 
304
  The training supercomputer, Jean Zay ([website](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html)), uses mostly nuclear energy. The heat generated by it is reused for heating campus housing.
305
 
306
+ **Estimated carbon emissions:** *(Forthcoming.)*
307
 
308
+ **Estimated electricity usage:** *(Forthcoming.)*
 
309
 
310
  </details>
 
311
 
312
+ ---
313
 
314
+ # Uses
 
315
 
316
+ *This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.*
317
+ *It is useful for anyone considering using the model or who is affected by the model.*
318
 
319
  <details>
320
+ <summary>Click to expand</summary>
321
 
322
+ ## Intended Use
323
 
324
  This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.
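
As a concrete illustration of the language-generation use case, the sketch below uses the `transformers` text-generation pipeline; the model id and prompt are placeholders, so substitute the checkpoint you actually intend to load.

```python
from transformers import pipeline

# Placeholder checkpoint id; replace with the BLOOM checkpoint you want to use.
generator = pipeline("text-generation", model="bigscience/bloom")

output = generator(
    "BigScience is a collaborative workshop that",
    max_new_tokens=50,   # length of the continuation
    do_sample=True,      # sample rather than greedy decoding
    top_k=50,
)
print(output[0]["generated_text"])
```
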
325
 
326
+ ### Direct Use
327
 
328
  - Text generation
329
 
331
 
332
  - Examples: Cloze tests, counterfactuals, generations with reframings
333
 
334
+ ### Downstream Use
335
 
336
  - Tasks that leverage language models include: Information Extraction, Question Answering, Summarization
337
 
340
 
341
  See the [BLOOM License](https://huggingface.co/spaces/bigscience/license), Attachment A, for detailed usage restrictions. The below list is non-exhaustive, but lists some easily foreseeable problematic use cases.
342
 
343
+ #### Out-of-scope Uses
344
 
345
  Using the model in [high-stakes](#high-stakes) settings is out of scope for this model.  The model is not designed for [critical decisions](#critical-decisions) nor uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but is not correct.
346
 
352
 
353
  - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct
354
 
355
+ #### Misuse
356
 
357
  Intentionally using the model for harm, violating [human rights](#human-rights), or other kinds of malicious activities, is a misuse of this model. This includes:
358
 
372
 
373
  - Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://huggingface.co/spaces/bigscience/license)
374
 
375
+ ## Intended Users
376
 
377
+ ### Direct Users
378
 
379
  - General Public
380
 
390
 
391
  - Community advocates, including human and civil rights groups
392
 
393
+ ### Indirect Users
394
 
395
  - Users of derivatives created by Direct Users, such as those using software with an [intended use](#intended-use)
396
 
397
  - Users of [Derivatives of the Model, as described in the License](https://huggingface.co/spaces/bigscience/license)
398
 
399
+ ### Others Affected (Parties Prenantes)
400
 
401
  - People and groups referred to by the LLM
402
 
403
  - People and groups exposed to outputs of, or decisions based on, the LLM
404
 
405
  People and groups whose original work is included in the LLM
407
  </details>
408
 
409
+ ---
410
 
411
+ # Risks and Limitations
412
  *This section identifies foreseeable harms and misunderstandings.*
 
 
 
413
 
414
+ <details>
415
+ <summary>Click to expand</summary>
416
+
417
  Model may:
418
 
419
  - Overrepresent some viewpoints and underrepresent others
433
  - Make errors, including producing incorrect information as if it were factual
434
 
435
  - Generate irrelevant or repetitive outputs
436
+
437
  </details>
 
438
 
439
+ ---
440
+
441
+ # Evaluation
442
  *This section describes the evaluation protocols and provides the results.*
443
 
444
+
445
  <details>
446
+ <summary>Click to expand</summary>
447
 
448
+ ## Metrics
449
  *This section describes the different ways performance is calculated and why.*
450
+
451
+
452
  Includes:
453
 
454
  | Metric | Why chosen |
458
 
459
  And multiple different metrics for specific tasks. _(More evaluation metrics forthcoming upon completion of evaluation protocol.)_
460
 
461
+ ## Factors
462
  *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
463
 
464
  - Language, such as English or Yoruba
467
 
468
  - Demographic characteristics, such as gender or nationality
469
 
470
+ ## Results
471
  *Results are based on the [Factors](#factors) and [Metrics](#metrics).*
472
 
473
  **Train-time Evaluation:**
480
 
481
  - Perplexity: 8.9
482
 
483
+ (More evaluation scores forthcoming.)
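
Perplexity here is the exponential of the mean cross-entropy loss (in nats); the loss value in the snippet below is assumed, chosen only to show how a perplexity near 8.9 arises.

```python
import math

loss = 2.19               # assumed mean cross-entropy (nats), for illustration only
print(math.exp(loss))     # ~8.9, the perplexity reported above
```
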
484
 
485
  </details>
 
486
 
487
+ ---
488
 
489
+ # Recommendations
490
 
491
+ *This section provides information on warnings and potential mitigations.*
492
 
493
  <details>
494
+ <summary>Click to expand</summary>
495
 
496
  - Indirect users should be made aware when the content they're working with is created by the LLM.
497
 
502
  - Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.
503
 
504
  </details>
505
 
506
+ ---
507
 
508
+ # Glossary and Calculations
509
 
510
+ *This section defines common terms and how metrics are calculated.*
511
  <details>
512
+ <summary>Click to expand</summary>
513
 
514
  - <a name="loss">**Loss:**</a> A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.
515
 
528
  - <a name="deception">**Deception:**</a> Doing something to intentionally mislead individuals to believe something that is false, such as by creating deadbots or chatbots on social media posing as real people, or generating text documents without making consumers aware that the text is machine generated.
529
 
530
  </details>
 
531
 
532
+ ---
533
+
534
+ # More Information
535
+ *This section provides links to writing on dataset creation, technical specifications, lessons learned, and initial results.*
536
 
537
  <details>
538
+ <summary>Click to expand</summary>
539
 
540
+ ## Dataset Creation
541
 
542
  Blog post detailing the design choices during the dataset creation: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling
543
 
544
+ ## Technical Specifications
545
 
546
  Blog post summarizing how the architecture, size, shape, and pre-training duration were selected: https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours
547
 
553
 
554
  Tensorboard updated during the training: https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss
555
 
556
+ ## Lessons
557
+
558
  Insights on how to approach training, negative results: https://github.com/bigscience-workshop/bigscience/blob/master/train/lessons-learned.md
559
 
560
  Details on the obstacles overcome during the preparation on the engineering side (instabilities, optimization of training throughput, so many technical tricks and questions): https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md
561
 
562
+ ## Initial Results
563
 
564
  Initial prompting experiments using interim checkpoints: https://huggingface.co/spaces/bigscience/bloom-book
565
 
566
  </details>
567
+
568
+ ---
569
 
570
+ # Model Card Authors
571
  *Ordered roughly chronologically and by amount of time spent.*
572
 
573
  Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay