cakiki and meg committed
Commit
472cdd0
1 Parent(s): e9247ea

Updates the BLOOM model card to sync with the updated https://github.com/bigscience-workshop/model_card (#2)


- Updates the BLOOM model card to sync with the updated https://github.com/bigscience-workshop/model_card (bec8ba3575dbd338dea1dc36044e8d54aba46c50)


Co-authored-by: Margaret Mitchell <meg@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +210 -112
README.md CHANGED
@@ -1,11 +1,61 @@
  ---
  license: bigscience-bloom-rail-1.0
  ---

  # <p>BLOOM LM<br/> _BigScience Large Open-source Open-access Multilingual Language Model_ <br/>Model Card</p>
- ![BigScience Logo](https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png)

- Version 1.0 / 23.May.2022

  ## Table of Contents
  1. [Model Details](#model-details)
@@ -15,84 +65,100 @@ Version 1.0 / 23.May.2022
  5. [Evaluation](#evaluation)
  6. [Recommendations](#recommendations)
  7. [Glossary and Calculations](#glossary-and-calculations)
- 8. [Model Card Authors](#model-card-authors)

  ## Model Details

  ### Basics
  *This section provides information for anyone who wants to know about the model.*
  <details>
  <summary>Click to expand</summary> <br/>

- **Developed by:** [BigScience](https://bigscience.huggingface.co)
- * All collaborators are either volunteers or have an agreement with their employer. [Further breakdown of participants forthcoming.]

  **Model Type:** Transformer-based Language Model

  **Version:** 1.0.0

- **Languages:** Multiple; see [training data](#training-data).

- **License:** [RAIL License v1.0](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit#)

- **Released:** [Forthcoming]

- **Send questions to:** bigscience-contact@googlegroups.com

- **Cite as:** [BigScience Workshop](https://bigscience.huggingface.co), BigScience Language Open-source Open-access Multilingual (BLOOM). International, May 2021-May 2022.

- **Funded by:** The French government, [Hugging Face](https://huggingface.co), and the organizations of contributors. [Further breakdown of organizations forthcoming.]

  </details>

  ### Technical Specifications
  *This section provides information for people who work on model development.*
  <details>
  <summary>Click to expand</summary><br/>

- *Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details.*

- **Model Architecture:** Modified from Megatron-LM GPT2 ([paper link](https://arxiv.org/abs/1909.08053)):

- 1. Layer normalization applied to word embedding layer

- 2. [ALiBI positional encodings](https://arxiv.org/pdf/2108.12409.pdf)

- **Objective Function:** [Cross Entropy with mean reduction](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)

- **Number of Parameters:** 176B parameters; 70 layers, 112 attention heads

- #### **Infrastructure**
-
- Compute Infrastructure: [Jean Zay](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html) Public Supercomputer, provided by the French government

- Hardware: 384 A100 80GB GPUs (48 nodes)

- - Additional 32 A100 80GB GPUs (4 nodes) in reserve

- - 8 GPUs per node Using NVLink 4 inter-gpu connects, 4 OmniPath links

- - CPU: AMD

- - CPU memory: 512GB per node

- - GPU memory: 640GB per node

- - Inter-node connect: Omni-Path Architecture (OPA)

- - NCCL-communications network: a fully dedicated subnet

- - Disc IO network: shared network with other types of nodes

- Software:

- - [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed), BigScience fork

- - [DeepSpeed](https://github.com/microsoft/DeepSpeed)

- - [PyTorch](https://github.com/pytorch/pytorch)-1.11 w/ CUDA-11.5

- - [apex](https://github.com/NVIDIA/apex)


  #### **Training**
@@ -100,25 +166,38 @@ Software:

  _In progress._

- Checkpoint size:

- - Bf16 weights: 329GB

- - Full checkpoint with optimizer states: 2.3TB

- Training throughput: About 150 TFLOP per GPU per second

- Number of epochs: 1 (*current target*)

- Dates:
- - Started 11th March, 2022 11:42am PST.
- - Planned end: 5th July, 2022.

- Estimated cost of training: Equivalent of $7-15M

- Server training location: Ile-de-France, France

  </details>

@@ -127,29 +206,26 @@ Server training location: Ile-de-France, France
  <details>
  <summary>Click to expand</summary><br/>

- [More forthcoming when training has completed.]
-
- The training supercomputer, [Jean Zay]((http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html)), uses mostly nuclear energy.
-
- The heat generated by it is reused for heating campus housing.

- * Estimated carbon emissions: [Forthcoming]

- * Estimated electricity usage: [Forthcoming]
- </details>

  <p>&nbsp;</p>

  ## Uses

  *This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
- It provides information for anyone considering using the model, or who is affected by the model.*


  <details>
  <summary>Click to expand</summary><br/>

- ### Intended use

  This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.

@@ -157,35 +233,34 @@ This model is being created in order to enable public research on large language

  - Text generation

- - Exploring characteristics of language generated by a language model.

- - Examples: Cloze tests, counterfactuals, generations with reframings.

  #### **Downstream Use**

- - Tasks that leverage language models include: Information Extraction, Question Answering, Summarization.

  ### Misuse and Out-of-scope Use
-
  *This section addresses what users ought not do with the model.*

- See the [LLM LICENSE](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit), Attachment A, for detailed usage restrictions. The below list is non-exhaustive, but lists some easily foreseeable problematic use cases.

  #### **Out-of-scope Uses**

- Using the model in [high-stakes](#glossary-and-calculations) settings is out of scope for this model. The model is not designed for [critical decisions](#glossary-and-calculations) nor uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but is not correct.

- ##### Out-of-scope uses include:

- - Usage in biomedical domains, political and legal domains, or finance domains.

- - Usage for evaluating or scoring individuals, such as for employment, education, or credit.

- - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct.

  #### **Misuse**

- Intentionally using the model for harm, violating rights, or other kinds of malicious activities is a misuse of this model. This includes:

  - Spam generation

@@ -195,14 +270,13 @@ Intentionally using the model for harm, violating rights, or other kinds of mali

  - Harassment and abuse

- - Deception

  - Unconsented impersonation and imitation

  - Unconsented surveillance

-
- - Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit#heading=h.3blioxkgzsje).

  ### Intended Users

@@ -224,17 +298,18 @@ Intentionally using the model for harm, violating rights, or other kinds of mali

  #### Indirect Users

- - Users of derivatives created by Direct Users, such as those using software with an [intended use](#intended-use).

- - Users of [Derivatives of the Model, as described in the License](https://docs.google.com/document/d/117RhytMYC9HS-1NmWHEn9XBK7vJ5kdv9OcG6AV69Vec/edit#bookmark=id.pvl8781qfes3).

- #### Others Affected (Parties prenantes)

  - People and groups referred to by the LLM

  - People and groups exposed to outputs of, or decisions based on, the LLM

  - People and groups whose original work is included in the LLM
  </details>
  <p>&nbsp;</p>

@@ -242,30 +317,27 @@ Intentionally using the model for harm, violating rights, or other kinds of mali
  *This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*

-
  <details>
  <summary>Click to expand</summary><br/>

- *Details for each dataset are provided in individual [Data Cards](https://huggingface.co/spaces/bigscience/BigScienceCorpus).*

  Training data includes:

- - 45 natural languages.

- - 12 programming languages.

- - In 1.5TB of pre-processed text, converted into 350B unique tokens.

- See the [Model README, Datasets for more](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#datasets).

  #### **Languages**
  The pie chart shows the distribution of languages in training data.

  ![pie chart showing the distribution of languages in training data](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_chart.svg?raw=true)

-
-
  The following table shows the further distribution of Niger-Congo and Indic languages in the training data.
  <details>
  <summary>Click to expand</summary><br/>
@@ -333,8 +405,6 @@ The following table shows the distribution of programming languages.
  ## Risks and Limitations
  *This section identifies foreseeable harms and misunderstandings.*

-
-
  <details>
  <summary>Click to expand</summary><br/>

@@ -344,8 +414,7 @@ Model may:

  - Contain stereotypes

- - Contain personal information
-

  - Generate:

@@ -353,45 +422,45 @@ Model may:

  - Discriminatory or prejudicial language

- - Content that may not be appropriate for all settings, including sexual content.

- - Make errors, including producing incorrect information as if it were factual.

- - Generate irrelevant or repetitive outputs.
  </details>
  <p>&nbsp;</p>

  ## Evaluation
  <details>
  <summary>Click to expand</summary><br/>

  ### Metrics
- *This section describes the different ways performance is calculated, and why.*
-
- [More Forthcoming]
-
  Includes:

  | Metric | Why chosen |
  |--------------------|--------------------------------------------------------------------|
- | F1 | Standard for benchmarking |
- | Accuracy | Standard for benchmarking |
- | Perplexity | Standard metric for quantifying model improvements during training |
- | Cross Entropy Loss | Standard objective for language models |

- And multiple different metrics for specific tasks.

  ### Factors
  *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*

  - Language, such as English or Yoruba
  - Domain, such as newswire or stories
  - Demographic characteristics, such as gender or nationality

  ### Results
  *Results are based on the [Factors](#factors) and [Metrics](#metrics).*

- **Train-time evaluation:**

  As of 19.May.2022, 18:00:

@@ -401,17 +470,16 @@ As of 19.May.2022, 18:00:

  - Perplexity: 9.15

- [More evaluation types forthcoming at the end of model training.]
- </details>

- <BR/>

  ## Recommendations

  *This section provides information on warnings and potential mitigations.*

-
  <details>
  <summary>Click to expand</summary><br/>

@@ -419,7 +487,7 @@ As of 19.May.2022, 18:00:

  - Users should be aware of [Risks and Limitations](#risks-and-limitations), and include an appropriate age disclaimer or blocking interface as necessary.

- - Models pre-trained with the LLM should include an updated Model Card.

  - Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.

@@ -435,28 +503,58 @@ As of 19.May.2022, 18:00:
  <details>
  <summary>Click to expand</summary><br/>

- **Loss:** A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.

- **Perplexity:** This is based on what the model estimates the probability of new data is. The lower the perplexity, the better. If the model is 100% correct at predicting the next token it will see, then the perplexity is 1. Mathematically this is calculated using entropy.

- **High-stakes settings:** Such as those identified as "high-risk AI systems" and "unacceptable risk AI systems" in the European Union's proposed [Artificial Intelligence (AI) Act](https://artificialintelligenceact.eu/annexes/).

- **Critical decisions**: Such as those defined in [the United States' proposed Algorithmic Accountability Act](https://www.congress.gov/117/bills/s3572/BILLS-117s3572is.pdf).

- **Human Rights**: Includes those rights defined in the [Universal Declaration of Human Rights](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf).
-
- **Personal Data and Information**: Personal data and information is defined in multiple data protection regulations, such as "[personal data](https://gdpr-info.eu/issues/personal-data/)" in the [European Union's General Data Protection Regulation](https://gdpr-info.eu); and "personal information" in the Republic of South Africa's [Protection of Personal Information Act](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf), The People's Republic of China's [Personal information protection law](http://en.npc.gov.cn.cdurl.cn/2021-12/29/c_694559.htm).

- **Sensitive Characteristics**: This includes specifically protected categories in human rights (see [UHDR, Article 2](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf)) and personal information regulation (see GDPR, [Article 9; Protection of Personal Information Act, Chapter 1](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf))

- **Deception:** Doing something to intentionally mislead individuals to believe something that is false, such as by creating deadbots or chatbots on social media posing as real people, or generating text documents without making consumers aware that the text is machine generated.

  </details>
  <p>&nbsp;</p>

  ## Model Card Authors
  *Ordered roughly chronologically and by amount of time spent.*

- Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay
-
  ---
  license: bigscience-bloom-rail-1.0
+ language:
+ - ak
+ - ar
+ - as
+ - bm
+ - bn
+ - ca
+ - code
+ - en
+ - es
+ - eu
+ - fon
+ - fr
+ - gu
+ - hi
+ - id
+ - ig
+ - ki
+ - kn
+ - lg
+ - ln
+ - ml
+ - mr
+ - ne
+ - nso
+ - ny
+ - or
+ - pa
+ - pt
+ - rn
+ - rw
+ - sn
+ - st
+ - sw
+ - ta
+ - te
+ - tn
+ - ts
+ - tum
+ - tw
+ - ur
+ - vi
+ - wo
+ - xh
+ - yo
+ - zh
+ - zhs
+ - zht
+ - zu
  ---

  # <p>BLOOM LM<br/> _BigScience Large Open-source Open-access Multilingual Language Model_ <br/>Model Card</p>
+ <img src="https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png" alt="BigScience Logo" width="200"/>

+
+ Version 1.0 / 25.May.2022

  ## Table of Contents
  1. [Model Details](#model-details)
  5. [Evaluation](#evaluation)
  6. [Recommendations](#recommendations)
  7. [Glossary and Calculations](#glossary-and-calculations)
+ 8. [More Information](#more-information)
+ 9. [Model Card Authors](#model-card-authors)

  ## Model Details

  ### Basics
  *This section provides information for anyone who wants to know about the model.*
+
  <details>
  <summary>Click to expand</summary> <br/>

+ **Developed by:** BigScience ([website](https://bigscience.huggingface.co))

+ * All collaborators are either volunteers or have an agreement with their employer. *(Further breakdown of participants forthcoming.)*
+
  **Model Type:** Transformer-based Language Model

  **Version:** 1.0.0

+ **Languages:** Multiple; see [training data](#training-data)

+ **License:** RAIL License v1.0 ([link](https://huggingface.co/spaces/bigscience/license))

+ **Release Date Estimate:** Monday, 11.July.2022

+ **Send Questions to:** bigscience-contact@googlegroups.com

+ **Cite as:** BigScience, _BigScience Language Open-source Open-access Multilingual (BLOOM) Language Model_. International, May 2021-May 2022
+
+ **Funded by:**

+ * The French government.
+
+ * Hugging Face ([website](https://huggingface.co)).
+
+ * Organizations of contributors. *(Further breakdown of organizations forthcoming.)*

  </details>

  ### Technical Specifications
  *This section provides information for people who work on model development.*
+
  <details>
  <summary>Click to expand</summary><br/>

+ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details on replicating training.

+ **Model Architecture:** Modified from Megatron-LM GPT2 (see [paper](https://arxiv.org/abs/1909.08053), [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):

+ * Decoder-only architecture

+ * Layer normalization applied to word embeddings layer (`StableEmbedding`; see [code](https://github.com/facebookresearch/bitsandbytes), [paper](https://arxiv.org/pdf/2110.02861.pdf))

+ * ALiBI positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GeLU activation functions

+ * 176 billion parameters:

+   * 70 layers, 112 attention heads

+   * Hidden layers are 14336-dimensional

+   * Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))
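
The hyperparameters above are enough for a rough sanity check of the headline parameter count. A back-of-the-envelope sketch, assuming tied input/output embeddings and a standard 4×-hidden MLP per layer (not the exact BLOOM implementation):

```python
# Rough parameter count from the figures quoted above (illustrative only).
vocab_size = 250_680   # BLOOM tokenizer vocabulary size
hidden = 14_336        # hidden dimension
layers = 70            # transformer layers

embeddings = vocab_size * hidden            # ~3.6B (tied in/out embeddings assumed)
per_layer = 4 * hidden**2 + 8 * hidden**2   # attention (Q, K, V, O) + 4x-wide MLP
total = embeddings + layers * per_layer     # biases and LayerNorms ignored

print(f"~{total / 1e9:.0f}B parameters")    # ~176B, matching the count above
```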

+ **Objective Function:** Cross Entropy with mean reduction (see [API documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)).
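
To make the objective concrete, here is a minimal PyTorch sketch of cross entropy with mean reduction applied to shifted next-token targets (toy tensors; not BLOOM's actual training loop):

```python
import torch
import torch.nn.functional as F

vocab_size = 250_680                 # BLOOM tokenizer vocabulary size
batch, seq_len = 2, 8                # toy dimensions
logits = torch.randn(batch, seq_len, vocab_size)      # stand-in for model outputs
tokens = torch.randint(vocab_size, (batch, seq_len))  # stand-in for input ids

# Predict token t+1 from positions <= t, averaging the per-token loss.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
    reduction="mean",
)
print(loss.item())
```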
+
+ **Compute infrastructure:** Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).

+ * Hardware: 384 A100 80GB GPUs (48 nodes):
+
+   * Additional 32 A100 80GB GPUs (4 nodes) in reserve
+
+   * 8 GPUs per node, using NVLink 4 inter-gpu connects, 4 OmniPath links

+   * CPU: AMD

+   * CPU memory: 512GB per node

+   * GPU memory: 640GB per node

+   * Inter-node connect: Omni-Path Architecture (OPA)

+   * NCCL-communications network: a fully dedicated subnet

+   * Disc IO network: shared network with other types of nodes

+ * Software:
+
+   * Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))

+   * DeepSpeed ([Github link](https://github.com/microsoft/DeepSpeed))

+   * PyTorch (pytorch-1.11 w/ CUDA-11.5; see [Github link](https://github.com/pytorch/pytorch))

+   * apex ([Github link](https://github.com/NVIDIA/apex))


  #### **Training**

  _In progress._

+ - Checkpoint size:
+
+   - Bf16 weights: 329GB
+
+   - Full checkpoint with optimizer states: 2.3TB

+ - Training throughput: About 150 TFLOP per GPU per second

+ - Number of epochs: 1 (*current target*)

+ - Dates:
+
+   - Started 11th March, 2022 11:42am PST
+
+   - Estimated end: 5th July, 2022

+ - Estimated cost of training: Equivalent of $7-15M

+ - Server training location: Île-de-France, France
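
The bf16 checkpoint size above is consistent with the parameter count: roughly 176 billion parameters at two bytes each. A small illustrative check (editorial arithmetic under that assumption, not a figure from the training team):

```python
params = 176e9                            # headline parameter count
bf16_bytes = params * 2                   # bfloat16 stores each weight in 2 bytes
print(f"~{bf16_bytes / 2**30:.0f} GiB")   # ~328 GiB, in line with the ~329GB quoted above
```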

+ #### **Tokenization**
+
+ The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)) is a learned subword tokenizer trained using:
+
+ - A byte-level Byte Pair Encoding (BPE) algorithm

+ - A simple pre-tokenization rule, no normalization

+ - A vocabulary size of 250,680
+
+ It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.
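
A minimal usage sketch for the tokenizer linked above, assuming it loads through the `transformers` `AutoTokenizer` API (an assumption made for illustration, not an official snippet from this card):

```python
from transformers import AutoTokenizer

# Hub repo referenced above; loading it via AutoTokenizer is assumed here.
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

ids = tokenizer("BLOOM is a multilingual language model.")["input_ids"]
print(len(ids), tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.vocab_size)   # expected to be in the region of the 250,680 quoted above
```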
+
  </details>


  <details>
  <summary>Click to expand</summary><br/>

+ The training supercomputer, Jean Zay ([website](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html)), uses mostly nuclear energy. The heat generated by it is reused for heating campus housing.

+ **Estimated carbon emissions:** *(Forthcoming upon completion of training.)*
+
+ **Estimated electricity usage:** *(Forthcoming upon completion of training.)*

+ </details>
  <p>&nbsp;</p>

  ## Uses

  *This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
+ It provides information for anyone considering using the model or who is affected by the model.*


  <details>
  <summary>Click to expand</summary><br/>

+ ### Intended Use

  This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.


  - Text generation

+ - Exploring characteristics of language generated by a language model

+ - Examples: Cloze tests, counterfactuals, generations with reframings
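
For context, a minimal text-generation sketch with the `transformers` `pipeline` API. The full 176B checkpoint requires multi-GPU serving, so a smaller sibling checkpoint is used here (`bigscience/bloom-560m` is an assumption for illustration, not a checkpoint named in this card):

```python
from transformers import pipeline

# Any causal LM checkpoint works here; the 560m BLOOM variant is assumed for illustration.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

out = generator("A cloze test asks the reader to", max_new_tokens=20)
print(out[0]["generated_text"])
```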

  #### **Downstream Use**

+ - Tasks that leverage language models include: Information Extraction, Question Answering, Summarization

  ### Misuse and Out-of-scope Use
  *This section addresses what users ought not do with the model.*

+ See the [BLOOM License](https://huggingface.co/spaces/bigscience/license), Attachment A, for detailed usage restrictions. The below list is non-exhaustive, but lists some easily foreseeable problematic use cases.

  #### **Out-of-scope Uses**

+ Using the model in [high-stakes](#high-stakes) settings is out of scope for this model. The model is not designed for [critical decisions](#critical-decisions) nor uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but is not correct.

+ ##### Out-of-scope Uses Include:

+ - Usage in biomedical domains, political and legal domains, or finance domains

+ - Usage for evaluating or scoring individuals, such as for employment, education, or credit

+ - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct

  #### **Misuse**

+ Intentionally using the model for harm, violating [human rights](#human-rights), or other kinds of malicious activities, is a misuse of this model. This includes:

  - Spam generation

  - Harassment and abuse

+ - [Deception](#deception)

  - Unconsented impersonation and imitation

  - Unconsented surveillance

+ - Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://huggingface.co/spaces/bigscience/license)

  ### Intended Users

  #### Indirect Users

+ - Users of derivatives created by Direct Users, such as those using software with an [intended use](#intended-use)

+ - Users of [Derivatives of the Model, as described in the License](https://huggingface.co/spaces/bigscience/license)

+ #### Others Affected (Parties Prenantes)

  - People and groups referred to by the LLM

  - People and groups exposed to outputs of, or decisions based on, the LLM

  - People and groups whose original work is included in the LLM
+
  </details>
  <p>&nbsp;</p>

  *This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*


  <details>
  <summary>Click to expand</summary><br/>

+ Details for each dataset are provided in individual [Data Cards](https://huggingface.co/spaces/bigscience/BigScienceCorpus).

  Training data includes:

+ - 45 natural languages

+ - 12 programming languages

+ - In 1.5TB of pre-processed text, converted into 350B unique tokens (see [the tokenizer section](#tokenization) for more)
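
The two figures above imply an average compression ratio, shown here as purely illustrative arithmetic (not an official corpus statistic):

```python
bytes_of_text = 1.5e12   # ~1.5TB of pre-processed text
tokens = 350e9           # ~350B tokens
print(f"~{bytes_of_text / tokens:.1f} bytes of text per token")   # ~4.3
```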


  #### **Languages**
+
  The pie chart shows the distribution of languages in training data.

  ![pie chart showing the distribution of languages in training data](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_chart.svg?raw=true)


  The following table shows the further distribution of Niger-Congo and Indic languages in the training data.
  <details>
  <summary>Click to expand</summary><br/>
  ## Risks and Limitations
  *This section identifies foreseeable harms and misunderstandings.*

  <details>
  <summary>Click to expand</summary><br/>

  - Contain stereotypes

+ - Contain [personal information](#personal-data-and-information)

  - Generate:

  - Discriminatory or prejudicial language

+ - Content that may not be appropriate for all settings, including sexual content

+ - Make errors, including producing incorrect information as if it were factual

+ - Generate irrelevant or repetitive outputs
  </details>
  <p>&nbsp;</p>

  ## Evaluation
+ *This section describes the evaluation protocols and provides the results.*
+
  <details>
  <summary>Click to expand</summary><br/>

  ### Metrics
+ *This section describes the different ways performance is calculated and why.*
+
  Includes:

  | Metric | Why chosen |
  |--------------------|--------------------------------------------------------------------|
+ | [Perplexity](#perplexity) | Standard metric for quantifying model improvements during training |
+ | Cross Entropy [Loss](#loss) | Standard objective for language models |

+ And multiple different metrics for specific tasks. _(More evaluation metrics forthcoming upon completion of evaluation protocol.)_

  ### Factors
  *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*

  - Language, such as English or Yoruba
+
  - Domain, such as newswire or stories
+
  - Demographic characteristics, such as gender or nationality

  ### Results
  *Results are based on the [Factors](#factors) and [Metrics](#metrics).*

+ **Train-time Evaluation:**

  As of 19.May.2022, 18:00:

  - Perplexity: 9.15

+ (More evaluation scores forthcoming at the end of model training.)
474
 
475
+ </details>
476
+ <p>&nbsp;</p>
477
 
478
  ## Recommendations
479
 
480
  *This section provides information on warnings and potential mitigations.*
481
 
482
 
 
483
  <details>
484
  <summary>Click to expand</summary><br/>
485
 
487
 
488
  - Users should be aware of [Risks and Limitations](#risks-and-limitations), and include an appropriate age disclaimer or blocking interface as necessary.
489
 
490
+ - Models pretrained with the LLM should include an updated Model Card.
491
 
492
  - Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.
493
 
503
  <details>
504
  <summary>Click to expand</summary><br/>
505
 
506
+ - <a name="loss">**Loss:**</a> A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.
507
 
508
+ - <a name="perplexity">**Perplexity:**</a> This is based on what the model estimates the probability of new data is. The lower the perplexity, the better. If the model is 100% correct at predicting the next token it will see, then the perplexity is 1. Mathematically this is calculated using entropy.
509
 
510
+ - <a name="high-stakes">**High-stakes settings:**</a> Such as those identified as "high-risk AI systems" and "unacceptable risk AI systems" in the European Union's proposed [Artificial Intelligence (AI) Act](https://artificialintelligenceact.eu/annexes/).
511
 
512
+ - <a name="critical-decisions">**Critical decisions:**</a> Such as those defined in [the United States' proposed Algorithmic Accountability Act](https://www.congress.gov/117/bills/s3572/BILLS-117s3572is.pdf).
513
 
514
+ - <a name="human-rights">**Human rights:**</a> Includes those rights defined in the [Universal Declaration of Human Rights](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf).
515
 
516
+ - <a name="personal-data-and-information">**Personal Data and Personal Information:**</a> Personal data and information is defined in multiple data protection regulations, such as "[personal data](https://gdpr-info.eu/issues/personal-data/)" in the [European Union's General Data Protection Regulation](https://gdpr-info.eu); and "personal information" in the Republic of South Africa's [Protection of Personal Information Act](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf), The People's Republic of China's [Personal information protection law](http://en.npc.gov.cn.cdurl.cn/2021-12/29/c_694559.htm).
 
 
517
 
518
+ - <a name="sensitive-characteristics">**Sensitive characteristics:**</a> This includes specifically protected categories in human rights (see [UHDR, Article 2](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf)) and personal information regulation (see GDPR, [Article 9; Protection of Personal Information Act, Chapter 1](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf))
519
 
520
+ - <a name="deception">**Deception:**</a> Doing something to intentionally mislead individuals to believe something that is false, such as by creating deadbots or chatbots on social media posing as real people, or generating text documents without making consumers aware that the text is machine generated.
521
 
522
  </details>
523
  <p>&nbsp;</p>
524
 
525
+ ## More Information
526
+
527
+ <details>
528
+ <summary>Click to expand</summary><br/>
529
+
530
+ ### Dataset Creation
531
+
532
+ Blog post detailing the design choices during the dataset creation: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling
533
+
534
+ ### Technical Specifications
535
+
536
+ Blog post summarizing how the architecture, size, shape, and pre-training duration where selected: https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours
537
+
538
+ More details on the architecture/optimizer: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml
539
+
540
+ Blog post on the hardware/engineering side: https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model
541
+
542
+ Details on the distributed setup used for the training: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml
543
+
544
+ Tensorboard updated during the training: https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss
545
+
546
+ Insights on how to approach training, negative results: https://github.com/bigscience-workshop/bigscience/blob/master/train/lessons-learned.md
547
+
548
+ Details on the obstacles overcome during the preparation on the engineering side (instabilities, optimization of training throughput, so many technical tricks and questions): https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md
549
+
550
+ ### Initial Results
551
+
552
+ Initial prompting experiments using interim checkpoints: https://huggingface.co/spaces/bigscience/bloom-book
553
+
554
+ </details>
555
+ <p>&nbsp;</p>
556
+
557
  ## Model Card Authors
558
  *Ordered roughly chronologically and by amount of time spent.*
559
 
560
+ Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Christopher Akiki, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay