meg (HF staff) committed
Commit 5c72534
1 Parent(s): e4ae036

Copy+Paste from updated Model Card, at https://huggingface.co/bigscience/bloom

Files changed (1)
  1. README.md +221 -111
README.md CHANGED
@@ -1,11 +1,61 @@
1
  ---
2
  license: bigscience-bloom-rail-1.0
3
  ---
4
 
5
- # <p>BLOOM LM<br/> _BigScience Large Open-source Open-access Multilingual Language Model_ <br/>Model Card</p>
6
- ![BigScience Logo](https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png)
7
 
8
- Version 1.0 / 24.May.2022
 
9
 
10
  ## Table of Contents
11
  1. [Model Details](#model-details)
@@ -15,84 +65,100 @@ Version 1.0 / 24.May.2022
15
  5. [Evaluation](#evaluation)
16
  6. [Recommendations](#recommendations)
17
  7. [Glossary and Calculations](#glossary-and-calculations)
18
- 8. [Model Card Authors](#model-card-authors)
 
19
 
20
  ## Model Details
21
 
22
  ### Basics
23
  *This section provides information for anyone who wants to know about the model.*
 
24
  <details>
25
  <summary>Click to expand</summary> <br/>
26
 
27
- **Developed by:** [BigScience](https://bigscience.huggingface.co)
28
- * All collaborators are either volunteers or have an agreement with their employer. [Further breakdown of participants forthcoming.]
29
 
 
 
30
  **Model Type:** Transformer-based Language Model
31
 
32
  **Version:** 1.0.0
33
 
34
- **Languages:** Multiple; see [training data](#training-data).
35
 
36
- **License:** [RAIL License v1.0](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit#)
37
 
38
- **Released:** [Forthcoming]
39
 
40
- **Send questions to:** bigscience-contact@googlegroups.com
41
 
42
- **Cite as:** [BigScience Workshop](https://bigscience.huggingface.co), BigScience Language Open-source Open-access Multilingual (BLOOM). International, May 2021-May 2022.
 
 
43
 
44
- **Funded by:** The French government, [Hugging Face](https://huggingface.co), and the organizations of contributors. [Further breakdown of organizations forthcoming.]
 
 
 
 
45
 
46
  </details>
47
 
48
  ### Technical Specifications
49
  *This section provides information for people who work on model development.*
 
50
  <details>
51
  <summary>Click to expand</summary><br/>
52
 
53
- *Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details.*
54
 
55
- **Model Architecture:** Modified from Megatron-LM GPT2 ([paper link](https://arxiv.org/abs/1909.08053)):
56
 
57
- 1. Layer normalization applied to word embedding layer
58
 
59
- 2. [ALiBI positional encodings](https://arxiv.org/pdf/2108.12409.pdf)
60
 
61
- **Objective Function:** [Cross Entropy with mean reduction](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)
62
 
63
- **Number of Parameters:** 1B3 parameters; 24 layers, 16 attention heads
64
 
65
- #### **Infrastructure**
66
-
67
- Compute Infrastructure: [Jean Zay](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html) Public Supercomputer, provided by the French government
68
 
69
- Hardware: 384 A100 80GB GPUs (48 nodes)
70
 
71
- - Additional 32 A100 80GB GPUs (4 nodes) in reserve
 
 
72
 
73
- - 8 GPUs per node Using NVLink 4 inter-gpu connects, 4 OmniPath links
 
 
74
 
75
- - CPU: AMD
76
 
77
- - CPU memory: 512GB per node
78
 
79
- - GPU memory: 640GB per node
80
 
81
- - Inter-node connect: Omni-Path Architecture (OPA)
82
 
83
- - NCCL-communications network: a fully dedicated subnet
84
 
85
- - Disc IO network: shared network with other types of nodes
86
 
87
- Software:
88
 
89
- - [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed), BigScience fork
 
 
90
 
91
- - [DeepSpeed](https://github.com/microsoft/DeepSpeed)
92
 
93
- - [PyTorch](https://github.com/pytorch/pytorch)-1.11 w/ CUDA-11.5
94
 
95
- - [apex](https://github.com/NVIDIA/apex)
96
 
97
 
98
  #### **Training**
@@ -100,24 +166,40 @@ Software:
100
 
101
  _In progress._
102
 
103
- Checkpoint size:
104
 
105
- - fp16 weights: 1.98GB
 
 
 
 
106
 
 
107
 
108
- Training throughput: About 150 TFLOP per GPU per second
109
 
110
- Number of steps: 340500
 
 
 
 
111
 
112
- Dates:
113
- - Started: to determine
114
- - Ended: to determine
115
 
 
116
 
117
- Estimated cost of training: Unknown
 
 
 
 
118
 
119
- Server training location: Ile-de-France, France
120
 
 
 
 
 
121
  </details>
122
 
123
 
@@ -126,29 +208,26 @@ Server training location: Ile-de-France, France
126
  <details>
127
  <summary>Click to expand</summary><br/>
128
 
129
- [More forthcoming when training has completed.]
130
-
131
- The training supercomputer, [Jean Zay]((http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html)), uses mostly nuclear energy.
132
-
133
- The heat generated by it is reused for heating campus housing.
134
 
135
- * Estimated carbon emissions: [Forthcoming]
136
 
137
- * Estimated electricity usage: [Forthcoming]
138
- </details>
139
 
 
140
  <p>&nbsp;</p>
141
 
142
  ## Uses
143
 
144
  *This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
145
- It provides information for anyone considering using the model, or who is affected by the model.*
146
 
147
 
148
  <details>
149
  <summary>Click to expand</summary><br/>
150
 
151
- ### Intended use
152
 
153
  This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.
154
 
@@ -156,35 +235,34 @@ This model is being created in order to enable public research on large language
156
 
157
  - Text generation
158
 
159
- - Exploring characteristics of language generated by a language model.
160
 
161
- - Examples: Cloze tests, counterfactuals, generations with reframings.
162
 
163
  #### **Downstream Use**
164
 
165
- - Tasks that leverage language models include: Information Extraction, Question Answering, Summarization.
166
 
167
  ### Misuse and Out-of-scope Use
168
-
169
  *This section addresses what users ought not do with the model.*
170
 
171
- See the [LLM LICENSE ](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit), Attachment A, for detailed usage restrictions. The below list is non-exhaustive, but lists some easily foreseeable problematic use cases.
172
 
173
  #### **Out-of-scope Uses**
174
 
175
- Using the model in [high-stakes](#glossary-and-calculations) settings is out of scope for this model. The model is not designed for [critical decisions](#glossary-and-calculations) nor uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but is not correct.
176
 
177
- ##### Out-of-scope uses include:
178
 
179
- - Usage in biomedical domains, political and legal domains, or finance domains.
180
 
181
- - Usage for evaluating or scoring individuals, such as for employment, education, or credit.
182
 
183
- - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct.
184
 
185
  #### **Misuse**
186
 
187
- Intentionally using the model for harm, violating rights, or other kinds of malicious activities is a misuse of this model. This includes:
188
 
189
  - Spam generation
190
 
@@ -194,14 +272,13 @@ Intentionally using the model for harm, violating rights, or other kinds of mali
194
 
195
  - Harassment and abuse
196
 
197
- - Deception
198
 
199
  - Unconsented impersonation and imitation
200
 
201
  - Unconsented surveillance
202
 
203
-
204
- - Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit#heading=h.3blioxkgzsje).
205
 
206
  ### Intended Users
207
 
@@ -223,17 +300,18 @@ Intentionally using the model for harm, violating rights, or other kinds of mali
223
 
224
  #### Indirect Users
225
 
226
- - Users of derivatives created by Direct Users, such as those using software with an [intended use](#intended-use).
227
 
228
- - Users of [Derivatives of the Model, as described in the License](https://docs.google.com/document/d/117RhytMYC9HS-1NmWHEn9XBK7vJ5kdv9OcG6AV69Vec/edit#bookmark=id.pvl8781qfes3).
229
 
230
- #### Others Affected (Parties prenantes)
231
 
232
  - People and groups referred to by the LLM
233
 
234
  - People and groups exposed to outputs of, or decisions based on, the LLM
235
 
236
  - People and groups whose original work is included in the LLM
 
237
  </details>
238
  <p>&nbsp;</p>
239
 
@@ -241,30 +319,27 @@ Intentionally using the model for harm, violating rights, or other kinds of mali
241
  *This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*
242
 
243
 
244
-
245
  <details>
246
  <summary>Click to expand</summary><br/>
247
 
248
- *Details for each dataset are provided in individual [Data Cards](https://huggingface.co/spaces/bigscience/BigScienceCorpus).*
249
 
250
  Training data includes:
251
 
252
- - 45 natural languages.
253
 
254
- - 12 programming languages.
255
 
256
- - In 1.5TB of pre-processed text, converted into 350B unique tokens.
257
 
258
- See the [Model README, Datasets for more](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#datasets).
259
 
260
  #### **Languages**
 
261
  The pie chart shows the distribution of languages in training data.
262
 
263
  ![pie chart showing the distribution of languages in training data](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_chart.svg?raw=true)
264
 
265
 
266
-
267
-
268
  The following table shows the further distribution of Niger-Congo and Indic languages in the training data.
269
  <details>
270
  <summary>Click to expand</summary><br/>
@@ -332,8 +407,6 @@ The following table shows the distribution of programming languages.
332
  ## Risks and Limitations
333
  *This section identifies foreseeable harms and misunderstandings.*
334
 
335
-
336
-
337
  <details>
338
  <summary>Click to expand</summary><br/>
339
 
@@ -343,8 +416,7 @@ Model may:
343
 
344
  - Contain stereotypes
345
 
346
- - Contain personal information
347
-
348
 
349
  - Generate:
350
 
@@ -352,57 +424,64 @@ Model may:
352
 
353
  - Discriminatory or prejudicial language
354
 
355
- - Content that may not be appropriate for all settings, including sexual content.
356
 
357
- - Make errors, including producing incorrect information as if it were factual.
358
 
359
- - Generate irrelevant or repetitive outputs.
360
  </details>
361
  <p>&nbsp;</p>
362
 
363
  ## Evaluation
 
 
364
  <details>
365
  <summary>Click to expand</summary><br/>
366
 
367
  ### Metrics
368
- *This section describes the different ways performance is calculated, and why.*
369
-
370
- [More Forthcoming]
371
-
372
  Includes:
373
 
374
  | Metric | Why chosen |
375
  |--------------------|--------------------------------------------------------------------|
376
- | F1 | Standard for benchmarking |
377
- | Accuracy | Standard for benchmarking |
378
- | Perplexity | Standard metric for quantifying model improvements during training |
379
- | Cross Entropy Loss | Standard objective for language models |
380
 
381
- And multiple different metrics for specific tasks.
382
 
383
  ### Factors
384
  *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
385
 
386
  - Language, such as English or Yoruba
 
387
  - Domain, such as newswire or stories
 
388
  - Demographic characteristics, such as gender or nationality
389
 
390
  ### Results
391
  *Results are based on the [Factors](#factors) and [Metrics](#metrics).*
392
 
393
- **Train-time evaluation:**
394
 
395
- [More evaluation types forthcoming at the end of model training.]
396
- </details>
 
 
 
397
 
398
- <BR/>
 
 
 
 
 
399
 
400
  ## Recommendations
401
 
402
  *This section provides information on warnings and potential mitigations.*
403
 
404
 
405
-
406
  <details>
407
  <summary>Click to expand</summary><br/>
408
 
@@ -410,7 +489,7 @@ And multiple different metrics for specific tasks.
410
 
411
  - Users should be aware of [Risks and Limitations](#risks-and-limitations), and include an appropriate age disclaimer or blocking interface as necessary.
412
 
413
- - Models pre-trained with the LLM should include an updated Model Card.
414
 
415
  - Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.
416
 
@@ -426,27 +505,58 @@ And multiple different metrics for specific tasks.
426
  <details>
427
  <summary>Click to expand</summary><br/>
428
 
429
- - **Loss:** A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.
430
 
 
431
 
432
- - **Perplexity:** This is based on what the model estimates the probability of new data is. The lower the perplexity, the better. If the model is 100% correct at predicting the next token it will see, then the perplexity is 1. Mathematically this is calculated using entropy.
433
 
434
- - **High-stakes settings:** Such as those identified as "high-risk AI systems" and "unacceptable risk AI systems" in the European Union's proposed [Artificial Intelligence (AI) Act](https://artificialintelligenceact.eu/annexes/).
435
 
436
- - **Critical decisions**: Such as those defined in [the United States' proposed Algorithmic Accountability Act](https://www.congress.gov/117/bills/s3572/BILLS-117s3572is.pdf).
437
 
438
- - **Human Rights**: Includes those rights defined in the [Universal Declaration of Human Rights](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf).
439
-
440
- - **Personal Data and Information**: Personal data and information is defined in multiple data protection regulations, such as "[personal data](https://gdpr-info.eu/issues/personal-data/)" in the [European Union's General Data Protection Regulation](https://gdpr-info.eu); and "personal information" in the Republic of South Africa's [Protection of Personal Information Act](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf), The People's Republic of China's [Personal information protection law](http://en.npc.gov.cn.cdurl.cn/2021-12/29/c_694559.htm).
441
 
442
- - **Sensitive Characteristics**: This includes specifically protected categories in human rights (see [UHDR, Article 2](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf)) and personal information regulation (see GDPR, [Article 9; Protection of Personal Information Act, Chapter 1](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf))
443
 
444
- - **Deception:** Doing something to intentionally mislead individuals to believe something that is false, such as by creating deadbots or chatbots on social media posing as real people, or generating text documents without making consumers aware that the text is machine generated.
445
 
446
  </details>
447
  <p>&nbsp;</p>
448
 
449
  ## Model Card Authors
450
  *Ordered roughly chronologically and by amount of time spent.*
451
 
452
- Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay
1
  ---
2
  license: bigscience-bloom-rail-1.0
3
+ language:
4
+ - ak
5
+ - ar
6
+ - as
7
+ - bm
8
+ - bn
9
+ - ca
10
+ - code
11
+ - en
12
+ - es
13
+ - eu
14
+ - fon
15
+ - fr
16
+ - gu
17
+ - hi
18
+ - id
19
+ - ig
20
+ - ki
21
+ - kn
22
+ - lg
23
+ - ln
24
+ - ml
25
+ - mr
26
+ - ne
27
+ - nso
28
+ - ny
29
+ - or
30
+ - pa
31
+ - pt
32
+ - rn
33
+ - rw
34
+ - sn
35
+ - st
36
+ - sw
37
+ - ta
38
+ - te
39
+ - tn
40
+ - ts
41
+ - tum
42
+ - tw
43
+ - ur
44
+ - vi
45
+ - wo
46
+ - xh
47
+ - yo
48
+ - zh
49
+ - zhs
50
+ - zht
51
+ - zu
52
  ---
53
 
54
+ # <p>BLOOM LM<br/> _BigScience Large Open-science Open-access Multilingual Language Model_ <br/>Model Card</p>
55
+ <img src="https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png" alt="BigScience Logo" width="200"/>
56
 
57
+
58
+ Version 1.0 / 26.May.2022
59
 
60
  ## Table of Contents
61
  1. [Model Details](#model-details)
65
  5. [Evaluation](#evaluation)
66
  6. [Recommendations](#recommendations)
67
  7. [Glossary and Calculations](#glossary-and-calculations)
68
+ 8. [More Information](#more-information)
69
+ 9. [Model Card Authors](#model-card-authors)
70
 
71
  ## Model Details
72
 
73
  ### Basics
74
  *This section provides information for anyone who wants to know about the model.*
75
+
76
  <details>
77
  <summary>Click to expand</summary> <br/>
78
 
79
+ **Developed by:** BigScience ([website](https://bigscience.huggingface.co))
 
80
 
81
+ * All collaborators are either volunteers or have an agreement with their employer. *(Further breakdown of participants forthcoming.)*
82
+
83
  **Model Type:** Transformer-based Language Model
84
 
85
  **Version:** 1.0.0
86
 
87
+ **Languages:** Multiple; see [training data](#training-data)
88
 
89
+ **License:** RAIL License v1.0 ([link](https://huggingface.co/spaces/bigscience/license))
90
 
91
+ **Release Date Estimate:** Monday, 11.July.2022
92
 
93
+ **Send Questions to:** bigscience-contact@googlegroups.com
94
 
95
+ **Cite as:** BigScience, _BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model_. International, May 2021-May 2022
96
+
97
+ **Funded by:**
98
 
99
+ * The French government.
100
+
101
+ * Hugging Face ([website](https://huggingface.co)).
102
+
103
+ * Organizations of contributors. *(Further breakdown of organizations forthcoming.)*
104
 
105
  </details>
106
 
107
  ### Technical Specifications
108
  *This section provides information for people who work on model development.*
109
+
110
  <details>
111
  <summary>Click to expand</summary><br/>
112
 
113
+ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details on replicating training.
114
 
115
+ **Model Architecture:** Modified from Megatron-LM GPT2 (see [paper](https://arxiv.org/abs/1909.08053), [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):
116
 
117
+ * Decoder-only architecture
118
 
119
+ * Layer normalization applied to word embeddings layer (`StableEmbedding`; see [code](https://github.com/facebookresearch/bitsandbytes), [paper](https://arxiv.org/pdf/2110.02861.pdf))
120
 
121
+ * ALiBI positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GeLU activation functions
122
 
123
+ * 176 billion parameters:
124
 
125
+ * 70 layers, 112 attention heads
126
+
127
+ * Hidden layers are 14336-dimensional
128
 
129
+ * Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))
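As a rough back-of-the-envelope check, the stated parameter count follows from the layer count, hidden size, and tokenizer vocabulary alone, using the common 12·L·h² decoder approximation plus a tied embedding matrix (biases and layer-norm parameters are ignored, so the exact total differs slightly):

```python
# Approximate parameter count from the figures quoted above.
layers, hidden, vocab = 70, 14336, 250_680

attention_and_mlp = 12 * layers * hidden**2  # ~4h^2 (attention) + ~8h^2 (MLP) per layer
embeddings = vocab * hidden                  # tied input/output embedding matrix

total = attention_and_mlp + embeddings
print(f"~{total / 1e9:.1f}B parameters")     # ~176.2B, consistent with the stated 176 billion
```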
130
 
131
+ **Objective Function:** Cross Entropy with mean reduction (see [API documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)).
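As an illustration of this objective (a minimal PyTorch sketch with toy dimensions, not BLOOM's actual training code), each position's predicted distribution over the vocabulary is scored against the observed next token and the per-token losses are averaged:

```python
import torch
import torch.nn as nn

batch, seq_len, vocab = 2, 8, 32                     # toy sizes for illustration only
logits = torch.randn(batch, seq_len, vocab)          # model outputs: one distribution per position
targets = torch.randint(0, vocab, (batch, seq_len))  # the observed next token at each position

# Cross entropy with mean reduction averages the per-token loss over every position.
loss_fn = nn.CrossEntropyLoss(reduction="mean")
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
print(loss.item())
```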
132
+
133
+ **Compute infrastructure:** Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).
134
 
135
+ * Hardware: 384 A100 80GB GPUs (48 nodes):
136
+
137
+ * Additional 32 A100 80GB GPUs (4 nodes) in reserve
138
 
139
+ * 8 GPUs per node, using NVLink 4 inter-GPU connects and 4 OmniPath links
140
 
141
+ * CPU: AMD
142
 
143
+ * CPU memory: 512GB per node
144
 
145
+ * GPU memory: 640GB per node
146
 
147
+ * Inter-node connect: Omni-Path Architecture (OPA)
148
 
149
+ * NCCL-communications network: a fully dedicated subnet
150
 
151
+ * Disc IO network: shared network with other types of nodes
152
 
153
+ * Software:
154
+
155
+ * Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
156
 
157
+ * DeepSpeed ([Github link](https://github.com/microsoft/DeepSpeed))
158
 
159
+ * PyTorch (pytorch-1.11 w/ CUDA-11.5; see [Github link](https://github.com/pytorch/pytorch))
160
 
161
+ * apex ([Github link](https://github.com/NVIDIA/apex))
162
 
163
 
164
  #### **Training**
166
 
167
  _In progress._
168
 
169
+ Current training logs: [Tensorboard link](https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs/)
170
 
171
+ - Checkpoint size:
172
+
173
+ - Bf16 weights: 329GB
174
+
175
+ - Full checkpoint with optimizer states: 2.3TB
176
 
177
+ - Training throughput: About 150 TFLOPs per GPU
178
 
179
+ - Number of epochs: 1 (*current target*)
180
 
181
+ - Dates:
182
+
183
+ - Started 11th March, 2022 11:42am PST
184
+
185
+ - Estimated end: 5th July, 2022
186
 
187
+ - Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments)
 
 
188
 
189
+ - Server training location: Île-de-France, France
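The quoted bf16 checkpoint size is consistent with simple arithmetic: two bytes per parameter over roughly 176 billion parameters is about 330 GiB. The sketch below is only a consistency check; it also applies the common 6·N·D estimate of total training compute, which is an approximation rather than a measured figure:

```python
params = 176e9                       # approximate parameter count
tokens = 350e9                       # approximate number of training tokens

bf16_bytes = params * 2              # bfloat16 stores 2 bytes per parameter
print(f"{bf16_bytes / 2**30:.0f} GiB")     # ~328 GiB, in line with the ~329GB quoted above

print(f"{6 * params * tokens:.1e} FLOPs")  # ~3.7e+23 total training FLOPs (6*N*D rule of thumb)
```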
190
 
191
+ #### **Tokenization**
192
+
193
+ The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)) is a learned subword tokenizer trained using:
194
+
195
+ - A byte-level Byte Pair Encoding (BPE) algorithm
196
 
197
+ - A simple pre-tokenization rule, no normalization
198
 
199
+ - A vocabulary size of 250,680
200
+
201
+ It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.
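As a short usage sketch (assuming the `transformers` library and the `bigscience/tokenizer` repository linked above; the example sentence is arbitrary):

```python
from transformers import AutoTokenizer

# Load the published BLOOM tokenizer (byte-level BPE).
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

encoding = tokenizer("BigScience is a one-year research workshop on large multilingual models.")
print(len(tokenizer))               # vocabulary size
print(encoding["input_ids"][:10])   # first few token ids for the example sentence
```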
202
+
203
  </details>
204
 
205
 
208
  <details>
209
  <summary>Click to expand</summary><br/>
210
 
211
+ The training supercomputer, Jean Zay ([website](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html)), uses mostly nuclear energy. The heat generated by it is reused for heating campus housing.
212
+
213
+ **Estimated carbon emissions:** *(Forthcoming upon completion of training.)*
 
 
214
 
215
+ **Estimated electricity usage:** *(Forthcoming upon completion of training.)*
216
 
 
 
217
 
218
+ </details>
219
  <p>&nbsp;</p>
220
 
221
  ## Uses
222
 
223
  *This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
224
+ It provides information for anyone considering using the model or who is affected by the model.*
225
 
226
 
227
  <details>
228
  <summary>Click to expand</summary><br/>
229
 
230
+ ### Intended Use
231
 
232
  This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.
233
 
235
 
236
  - Text generation
237
 
238
+ - Exploring characteristics of language generated by a language model
239
 
240
+ - Examples: Cloze tests, counterfactuals, generations with reframings
241
 
242
  #### **Downstream Use**
243
 
244
+ - Tasks that leverage language models include: Information Extraction, Question Answering, Summarization
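To make the text generation and downstream use cases above concrete, here is a minimal sketch using the `transformers` pipeline API. It is illustrative only: the full 176B-parameter checkpoint requires substantial accelerator memory (or offloaded/multi-GPU inference), and any compatible smaller causal language model checkpoint could be substituted.

```python
from transformers import pipeline

# Illustrative only: running bigscience/bloom at full size needs multi-GPU or offloaded inference.
generator = pipeline("text-generation", model="bigscience/bloom")

prompt = "The BigScience workshop was created to"
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])
```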
245
 
246
  ### Misuse and Out-of-scope Use
 
247
  *This section addresses what users ought not do with the model.*
248
 
249
+ See the [BLOOM License](https://huggingface.co/spaces/bigscience/license), Attachment A, for detailed usage restrictions. The below list is non-exhaustive, but lists some easily foreseeable problematic use cases.
250
 
251
  #### **Out-of-scope Uses**
252
 
253
+ Using the model in [high-stakes](#high-stakes) settings is out of scope for this model. The model is not designed for [critical decisions](#critical-decisions) nor uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but is not correct.
254
 
255
+ ##### Out-of-scope Uses Include:
256
 
257
+ - Usage in biomedical domains, political and legal domains, or finance domains
258
 
259
+ - Usage for evaluating or scoring individuals, such as for employment, education, or credit
260
 
261
+ - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct
262
 
263
  #### **Misuse**
264
 
265
+ Intentionally using the model for harm, violating [human rights](#human-rights), or other kinds of malicious activities is a misuse of this model. This includes:
266
 
267
  - Spam generation
268
 
272
 
273
  - Harassment and abuse
274
 
275
+ - [Deception](#deception)
276
 
277
  - Unconsented impersonation and imitation
278
 
279
  - Unconsented surveillance
280
 
281
+ - Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://huggingface.co/spaces/bigscience/license)
 
282
 
283
  ### Intended Users
284
 
300
 
301
  #### Indirect Users
302
 
303
+ - Users of derivatives created by Direct Users, such as those using software with an [intended use](#intended-use)
304
 
305
+ - Users of [Derivatives of the Model, as described in the License](https://huggingface.co/spaces/bigscience/license)
306
 
307
+ #### Others Affected (Parties Prenantes)
308
 
309
  - People and groups referred to by the LLM
310
 
311
  - People and groups exposed to outputs of, or decisions based on, the LLM
312
 
313
  - People and groups whose original work is included in the LLM
314
+
315
  </details>
316
  <p>&nbsp;</p>
317
 
319
  *This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*
320
 
321
 
 
322
  <details>
323
  <summary>Click to expand</summary><br/>
324
 
325
+ Details for each dataset are provided in individual [Data Cards](https://huggingface.co/spaces/bigscience/BigScienceCorpus).
326
 
327
  Training data includes:
328
 
329
+ - 45 natural languages
330
 
331
+ - 12 programming languages
332
 
333
+ - In 1.5TB of pre-processed text, converted into 350B unique tokens (see [the tokenizer section](#tokenization) for more)
334
 
 
335
 
336
  #### **Languages**
337
+
338
  The pie chart shows the distribution of languages in training data.
339
 
340
  ![pie chart showing the distribution of languages in training data](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_chart.svg?raw=true)
341
 
342
 
 
 
343
  The following table shows the further distribution of Niger-Congo and Indic languages in the training data.
344
  <details>
345
  <summary>Click to expand</summary><br/>
407
  ## Risks and Limitations
408
  *This section identifies foreseeable harms and misunderstandings.*
409
 
 
 
410
  <details>
411
  <summary>Click to expand</summary><br/>
412
 
416
 
417
  - Contain stereotypes
418
 
419
+ - Contain [personal information](#personal-data-and-information)
 
420
 
421
  - Generate:
422
 
424
 
425
  - Discriminatory or prejudicial language
426
 
427
+ - Content that may not be appropriate for all settings, including sexual content
428
 
429
+ - Make errors, including producing incorrect information as if it were factual
430
 
431
+ - Generate irrelevant or repetitive outputs
432
  </details>
433
  <p>&nbsp;</p>
434
 
435
  ## Evaluation
436
+ *This section describes the evaluation protocols and provides the results.*
437
+
438
  <details>
439
  <summary>Click to expand</summary><br/>
440
 
441
  ### Metrics
442
+ *This section describes the different ways performance is calculated and why.*
443
+
 
 
444
  Includes:
445
 
446
  | Metric | Why chosen |
447
  |--------------------|--------------------------------------------------------------------|
448
+ | [Perplexity](#perplexity) | Standard metric for quantifying model improvements during training |
449
+ | Cross Entropy [Loss](#loss) | Standard objective for language models. |
 
 
450
 
451
+ And multiple different metrics for specific tasks. _(More evaluation metrics forthcoming upon completion of evaluation protocol.)_
452
 
453
  ### Factors
454
  *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
455
 
456
  - Language, such as English or Yoruba
457
+
458
  - Domain, such as newswire or stories
459
+
460
  - Demographic characteristics, such as gender or nationality
461
 
462
  ### Results
463
  *Results are based on the [Factors](#factors) and [Metrics](#metrics).*
464
 
465
+ **Train-time Evaluation:**
466
 
467
+ As of 25.May.2022, 15:00 PST:
468
+
469
+ - Training Loss: 2.0
470
+
471
+ - Validation Loss: 2.2
472
 
473
+ - Perplexity: 8.9
474
+
475
+ (More evaluation scores forthcoming at the end of model training.)
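The reported perplexity is consistent with exponentiating the validation loss, assuming the loss is a mean per-token cross entropy in nats:

```python
import math

validation_loss = 2.2                       # mean per-token cross entropy (nats), as reported above
print(round(math.exp(validation_loss), 1))  # ~9.0, in line with the reported perplexity of 8.9
```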
476
+
477
+ </details>
478
+ <p>&nbsp;</p>
479
 
480
  ## Recommendations
481
 
482
  *This section provides information on warnings and potential mitigations.*
483
 
484
 
 
485
  <details>
486
  <summary>Click to expand</summary><br/>
487
 
489
 
490
  - Users should be aware of [Risks and Limitations](#risks-and-limitations), and include an appropriate age disclaimer or blocking interface as necessary.
491
 
492
+ - Models pretrained with the LLM should include an updated Model Card.
493
 
494
  - Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.
495
 
505
  <details>
506
  <summary>Click to expand</summary><br/>
507
 
508
+ - <a name="loss">**Loss:**</a> A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.
509
 
510
+ - <a name="perplexity">**Perplexity:**</a> This is based on what the model estimates the probability of new data is. The lower the perplexity, the better. If the model is 100% correct at predicting the next token it will see, then the perplexity is 1. Mathematically this is calculated using entropy.
511
 
512
+ - <a name="high-stakes">**High-stakes settings:**</a> Such as those identified as "high-risk AI systems" and "unacceptable risk AI systems" in the European Union's proposed [Artificial Intelligence (AI) Act](https://artificialintelligenceact.eu/annexes/).
513
 
514
+ - <a name="critical-decisions">**Critical decisions:**</a> Such as those defined in [the United States' proposed Algorithmic Accountability Act](https://www.congress.gov/117/bills/s3572/BILLS-117s3572is.pdf).
515
 
516
+ - <a name="human-rights">**Human rights:**</a> Includes those rights defined in the [Universal Declaration of Human Rights](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf).
517
 
518
+ - <a name="personal-data-and-information">**Personal Data and Personal Information:**</a> Personal data and information is defined in multiple data protection regulations, such as "[personal data](https://gdpr-info.eu/issues/personal-data/)" in the [European Union's General Data Protection Regulation](https://gdpr-info.eu); and "personal information" in the Republic of South Africa's [Protection of Personal Information Act](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf), The People's Republic of China's [Personal information protection law](http://en.npc.gov.cn.cdurl.cn/2021-12/29/c_694559.htm).
 
 
519
 
520
+ - <a name="sensitive-characteristics">**Sensitive characteristics:**</a> This includes specifically protected categories in human rights (see [UHDR, Article 2](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf)) and personal information regulation (see GDPR, [Article 9; Protection of Personal Information Act, Chapter 1](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf))
521
 
522
+ - <a name="deception">**Deception:**</a> Doing something to intentionally mislead individuals to believe something that is false, such as by creating deadbots or chatbots on social media posing as real people, or generating text documents without making consumers aware that the text is machine generated.
523
 
524
  </details>
525
  <p>&nbsp;</p>
526
 
527
+ ## More Information
528
+
529
+ <details>
530
+ <summary>Click to expand</summary><br/>
531
+
532
+ ### Dataset Creation
533
+
534
+ Blog post detailing the design choices during the dataset creation: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling
535
+
536
+ ### Technical Specifications
537
+
538
+ Blog post summarizing how the architecture, size, shape, and pre-training duration were selected: https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours
539
+
540
+ More details on the architecture/optimizer: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml
541
+
542
+ Blog post on the hardware/engineering side: https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model
543
+
544
+ Details on the distributed setup used for the training: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml
545
+
546
+ Tensorboard updated during the training: https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss
547
+
548
+ Insights on how to approach training, negative results: https://github.com/bigscience-workshop/bigscience/blob/master/train/lessons-learned.md
549
+
550
+ Details on the obstacles overcome during preparation on the engineering side (instabilities, optimization of training throughput, and many technical tricks and questions): https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md
551
+
552
+ ### Initial Results
553
+
554
+ Initial prompting experiments using interim checkpoints: https://huggingface.co/spaces/bigscience/bloom-book
555
+
556
+ </details>
557
+ <p>&nbsp;</p>
558
+
559
  ## Model Card Authors
560
  *Ordered roughly chronologically and by amount of time spent.*
561
 
562
+ Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay