meg HF staff committed on
Commit
45a9da2
·
1 Parent(s): b277c40

Copy+paste from updated model card, + warning.

  1. README.md +218 -115
README.md CHANGED
@@ -1,13 +1,63 @@
1
  ---
2
  license: bigscience-bloom-rail-1.0
3
  ---
4
 
5
  # <span style="color:red"><b>WARNING:</b> This is an <b>intermediary checkpoint</b>. It is not fully trained yet. You might want to use [Bloom-1B3](https://huggingface.co/bigscience/bloom-1b3) if you want a model that has completed training.</span>
6
 
7
- # <p>BLOOM LM<br/> _BigScience Large Open-source Open-access Multilingual Language Model_ <br/>Model Card</p>
8
- ![BigScience Logo](https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png)
9
 
10
- Version 1.0 / 24.May.2022
 
11
 
12
  ## Table of Contents
13
  1. [Model Details](#model-details)
@@ -17,84 +67,100 @@ Version 1.0 / 24.May.2022
17
  5. [Evaluation](#evaluation)
18
  6. [Recommendations](#recommendations)
19
  7. [Glossary and Calculations](#glossary-and-calculations)
20
- 8. [Model Card Authors](#model-card-authors)
 
21
 
22
  ## Model Details
23
 
24
  ### Basics
25
  *This section provides information for anyone who wants to know about the model.*
 
26
  <details>
27
  <summary>Click to expand</summary> <br/>
28
 
29
- **Developed by:** [BigScience](https://bigscience.huggingface.co)
30
- * All collaborators are either volunteers or have an agreement with their employer. [Further breakdown of participants forthcoming.]
31
 
 
 
32
  **Model Type:** Transformer-based Language Model
33
 
34
  **Version:** 1.0.0
35
 
36
- **Languages:** Multiple; see [training data](#training-data).
37
 
38
- **License:** [RAIL License v1.0](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit#)
39
 
40
- **Released:** [Forthcoming]
41
 
42
- **Send questions to:** bigscience-contact@googlegroups.com
43
 
44
- **Cite as:** [BigScience Workshop](https://bigscience.huggingface.co), BigScience Language Open-source Open-access Multilingual (BLOOM). International, May 2021-May 2022.
 
 
45
 
46
- **Funded by:** The French government, [Hugging Face](https://huggingface.co), and the organizations of contributors. [Further breakdown of organizations forthcoming.]
 
 
 
 
47
 
48
  </details>
49
 
50
  ### Technical Specifications
51
  *This section provides information for people who work on model development.*
 
52
  <details>
53
  <summary>Click to expand</summary><br/>
54
 
55
- *Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details.*
56
 
57
- **Model Architecture:** Modified from Megatron-LM GPT2 ([paper link](https://arxiv.org/abs/1909.08053)):
58
 
59
- 1. Layer normalization applied to word embedding layer
60
 
61
- 2. [ALiBI positional encodings](https://arxiv.org/pdf/2108.12409.pdf)
62
 
63
- **Objective Function:** [Cross Entropy with mean reduction](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)
64
 
65
- **Number of Parameters:** 350M parameters; 24 layers, 16 attention heads
66
 
67
- #### **Infrastructure**
68
-
69
- Compute Infrastructure: [Jean Zay](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html) Public Supercomputer, provided by the French government
70
 
71
- Hardware: 384 A100 80GB GPUs (48 nodes)
72
 
73
- - Additional 32 A100 80GB GPUs (4 nodes) in reserve
74
 
75
- - 8 GPUs per node Using NVLink 4 inter-gpu connects, 4 OmniPath links
 
 
76
 
77
- - CPU: AMD
 
 
 
 
78
 
79
- - CPU memory: 512GB per node
80
 
81
- - GPU memory: 640GB per node
82
 
83
- - Inter-node connect: Omni-Path Architecture (OPA)
84
 
85
- - NCCL-communications network: a fully dedicated subnet
86
 
87
- - Disc IO network: shared network with other types of nodes
88
 
89
- Software:
90
 
91
- - [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed), BigScience fork
 
 
92
 
93
- - [DeepSpeed](https://github.com/microsoft/DeepSpeed)
94
 
95
- - [PyTorch](https://github.com/pytorch/pytorch)-1.11 w/ CUDA-11.5
96
 
97
- - [apex](https://github.com/NVIDIA/apex)
98
 
99
 
100
  #### **Training**
@@ -102,24 +168,40 @@ Software:
102
 
103
  _In progress._
104
 
105
- Checkpoint size:
 
 
 
 
 
 
106
 
107
- - fp16 weights: 1.04GB
108
 
 
 
 
 
 
109
 
110
- Training throughput: About 150 TFLOP per GPU per second
111
 
112
- Number of steps: 555750
113
 
114
- Dates:
115
- - Started: to determine
116
- - Ended: to determine
117
 
 
 
 
 
 
118
 
119
- Estimated cost of training: Unknown
120
 
121
- Server training location: Ile-de-France, France
122
 
 
 
123
  </details>
124
 
125
 
@@ -128,29 +210,26 @@ Server training location: Ile-de-France, France
128
  <details>
129
  <summary>Click to expand</summary><br/>
130
 
131
- [More forthcoming when training has completed.]
132
-
133
- The training supercomputer, [Jean Zay]((http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html)), uses mostly nuclear energy.
134
-
135
- The heat generated by it is reused for heating campus housing.
136
 
137
- * Estimated carbon emissions: [Forthcoming]
 
 
138
 
139
- * Estimated electricity usage: [Forthcoming]
140
- </details>
141
 
 
142
  <p>&nbsp;</p>
143
 
144
  ## Uses
145
 
146
  *This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
147
- It provides information for anyone considering using the model, or who is affected by the model.*
148
 
149
 
150
  <details>
151
  <summary>Click to expand</summary><br/>
152
 
153
- ### Intended use
154
 
155
  This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.
156
 
@@ -158,35 +237,34 @@ This model is being created in order to enable public research on large language
158
 
159
  - Text generation
160
 
161
- - Exploring characteristics of language generated by a language model.
162
 
163
- - Examples: Cloze tests, counterfactuals, generations with reframings.
164
 
165
  #### **Downstream Use**
166
 
167
- - Tasks that leverage language models include: Information Extraction, Question Answering, Summarization.
168
 
169
  ### Misuse and Out-of-scope Use
170
-
171
  *This section addresses what users ought not do with the model.*
172
 
173
- See the [LLM LICENSE ](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit), Attachment A, for detailed usage restrictions. The below list is non-exhaustive, but lists some easily foreseeable problematic use cases.
174
 
175
  #### **Out-of-scope Uses**
176
 
177
- Using the model in [high-stakes](#glossary-and-calculations) settings is out of scope for this model. The model is not designed for [critical decisions](#glossary-and-calculations) nor uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but is not correct.
178
 
179
- ##### Out-of-scope uses include:
180
 
181
- - Usage in biomedical domains, political and legal domains, or finance domains.
182
 
183
- - Usage for evaluating or scoring individuals, such as for employment, education, or credit.
184
 
185
- - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct.
186
 
187
  #### **Misuse**
188
 
189
- Intentionally using the model for harm, violating rights, or other kinds of malicious activities is a misuse of this model. This includes:
190
 
191
  - Spam generation
192
 
@@ -196,14 +274,13 @@ Intentionally using the model for harm, violating rights, or other kinds of mali
196
 
197
  - Harassment and abuse
198
 
199
- - Deception
200
 
201
  - Unconsented impersonation and imitation
202
 
203
  - Unconsented surveillance
204
 
205
-
206
- - Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://docs.google.com/document/d/10NMjEKjxR7mrZ5CvugGBVaF6nPEgNxFBIbkH7z5HB-0/edit#heading=h.3blioxkgzsje).
207
 
208
  ### Intended Users
209
 
@@ -225,17 +302,18 @@ Intentionally using the model for harm, violating rights, or other kinds of mali
225
 
226
  #### Indirect Users
227
 
228
- - Users of derivatives created by Direct Users, such as those using software with an [intended use](#intended-use).
229
 
230
- - Users of [Derivatives of the Model, as described in the License](https://docs.google.com/document/d/117RhytMYC9HS-1NmWHEn9XBK7vJ5kdv9OcG6AV69Vec/edit#bookmark=id.pvl8781qfes3).
231
 
232
- #### Others Affected (Parties prenantes)
233
 
234
  - People and groups referred to by the LLM
235
 
236
  - People and groups exposed to outputs of, or decisions based on, the LLM
237
 
238
  - People and groups whose original work is included in the LLM
 
239
  </details>
240
  <p>&nbsp;</p>
241
 
@@ -243,30 +321,27 @@ Intentionally using the model for harm, violating rights, or other kinds of mali
243
  *This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*
244
 
245
 
246
-
247
  <details>
248
  <summary>Click to expand</summary><br/>
249
 
250
- *Details for each dataset are provided in individual [Data Cards](https://huggingface.co/spaces/bigscience/BigScienceCorpus).*
251
 
252
  Training data includes:
253
 
254
- - 45 natural languages.
255
 
256
- - 12 programming languages.
257
 
258
- - In 1.5TB of pre-processed text, converted into 350B unique tokens.
259
 
260
- See the [Model README, Datasets for more](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#datasets).
261
 
262
  #### **Languages**
 
263
  The pie chart shows the distribution of languages in training data.
264
 
265
  ![pie chart showing the distribution of languages in training data](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_chart.svg?raw=true)
266
 
267
 
268
-
269
-
270
  The following table shows the further distribution of Niger-Congo and Indic languages in the training data.
271
  <details>
272
  <summary>Click to expand</summary><br/>
@@ -334,8 +409,6 @@ The following table shows the distribution of programming languages.
334
  ## Risks and Limitations
335
  *This section identifies foreseeable harms and misunderstandings.*
336
 
337
-
338
-
339
  <details>
340
  <summary>Click to expand</summary><br/>
341
 
@@ -345,8 +418,7 @@ Model may:
345
 
346
  - Contain stereotypes
347
 
348
- - Contain personal information
349
-
350
 
351
  - Generate:
352
 
@@ -354,65 +426,64 @@ Model may:
354
 
355
  - Discriminatory or prejudicial language
356
 
357
- - Content that may not be appropriate for all settings, including sexual content.
358
 
359
- - Make errors, including producing incorrect information as if it were factual.
360
 
361
- - Generate irrelevant or repetitive outputs.
362
  </details>
363
  <p>&nbsp;</p>
364
 
365
  ## Evaluation
 
 
366
  <details>
367
  <summary>Click to expand</summary><br/>
368
 
369
  ### Metrics
370
- *This section describes the different ways performance is calculated, and why.*
371
-
372
- [More Forthcoming]
373
-
374
  Includes:
375
 
376
  | Metric | Why chosen |
377
  |--------------------|--------------------------------------------------------------------|
378
- | F1 | Standard for benchmarking |
379
- | Accuracy | Standard for benchmarking |
380
- | Perplexity | Standard metric for quantifying model improvements during training |
381
- | Cross Entropy Loss | Standard objective for language models |
382
 
383
- And multiple different metrics for specific tasks.
384
 
385
  ### Factors
386
  *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
387
 
388
  - Language, such as English or Yoruba
 
389
  - Domain, such as newswire or stories
 
390
  - Demographic characteristics, such as gender or nationality
391
 
392
  ### Results
393
  *Results are based on the [Factors](#factors) and [Metrics](#metrics).*
394
 
395
- **Train-time evaluation:**
396
 
397
- As of 19.May.2022, 18:00:
398
 
399
- - Training Loss: 2.04
400
 
401
- - Validation Loss: 2.21
402
 
403
- - Perplexity: 9.15
404
 
405
- [More evaluation types forthcoming at the end of model training.]
406
- </details>
407
 
408
- <BR/>
 
409
 
410
  ## Recommendations
411
 
412
  *This section provides information on warnings and potential mitigations.*
413
 
414
 
415
-
416
  <details>
417
  <summary>Click to expand</summary><br/>
418
 
@@ -420,7 +491,7 @@ As of 19.May.2022, 18:00:
420
 
421
  - Users should be aware of [Risks and Limitations](#risks-and-limitations), and include an appropriate age disclaimer or blocking interface as necessary.
422
 
423
- - Models pre-trained with the LLM should include an updated Model Card.
424
 
425
  - Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.
426
 
@@ -436,27 +507,59 @@ As of 19.May.2022, 18:00:
436
  <details>
437
  <summary>Click to expand</summary><br/>
438
 
439
- - **Loss:** A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.
440
 
 
441
 
442
- - **Perplexity:** This is based on what the model estimates the probability of new data is. The lower the perplexity, the better. If the model is 100% correct at predicting the next token it will see, then the perplexity is 1. Mathematically this is calculated using entropy.
443
 
444
- - **High-stakes settings:** Such as those identified as "high-risk AI systems" and "unacceptable risk AI systems" in the European Union's proposed [Artificial Intelligence (AI) Act](https://artificialintelligenceact.eu/annexes/).
445
 
446
- - **Critical decisions**: Such as those defined in [the United States' proposed Algorithmic Accountability Act](https://www.congress.gov/117/bills/s3572/BILLS-117s3572is.pdf).
447
 
448
- - **Human Rights**: Includes those rights defined in the [Universal Declaration of Human Rights](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf).
449
-
450
- - **Personal Data and Information**: Personal data and information is defined in multiple data protection regulations, such as "[personal data](https://gdpr-info.eu/issues/personal-data/)" in the [European Union's General Data Protection Regulation](https://gdpr-info.eu); and "personal information" in the Republic of South Africa's [Protection of Personal Information Act](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf), The People's Republic of China's [Personal information protection law](http://en.npc.gov.cn.cdurl.cn/2021-12/29/c_694559.htm).
451
 
452
- - **Sensitive Characteristics**: This includes specifically protected categories in human rights (see [UHDR, Article 2](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf)) and personal information regulation (see GDPR, [Article 9; Protection of Personal Information Act, Chapter 1](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf))
453
 
454
- - **Deception:** Doing something to intentionally mislead individuals to believe something that is false, such as by creating deadbots or chatbots on social media posing as real people, or generating text documents without making consumers aware that the text is machine generated.
455
 
456
  </details>
457
  <p>&nbsp;</p>
458
 
459
  ## Model Card Authors
460
  *Ordered roughly chronologically and by amount of time spent.*
461
 
462
- Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay
 
 
1
  ---
2
  license: bigscience-bloom-rail-1.0
3
+ language:
4
+ - ak
5
+ - ar
6
+ - as
7
+ - bm
8
+ - bn
9
+ - ca
10
+ - code
11
+ - en
12
+ - es
13
+ - eu
14
+ - fon
15
+ - fr
16
+ - gu
17
+ - hi
18
+ - id
19
+ - ig
20
+ - ki
21
+ - kn
22
+ - lg
23
+ - ln
24
+ - ml
25
+ - mr
26
+ - ne
27
+ - nso
28
+ - ny
29
+ - or
30
+ - pa
31
+ - pt
32
+ - rn
33
+ - rw
34
+ - sn
35
+ - st
36
+ - sw
37
+ - ta
38
+ - te
39
+ - tn
40
+ - ts
41
+ - tum
42
+ - tw
43
+ - ur
44
+ - vi
45
+ - wo
46
+ - xh
47
+ - yo
48
+ - zh
49
+ - zhs
50
+ - zht
51
+ - zu
52
  ---
53
 
54
  # <span style="color:red"><b>WARNING:</b> This is an <b>intermediary checkpoint</b>. It is not fully trained yet. You might want to use [Bloom-1B3](https://huggingface.co/bigscience/bloom-1b3) if you want a model that has completed training.</span>
55
 
56
+ # <p>BLOOM LM<br/> _BigScience Large Open-science Open-access Multilingual Language Model_ <br/>Model Card</p>
57
+ <img src="https://assets.website-files.com/6139f3cdcbbff3a68486761d/613cd8997b270da063e230c5_Tekengebied%201-p-500.png" alt="BigScience Logo" width="200"/>
58
 
59
+
60
+ Version 1.0 / 26.May.2022
61
 
62
  ## Table of Contents
63
  1. [Model Details](#model-details)
 
67
  5. [Evaluation](#evaluation)
68
  6. [Recommendations](#recommendations)
69
  7. [Glossary and Calculations](#glossary-and-calculations)
70
+ 8. [More Information](#more-information)
71
+ 9. [Model Card Authors](#model-card-authors)
72
 
73
  ## Model Details
74
 
75
  ### Basics
76
  *This section provides information for anyone who wants to know about the model.*
77
+
78
  <details>
79
  <summary>Click to expand</summary> <br/>
80
 
81
+ **Developed by:** BigScience ([website](https://bigscience.huggingface.co))
 
82
 
83
+ * All collaborators are either volunteers or have an agreement with their employer. *(Further breakdown of participants forthcoming.)*
84
+
85
  **Model Type:** Transformer-based Language Model
86
 
87
  **Version:** 1.0.0
88
 
89
+ **Languages:** Multiple; see [training data](#training-data)
90
 
91
+ **License:** RAIL License v1.0 ([link](https://huggingface.co/spaces/bigscience/license))
92
 
93
+ **Release Date Estimate:** Monday, 11.July.2022
94
 
95
+ **Send Questions to:** bigscience-contact@googlegroups.com
96
 
97
+ **Cite as:** BigScience, _BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model_. International, May 2021-May 2022
98
+
99
+ **Funded by:**
100
 
101
+ * The French government.
102
+
103
+ * Hugging Face ([website](https://huggingface.co)).
104
+
105
+ * Organizations of contributors. *(Further breakdown of organizations forthcoming.)*
106
 
107
  </details>
108
 
109
  ### Technical Specifications
110
  *This section provides information for people who work on model development.*
111
+
112
  <details>
113
  <summary>Click to expand</summary><br/>
114
 
115
+ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details on replicating training.
116
 
117
+ **Model Architecture:** Modified from Megatron-LM GPT2 (see [paper](https://arxiv.org/abs/1909.08053), [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):
118
 
119
+ * Decoder-only architecture
120
 
121
+ * Layer normalization applied to word embeddings layer (`StableEmbedding`; see [code](https://github.com/facebookresearch/bitsandbytes), [paper](https://arxiv.org/pdf/2110.02861.pdf))
122
 
123
+ * ALiBi positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GELU activation functions
124
 
125
+ * 176 billion parameters:
126
 
127
+ * 70 layers, 112 attention heads
 
 
128
 
129
+ * Hidden layers are 14336-dimensional
130
 
131
+ * Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))
132
 
133
+ **Objective Function:** Cross Entropy with mean reduction (see [API documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)).
134
+
135
+ **Compute infrastructure:** Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).
136
 
137
+ * Hardware: 384 A100 80GB GPUs (48 nodes):
138
+
139
+ * Additional 32 A100 80GB GPUs (4 nodes) in reserve
140
+
141
+ * 8 GPUs per node, using NVLink 4 inter-GPU connects and 4 Omni-Path links
142
 
143
+ * CPU: AMD
144
 
145
+ * CPU memory: 512GB per node
146
 
147
+ * GPU memory: 640GB per node
148
 
149
+ * Inter-node connect: Omni-Path Architecture (OPA)
150
 
151
+ * NCCL-communications network: a fully dedicated subnet
152
 
153
+ * Disk I/O network: shared network with other types of nodes
154
 
155
+ * Software:
156
+
157
+ * Megatron-DeepSpeed ([GitHub link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
158
 
159
+ * DeepSpeed ([GitHub link](https://github.com/microsoft/DeepSpeed))
160
 
161
+ * PyTorch (pytorch-1.11 w/ CUDA-11.5; see [GitHub link](https://github.com/pytorch/pytorch))
162
 
163
+ * apex ([GitHub link](https://github.com/NVIDIA/apex))
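
As a quick consistency check on the hardware figures listed above, the following sketch recomputes the totals from the per-node numbers. It only uses values stated in the list (48 nodes, 8 GPUs per node, 80GB per GPU, 4 reserve nodes); the variable names are illustrative.

```python
# Values copied from the hardware list in the card.
NODES = 48
RESERVE_NODES = 4
GPUS_PER_NODE = 8
GPU_MEMORY_GB = 80  # A100 80GB

# Totals implied by the per-node figures.
total_gpus = NODES * GPUS_PER_NODE
reserve_gpus = RESERVE_NODES * GPUS_PER_NODE
gpu_memory_per_node_gb = GPUS_PER_NODE * GPU_MEMORY_GB

print(f"training GPUs:       {total_gpus}")
print(f"reserve GPUs:        {reserve_gpus}")
print(f"GPU memory per node: {gpu_memory_per_node_gb}GB")
```

The results match the card: 384 training GPUs, 32 reserve GPUs, and 640GB of GPU memory per node.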
164
 
165
 
166
  #### **Training**
 
168
 
169
  _In progress._
170
 
171
+ Current training logs: [Tensorboard link](https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs/)
172
+
173
+ - Checkpoint size:
174
+
175
+ - Bf16 weights: 329GB
176
+
177
+ - Full checkpoint with optimizer states: 2.3TB
178
 
179
+ - Training throughput: About 150 TFLOP/s per GPU
180
 
181
+ - Number of epochs: 1 (*current target*)
182
+
183
+ - Dates:
184
+
185
+ - Started 11th March, 2022 11:42am PST
186
 
187
+ - Estimated end: 5th July, 2022
188
 
189
+ - Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments)
190
 
191
+ - Server training location: Île-de-France, France
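
The checkpoint sizes listed above can be roughly reproduced from the parameter count. The sketch below is a back-of-the-envelope estimate, not the project's own accounting: it assumes the rounded figure of 176 billion parameters, 2 bytes per bf16 weight, and (as a labeled assumption) an Adam-style optimizer that keeps an fp32 master copy plus two fp32 moment buffers per parameter.

```python
PARAMS = 176e9          # rounded parameter count from the card
BYTES_BF16 = 2          # one bf16 value
GIB = 2**30
TIB = 2**40

# Size of the bf16 weights alone.
bf16_bytes = PARAMS * BYTES_BF16
bf16_gib = bf16_bytes / GIB

# ASSUMPTION (not stated in the card): fp32 master weights plus two
# fp32 Adam moment buffers, i.e. 4 + 4 + 4 = 12 extra bytes per parameter.
OPTIMIZER_BYTES_PER_PARAM = 12
full_bytes = PARAMS * (BYTES_BF16 + OPTIMIZER_BYTES_PER_PARAM)
full_tib = full_bytes / TIB

print(f"bf16 weights:    {bf16_gib:.1f} GiB")
print(f"full checkpoint: {full_tib:.2f} TiB (under the optimizer assumption)")
```

Under these assumptions the estimates come out at about 328 GiB and 2.24 TiB, close to the 329GB and 2.3TB reported. The small gaps are expected because the parameter count is rounded and the optimizer layout is assumed.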
 
 
192
 
193
+ #### **Tokenization**
194
+
195
+ The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)) is a learned subword tokenizer trained using:
196
+
197
+ - A byte-level Byte Pair Encoding (BPE) algorithm
198
 
199
+ - A simple pre-tokenization rule, no normalization
200
 
201
+ - A vocabulary size of 250,680
202
 
203
+ It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.
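
To illustrate what "byte-level BPE" means, here is a toy sketch; it is not the BLOOM tokenizer's training code. Text is first mapped to UTF-8 bytes, so any string can be represented with 256 base symbols, and the most frequent adjacent pair is then merged into a new symbol. The helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of symbol ids."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


text = "abababé"
ids = list(text.encode("utf-8"))  # byte-level: "é" becomes two bytes
pair = most_frequent_pair(ids)    # the pair for "ab"
merged = merge_pair(ids, pair, new_id=256)  # first id beyond the 256 bytes

print(ids)
print(pair)
print(merged)
```

A real tokenizer repeats the merge step until the target vocabulary size is reached (250,680 for BLOOM) and applies a pre-tokenization rule before counting pairs.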
204
+
205
  </details>
206
 
207
 
 
210
  <details>
211
  <summary>Click to expand</summary><br/>
212
 
213
+ The training supercomputer, Jean Zay ([website](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html)), uses mostly nuclear energy. The heat generated by it is reused for heating campus housing.
 
 
 
 
214
 
215
+ **Estimated carbon emissions:** *(Forthcoming upon completion of training.)*
216
+
217
+ **Estimated electricity usage:** *(Forthcoming upon completion of training.)*
218
 
 
 
219
 
220
+ </details>
221
  <p>&nbsp;</p>
222
 
223
  ## Uses
224
 
225
  *This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
226
+ It provides information for anyone considering using the model or who is affected by the model.*
227
 
228
 
229
  <details>
230
  <summary>Click to expand</summary><br/>
231
 
232
+ ### Intended Use
233
 
234
  This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.
235
 
 
237
 
238
  - Text generation
239
 
240
+ - Exploring characteristics of language generated by a language model
241
 
242
+ - Examples: Cloze tests, counterfactuals, generations with reframings
243
 
244
  #### **Downstream Use**
245
 
246
+ - Tasks that leverage language models include: Information Extraction, Question Answering, Summarization
247
 
248
  ### Misuse and Out-of-scope Use
 
249
  *This section addresses what users ought not do with the model.*
250
 
251
+ See the [BLOOM License](https://huggingface.co/spaces/bigscience/license), Attachment A, for detailed usage restrictions. The below list is non-exhaustive, but lists some easily foreseeable problematic use cases.
252
 
253
  #### **Out-of-scope Uses**
254
 
255
+ Using the model in [high-stakes](#high-stakes) settings is out of scope for this model. The model is not designed for [critical decisions](#critical-decisions) nor for uses with any material consequences on an individual's livelihood or wellbeing. The model may output content that appears factual but is not correct.
256
 
257
+ ##### Out-of-scope Uses Include:
258
 
259
+ - Usage in biomedical domains, political and legal domains, or finance domains
260
 
261
+ - Usage for evaluating or scoring individuals, such as for employment, education, or credit
262
 
263
+ - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct
264
 
265
  #### **Misuse**
266
 
267
+ Intentionally using the model for harm, violating [human rights](#human-rights), or other kinds of malicious activities, is a misuse of this model. This includes:
268
 
269
  - Spam generation
270
 
 
274
 
275
  - Harassment and abuse
276
 
277
+ - [Deception](#deception)
278
 
279
  - Unconsented impersonation and imitation
280
 
281
  - Unconsented surveillance
282
 
283
+ - Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://huggingface.co/spaces/bigscience/license)
 
284
 
285
  ### Intended Users
286
 
 
302
 
303
  #### Indirect Users
304
 
305
+ - Users of derivatives created by Direct Users, such as those using software with an [intended use](#intended-use)
306
 
307
+ - Users of [Derivatives of the Model, as described in the License](https://huggingface.co/spaces/bigscience/license)
308
 
309
+ #### Others Affected (Parties Prenantes)
310
 
311
  - People and groups referred to by the LLM
312
 
313
  - People and groups exposed to outputs of, or decisions based on, the LLM
314
 
315
  - People and groups whose original work is included in the LLM
316
+
317
  </details>
318
  <p>&nbsp;</p>
319
 
 
321
  *This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*
322
 
323
 
 
324
  <details>
325
  <summary>Click to expand</summary><br/>
326
 
327
+ Details for each dataset are provided in individual [Data Cards](https://huggingface.co/spaces/bigscience/BigScienceCorpus).
328
 
329
  Training data includes:
330
 
331
+ - 45 natural languages
332
 
333
+ - 12 programming languages
334
 
335
+ - In 1.5TB of pre-processed text, converted into 350B unique tokens (see [the tokenizer section](#tokenization) for more)
336
 
 
337
 
338
  #### **Languages**
339
+
340
  The pie chart shows the distribution of languages in training data.
341
 
342
  ![pie chart showing the distribution of languages in training data](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_chart.svg?raw=true)
343
 
344
 
 
 
345
  The following table shows the further distribution of Niger-Congo and Indic languages in the training data.
346
  <details>
347
  <summary>Click to expand</summary><br/>
 
409
  ## Risks and Limitations
410
  *This section identifies foreseeable harms and misunderstandings.*
411
 
 
 
412
  <details>
413
  <summary>Click to expand</summary><br/>
414
 
 
418
 
419
  - Contain stereotypes
420
 
421
+ - Contain [personal information](#personal-data-and-information)
 
422
 
423
  - Generate:
424
 
 
426
 
427
  - Discriminatory or prejudicial language
428
 
429
+ - Content that may not be appropriate for all settings, including sexual content
430
 
431
+ - Make errors, including producing incorrect information as if it were factual
432
 
433
+ - Generate irrelevant or repetitive outputs
434
  </details>
435
  <p>&nbsp;</p>
436
 
437
  ## Evaluation
438
+ *This section describes the evaluation protocols and provides the results.*
439
+
440
  <details>
441
  <summary>Click to expand</summary><br/>
442
 
443
  ### Metrics
444
+ *This section describes the different ways performance is calculated and why.*
445
+
 
 
446
  Includes:
447
 
448
  | Metric | Why chosen |
449
  |--------------------|--------------------------------------------------------------------|
450
+ | [Perplexity](#perplexity) | Standard metric for quantifying model improvements during training |
451
+ | Cross Entropy [Loss](#loss) | Standard objective for language models |
 
 
452
 
453
+ And multiple different metrics for specific tasks. _(More evaluation metrics forthcoming upon completion of evaluation protocol.)_
454
 
455
  ### Factors
456
  *This section lists some different aspects of what BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
457
 
458
  - Language, such as English or Yoruba
459
+
460
  - Domain, such as newswire or stories
461
+
462
  - Demographic characteristics, such as gender or nationality
463
 
464
  ### Results
465
  *Results are based on the [Factors](#factors) and [Metrics](#metrics).*
466
 
467
+ **Train-time Evaluation:**
468
 
469
+ As of 25.May.2022, 15:00 PST:
470
 
471
+ - Training Loss: 2.0
472
 
473
+ - Validation Loss: 2.2
474
 
475
+ - Perplexity: 8.9
476
 
477
+ (More evaluation scores forthcoming at the end of model training.)
 
478
 
479
+ </details>
480
+ <p>&nbsp;</p>
481
 
482
  ## Recommendations
483
 
484
  *This section provides information on warnings and potential mitigations.*
485
 
486
 
 
487
  <details>
488
  <summary>Click to expand</summary><br/>
489
 
 
491
 
492
  - Users should be aware of [Risks and Limitations](#risks-and-limitations), and include an appropriate age disclaimer or blocking interface as necessary.
493
 
494
+ - Models pretrained with the LLM should include an updated Model Card.
495
 
496
  - Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.
497
 
 
507
  <details>
508
  <summary>Click to expand</summary><br/>
509
 
510
+ - <a name="loss">**Loss:**</a> A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.
511
 
512
+ - <a name="perplexity">**Perplexity:**</a> A measure based on the probability the model assigns to new data. The lower the perplexity, the better. If the model is 100% correct at predicting the next token it will see, then the perplexity is 1. Mathematically, it is the exponential of the cross-entropy.
513
 
514
+ - <a name="high-stakes">**High-stakes settings:**</a> Such as those identified as "high-risk AI systems" and "unacceptable risk AI systems" in the European Union's proposed [Artificial Intelligence (AI) Act](https://artificialintelligenceact.eu/annexes/).
515
 
516
+ - <a name="critical-decisions">**Critical decisions:**</a> Such as those defined in [the United States' proposed Algorithmic Accountability Act](https://www.congress.gov/117/bills/s3572/BILLS-117s3572is.pdf).
517
 
518
+ - <a name="human-rights">**Human rights:**</a> Includes those rights defined in the [Universal Declaration of Human Rights](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf).
519
 
520
+ - <a name="personal-data-and-information">**Personal Data and Personal Information:**</a> Personal data and information are defined in multiple data protection regulations, such as "[personal data](https://gdpr-info.eu/issues/personal-data/)" in the [European Union's General Data Protection Regulation](https://gdpr-info.eu), and "personal information" in the Republic of South Africa's [Protection of Personal Information Act](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf) and the People's Republic of China's [Personal Information Protection Law](http://en.npc.gov.cn.cdurl.cn/2021-12/29/c_694559.htm).
 
 
521
 
522
+ - <a name="sensitive-characteristics">**Sensitive characteristics:**</a> This includes specifically protected categories in human rights (see [UDHR, Article 2](https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf)) and personal information regulation (see GDPR, Article 9; [Protection of Personal Information Act, Chapter 1](https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013popi.pdf)).
523
 
524
+ - <a name="deception">**Deception:**</a> Doing something to intentionally mislead individuals to believe something that is false, such as by creating deadbots or chatbots on social media posing as real people, or generating text documents without making consumers aware that the text is machine generated.
525
 
526
  </details>
527
  <p>&nbsp;</p>
528
 
529
+ ## More Information
530
+
531
+ <details>
532
+ <summary>Click to expand</summary><br/>
533
+
534
+ ### Dataset Creation
535
+
536
+ Blog post detailing the design choices during the dataset creation: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling
537
+
538
+ ### Technical Specifications
539
+
540
+ Blog post summarizing how the architecture, size, shape, and pre-training duration were selected: https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours
541
+
542
+ More details on the architecture/optimizer: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml
543
+
544
+ Blog post on the hardware/engineering side: https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model
545
+
546
+ Details on the distributed setup used for the training: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml
547
+
548
+ Tensorboard updated during the training: https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss
549
+
550
+ Insights on how to approach training, negative results: https://github.com/bigscience-workshop/bigscience/blob/master/train/lessons-learned.md
551
+
552
+ Details on the obstacles overcome on the engineering side during preparation (instabilities, optimization of training throughput, and many other technical tricks and questions): https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md
553
+
554
+ ### Initial Results
555
+
556
+ Initial prompting experiments using interim checkpoints: https://huggingface.co/spaces/bigscience/bloom-book
557
+
558
+ </details>
559
+ <p>&nbsp;</p>
560
+
561
  ## Model Card Authors
562
  *Ordered roughly chronologically and by amount of time spent.*
563
 
564
+ Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay
565
+