ctlin committed
Commit 3a01a1b
1 Parent(s): ff465fa
Files changed (1): README.md (+204 -1)
README.md CHANGED
---
license: bigscience-bloom-rail-1.0
language:
- en
- zht
pipeline_tag: text-generation
---

<h1 style='text-align: center '>BLOOM-zh</h1>
<h2 style='text-align: center '><em>Open-access Multilingual Language Model based on BLOOM</em></h2>
<h3 style='text-align: center '>Model Card</h3>

Version 1.0 / 13.Feb.2023

This model was developed through a close collaboration between MediaTek Research, the National Academy for Educational Research, and the CKIP Lab at Academia Sinica.

## Table of Contents
1. [Model Details](#model-details)
2. [Uses](#uses)
3. [Training Data](#training-data)
4. [Risks and Limitations](#risks-and-limitations)
5. [Evaluation](#evaluation)
6. [Recommendations](#recommendations)
7. [Glossary and Calculations](#glossary-and-calculations)
8. [More Information](#more-information)
9. [Model Card Authors](#model-card-authors)

## Model Details
BLOOM-zh is a modification of [BLOOM](https://huggingface.co/bigscience/bloom).
It is further pretrained on a larger amount of Traditional Chinese text while retaining the English capability of the original pretrained model.

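Since the model exposes the standard `text-generation` interface, it can be loaded with the Hugging Face `transformers` library. The sketch below is a minimal, unofficial example; the repository id is an assumption taken from the license link later in this card.

```python
# Minimal generation sketch (assumed repository id, not confirmed by this card).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MediaTek-Research/bloom-1b1-zh"  # assumption: taken from the license URL below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "四月的天氣"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
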
### Basics
*This section provides information for anyone who wants to know about the model.*

<details>
<summary>Click to expand</summary> <br/>

**Developed by:** MediaTek Research ([website](https://www.mtkresearch.com/))

**Model Type:** Transformer-based Language Model

**Version:** 1.0.0

**Languages:** Multiple; see [training data](#training-data)

**License:** MEDIATEK RESEARCH License ([link](https://huggingface.co/MediaTek-Research/bloom-1b1-zh/blob/main/LICENSE_MR.md)) and RAIL License v1.0 ([link](https://huggingface.co/spaces/bigscience/license))

**Release Date Estimate:** Tuesday, 14.February.2023

**Send Questions to:** info@mtkresearch.com

**Cite as:** MediaTek Research, MediaTek Research Open-access Multilingual Language Model based on BLOOM. International, February 2023.

**Organizations of contributors:**

* MediaTek Research
* Academia Sinica

</details>

### Technical Specifications
*This section provides information for people who work on model development.*

<details>
<summary>Click to expand</summary><br/>

**Model Architecture:** Modified from Megatron-LM GPT2 (see [paper](https://arxiv.org/abs/1909.08053), [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):

* Decoder-only architecture

* Layer normalization applied to the word embedding layer (`StableEmbedding`; see [code](https://github.com/facebookresearch/bitsandbytes), [paper](https://arxiv.org/pdf/2110.02861.pdf))

* ALiBi positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GeLU activation functions

* 1,065,314,304 parameters (a rough consistency check of these figures is sketched after this list):

  * 385,351,680 embedding parameters

  * 24 layers, 16 attention heads

  * Hidden layers are 1536-dimensional

* Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))

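The quoted totals can be sanity-checked from the hyperparameters above. The sketch below assumes the standard BLOOM/Megatron GPT-2 block layout (fused QKV projection with biases, 4x MLP expansion, two LayerNorms per block, tied input/output embeddings, plus an embedding LayerNorm and a final LayerNorm); these details are inferred from the BLOOM architecture rather than stated in this card.

```python
# Back-of-the-envelope parameter count under the assumptions stated above.
hidden = 1536
layers = 24
embedding_params = 385_351_680                 # figure quoted in this card
embedding_rows = embedding_params // hidden    # 250,880: presumably the 250,680-token vocabulary, padded

attention = (3 * hidden * hidden + 3 * hidden) + (hidden * hidden + hidden)  # fused QKV + output projection
mlp = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)    # up- and down-projection
layer_norms = 2 * (2 * hidden)                                               # pre-attention and pre-MLP LayerNorms
per_layer = attention + mlp + layer_norms

total = layers * per_layer + embedding_params + 2 * (2 * hidden)             # + embedding LN and final LN
print(embedding_rows)  # 250880
print(total)           # 1065314304, matching the 1,065,314,304 quoted above
```
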
**Objective Function:** Cross Entropy with mean reduction (see [API documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)).

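For concreteness, the sketch below shows how this objective is typically computed for causal language modeling in PyTorch; the shapes are illustrative and the one-token shift is the usual convention, not something specified in this card.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch of 2 sequences of 16 tokens over the BLOOM vocabulary.
vocab_size = 250_680
logits = torch.randn(2, 16, vocab_size)         # model outputs at every position
labels = torch.randint(0, vocab_size, (2, 16))  # target token ids

# Next-token prediction: the logits at position t are scored against token t+1.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()

# Cross entropy with the default mean reduction, as referenced above.
loss = F.cross_entropy(shift_logits.view(-1, vocab_size), shift_labels.view(-1))
print(loss.item())
```
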
**Compute infrastructure:**

* Hardware: 8 A6000 48GB GPUs (1 node)

* Software:

  * BigScience Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))

  * DeepSpeed ([Github link](https://github.com/microsoft/DeepSpeed))

  * PyTorch (pytorch-1.11 w/ CUDA-11.5; see [Github link](https://github.com/pytorch/pytorch))

  * apex ([Github link](https://github.com/NVIDIA/apex))

#### **Training**

Details are provided in the [paper](https://arxiv.org/).

- Number of epochs: 1

- Dates: Feb. 2023

#### **Tokenization**

The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)) is a learned subword tokenizer trained using:

- A byte-level Byte Pair Encoding (BPE) algorithm

- A simple pre-tokenization rule, no normalization

- A vocabulary size of 250,680

It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.

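As a quick illustration of the tokenizer's behaviour, the sketch below loads the shared BLOOM tokenizer through `transformers`; loading it from the `bigscience/bloom-1b1` checkpoint is an assumption made here for convenience.

```python
from transformers import AutoTokenizer

# The BLOOM checkpoints share the same byte-level BPE tokenizer.
tok = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")

print(tok.vocab_size)  # 250680, matching the figure above
ids = tok("今天天氣很好。 The weather is nice today.")["input_ids"]
print(len(ids))                            # number of tokens for the mixed-language sentence
print(tok.convert_ids_to_tokens(ids)[:5])  # first few subword tokens
```
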
</details>

### Environmental Impact

<details>
<summary>Click to expand</summary><br/>

Please refer to the BLOOM-1b1 [model card](https://huggingface.co/bigscience/bloom-1b1#model-details).

</details>
<p>&nbsp;</p>

## Uses

*This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
It provides information for anyone considering using the model or who is affected by the model.*

<details>
<summary>Click to expand</summary><br/>

Please refer to the BLOOM-1b1 [model card](https://huggingface.co/bigscience/bloom-1b1#uses).

</details>
<p>&nbsp;</p>

## Training Data
*This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*

<details>
<summary>Click to expand</summary><br/>

We trained the 1B1-parameter model on a total of 6 billion tokens, mainly crawled from the internet and provided by the National Academy for Educational Research. 75% of the training data is Traditional Chinese and 25% is English (roughly 4.5 billion and 1.5 billion tokens, respectively).

</details>
<p>&nbsp;</p>

## Risks and Limitations
*This section identifies foreseeable harms and misunderstandings.*

<details>
<summary>Click to expand</summary><br/>

Please refer to the BLOOM-1b1 [model card](https://huggingface.co/bigscience/bloom-1b1#risks-and-limitations).

</details>
<p>&nbsp;</p>

### Factors
*This section lists some different aspects of BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*

- The model is trained on Traditional Chinese and English. However, the pretrained weights capture more than 40 different languages.

- The model is trained on web-crawled data, news articles, novels, knowledge sources (encyclopedias, the education sector) and instructions.

<p>&nbsp;</p>

## Recommendations

*This section provides information on warnings and potential mitigations.*

<details>
<summary>Click to expand</summary><br/>

Please refer to the BLOOM-1b1 [model card](https://huggingface.co/bigscience/bloom-1b1#recommendations).

</details>
<p>&nbsp;</p>

## Model Card Authors
*Ordered roughly chronologically and by amount of time spent.*

Philipp Ennen, Po-Chun Hsu, Chan-Jan Hsu, Chang-Le Liu, Yin-Hsiang Liao, Chin-Tung Lin, Jezabel Rodriguez Garcia, Federica Freddi, Da-Shan Shiu, Wei-Yun Ma