---
language:
- multilingual
- af
- sq
- ar
- an
- hy
- ast
- az
- ba
- eu
- bar
- be
- bn
- inc
- bs
- br
- bg
- my
- ca
- ceb
- ce
- zh
- cv
- hr
- cs
- da
- nl
- en
- et
- fi
- fr
- gl
- ka
- de
- el
- gu
- ht
- he
- hi
- hu
- is
- io
- id
- ga
- it
- ja
- jv
- kn
- kk
- ky
- ko
- la
- lv
- lt
- roa
- nds
- lm
- mk
- mg
- ms
- ml
- mr
- mn
- min
- ne
- new
- nb
- nn
- oc
- fa
- pms
- pl
- pt
- pa
- ro
- ru
- sco
- sr
- scn
- sk
- sl
- aze
- es
- su
- sw
- sv
- tl
- tg
- th
- ta
- tt
- te
- tr
- uk
- ud
- uz
- vi
- vo
- war
- cy
- fry
- pnb
- yo
license: apache-2.0
datasets:
- wikipedia
---

# BERT multilingual base model (cased)

Pretrained model on the top 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective.
It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in
[this repository](https://github.com/google-research/bert). This model is case-sensitive: it makes a difference
between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like
GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
predict whether the two sentences followed each other or not.

This way, the model learns an inner representation of the languages in the training set that can then be used to
extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a
standard classifier using the features produced by the BERT model as inputs.
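
As a minimal sketch of that feature-extraction approach (the labeled examples and the scikit-learn classifier below are illustrative assumptions, not part of the original release):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

# Tiny, hypothetical labeled dataset (1 = positive, 0 = negative), for illustration only.
texts = [
    "I really enjoyed this film.",
    "Das war ein schrecklicher Film.",
    "Quelle belle journée !",
    "This was a waste of time.",
]
labels = [1, 0, 1, 0]

with torch.no_grad():
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**encoded)
    # Use the final hidden state of the [CLS] token as a fixed-size sentence feature.
    features = outputs.last_hidden_state[:, 0, :].numpy()

# Any standard classifier can be trained on top of the extracted features.
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
print(classifier.predict(features))
```

For most tasks, fine-tuning the whole model end to end typically outperforms this frozen-feature setup.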

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at a model like GPT-2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-multilingual-cased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] Hello I'm a model model. [SEP]",
  'score': 0.10182085633277893,
  'token': 13192,
  'token_str': 'model'},
 {'sequence': "[CLS] Hello I'm a world model. [SEP]",
  'score': 0.052126359194517136,
  'token': 11356,
  'token_str': 'world'},
 {'sequence': "[CLS] Hello I'm a data model. [SEP]",
  'score': 0.048930276185274124,
  'token': 11165,
  'token_str': 'data'},
 {'sequence': "[CLS] Hello I'm a flight model. [SEP]",
  'score': 0.02036019042134285,
  'token': 23578,
  'token_str': 'flight'},
 {'sequence': "[CLS] Hello I'm a business model. [SEP]",
  'score': 0.020079681649804115,
  'token': 14155,
  'token_str': 'business'}]
```
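
Since the checkpoint is multilingual, the same pipeline accepts input in any of the pretraining languages; for example (a usage sketch, with the predictions not reproduced here):

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-multilingual-cased')
>>> for prediction in unmasker("Paris est la [MASK] de la France."):
...     print(prediction["token_str"], round(prediction["score"], 3))
```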

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = TFBertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

## Training data

The BERT model was pretrained on the 104 languages with the largest Wikipedias. You can find the complete list
[here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).

## Training procedure

### Preprocessing

The texts are tokenized using WordPiece and a shared vocabulary size of 110,000. The languages with a
larger Wikipedia are under-sampled and the ones with lower resources are oversampled. For languages that don't use
spaces, such as Chinese, Japanese Kanji and Korean Hanja, spaces are added around every character in the CJK Unicode
range before tokenization.

The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
the other cases, sentence B is another random sentence in the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the combined length of the
two "sentences" is less than 512 tokens.

The details of the masking procedure for each sentence are the following (see the sketch after this list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
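
A minimal sketch of that 80/10/10 rule (illustrative only; the `mask_tokens` helper below is hypothetical, and the actual pretraining used the original TensorFlow data-creation scripts):

```python
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def mask_tokens(tokens, mask_prob=0.15):
    """Apply the BERT-style masking rule to a list of WordPiece tokens."""
    masked, targets = [], []
    for token in tokens:
        if random.random() < mask_prob:
            targets.append(token)  # the model has to predict the original token here
            roll = random.random()
            if roll < 0.8:
                masked.append("[MASK]")  # 80%: replace with [MASK]
            elif roll < 0.9:
                # 10%: replace with a random vocabulary token
                # (this sketch does not exclude drawing the original token again)
                masked.append(random.choice(list(tokenizer.vocab)))
            else:
                masked.append(token)  # 10%: keep the token unchanged
        else:
            targets.append(None)  # not selected: no prediction loss at this position
            masked.append(token)
    return masked, targets

print(mask_tokens(tokenizer.tokenize("Hello I'm a multilingual model.")))
```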

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```