julien-c (HF staff) committed
Commit a652031
1 Parent(s): 423404a

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/cahya/bert-base-indonesian-522M/README.md

Files changed (1): README.md +73 -0

README.md ADDED:
---
language: "id"
license: "mit"
datasets:
- Indonesian Wikipedia
widget:
- text: "Ibu ku sedang bekerja [MASK] supermarket."
---

# Indonesian BERT base model (uncased)

## Model description
This is a BERT base model pre-trained on Indonesian Wikipedia using a masked language modeling (MLM) objective. The model is uncased: it makes no distinction between indonesia and Indonesia.
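
As a quick sanity check of the uncased behavior, the minimal sketch below (the test word is illustrative, not from the original card) should map both capitalizations to identical token ids:
```python
# Minimal sketch: the uncased tokenizer lowercases its input,
# so "Indonesia" and "indonesia" yield identical token ids.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-522M')
print(tokenizer.encode("Indonesia") == tokenizer.encode("indonesia"))  # True
```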
15
+
16
+ This is one of several other language models that have been pre-trained with indonesian datasets. More detail about
17
+ its usage on downstream tasks (text classification, text generation, etc) is available at [Transformer based Indonesian Language Models](https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers)
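
As a hedged sketch only (the linked repository is the authoritative reference), fine-tuning this checkpoint for a downstream task such as text classification could start from a loading step like the one below; `num_labels=2` and the example sentence are assumptions for a hypothetical binary task:
```python
# Hypothetical starting point for text classification; the task head
# is randomly initialized and still needs fine-tuning on labeled data.
from transformers import BertForSequenceClassification, BertTokenizer

model_name = 'cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # assumed binary task

inputs = tokenizer("Contoh kalimat.", return_tensors='pt')  # "Example sentence."
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```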

## Intended uses & limitations

### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cahya/bert-base-indonesian-522M')
>>> unmasker("Ibu ku sedang bekerja [MASK] supermarket")

[{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
  'score': 0.7983310222625732,
  'token': 1495},
 {'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
  'score': 0.090003103017807,
  'token': 17},
 {'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
  'score': 0.025469014421105385,
  'token': 1600},
 {'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
  'score': 0.017966199666261673,
  'token': 1555},
 {'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
  'score': 0.016971781849861145,
  'token': 1572}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

model_name = 'cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."  # "Replace with any text you like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel

model_name = 'cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."  # "Replace with any text you like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
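
For orientation, here is a minimal sketch of inspecting the features extracted in the PyTorch example above; it assumes a recent transformers version that returns a model-output object, and the 768 hidden size of the standard BERT base configuration:
```python
import torch
from transformers import BertTokenizer, BertModel

model_name = 'cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

encoded_input = tokenizer("Silakan diganti dengan text apa saja.", return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    output = model(**encoded_input)

# One 768-dimensional vector per input token (including [CLS] and [SEP])
print(output.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```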

## Training data

This model was pre-trained with 522MB of Indonesian Wikipedia.
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 32,000. The inputs of the model are then of the form:

```[CLS] Sentence A [SEP] Sentence B [SEP]```
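
As an illustration (the sentence pair below is hypothetical), encoding two sentences with the tokenizer reproduces exactly this form:
```python
# Minimal sketch: encoding a sentence pair yields the
# [CLS] ... [SEP] ... [SEP] input form described above.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-522M')
ids = tokenizer.encode("Kalimat pertama.", "Kalimat kedua.")  # "First sentence.", "Second sentence."
print(tokenizer.decode(ids))
# [CLS] kalimat pertama. [SEP] kalimat kedua. [SEP]
```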