File size: 2,734 Bytes
fcdd15e
 
 
 
9acdedb
fcdd15e
 
9acdedb
fcdd15e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
---
language: "id"
license: "mit"
datasets:
- wikipedia
- id_newspapers_2018
widget:
- text: "Ibu ku sedang bekerja [MASK] sawah."
---

# Indonesian BERT base model (uncased) 

## Model description
It is BERT-base model pre-trained with indonesian Wikipedia and indonesian newspapers using a masked language modeling (MLM) objective. This 
model is uncased.

This is one of several other language models that have been pre-trained with indonesian datasets. More detail about 
its usage on downstream tasks (text classification, text generation, etc) is available at [Transformer based Indonesian Language Models](https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers)

## Intended uses & limitations

### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cahya/bert-base-indonesian-1.5G')
>>> unmasker("Ibu ku sedang bekerja [MASK] supermarket")

[{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
  'score': 0.7983310222625732,
  'token': 1495},
 {'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
  'score': 0.090003103017807,
  'token': 17},
 {'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
  'score': 0.025469014421105385,
  'token': 1600},
 {'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
  'score': 0.017966199666261673,
  'token': 1555},
 {'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
  'score': 0.016971781849861145,
  'token': 1572}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

model_name='cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in Tensorflow:
```python
from transformers import BertTokenizer, TFBertModel

model_name='cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

## Training data

This model was pre-trained with 522MB of indonesian Wikipedia and 1GB of
[indonesian newspapers](https://huggingface.co/datasets/id_newspapers_2018).
The texts are lowercased and tokenized using WordPiece and a vocabulary size of 32,000. The inputs of the model are 
then of the form:

```[CLS] Sentence A [SEP] Sentence B [SEP]```