---
language: id
datasets:
- oscar
---
# IndoBERT (Indonesian BERT Model)

## Model description
IndoBERT is a pre-trained language model for Indonesian, based on the BERT architecture.

This is the base-uncased version, which uses the bert-base configuration.

## Intended uses & limitations

#### How to use

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")

# Encode an Indonesian sentence ("hi, I want to eat.") into token IDs
tokenizer.encode("hai aku mau makan.")
# [2, 8078, 1785, 2318, 1946, 18, 4]
```
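Beyond tokenization, the pretrained encoder can be used to extract contextual embeddings. A minimal sketch, assuming the tokenizer and model loaded above and PyTorch as the backend:

```python
import torch

# Run the encoder on a sentence to get contextual embeddings
# (sketch; assumes the tokenizer and model loaded above)
inputs = tokenizer("hai aku mau makan.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states for every token: (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # torch.Size([1, 7, 768]) for a bert-base encoder
```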


## Training data

This model was pre-trained on 16 GB of raw text (about 2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).

The model matches the bert-base architecture and uses a vocabulary of 32,000 tokens.
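As a quick sanity check, the vocabulary size can be read directly from the tokenizer loaded in the usage example above:

```python
# Vocabulary size reported by the tokenizer; should match the 32,000
# stated above (assumes the tokenizer from the usage example)
print(tokenizer.vocab_size)  # 32000
```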

## Training procedure

The model was trained with Google's original TensorFlow code on an eight-core Google Cloud TPU v2. A Google Cloud Storage bucket was used for persistent storage of training data and models.

## Eval results

We evaluated this model on three Indonesian NLP downstream tasks:
- extractive summarization
- sentiment analysis
- part-of-speech tagging

On all three tasks, this model outperformed multilingual BERT.
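
Fine-tuning for one of these tasks follows the usual BERT recipe: a task-specific head on top of the pretrained encoder. A minimal sketch for sentiment analysis, where `num_labels=2` (binary sentiment) is an assumption rather than a detail from the original evaluation:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Sketch: attach a freshly initialised classification head to the
# pretrained encoder (num_labels=2 is an assumed binary-sentiment setup)
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "sarahlintang/IndoBERT", num_labels=2
)

# The new head is untrained at this point; it would be fine-tuned on
# labelled Indonesian sentiment data, e.g. with the Trainer API.
```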