|
--- |
|
language: id |
|
datasets: |
|
- oscar |
|
--- |
|
# IndoBERT (Indonesian BERT Model) |
|
|
|
## Model description |
|
IndoBERT is a pre-trained language model based on the BERT architecture for the Indonesian language.
|
|
|
This model is the base-uncased version, which uses the bert-base configuration.
|
|
|
## Intended uses & limitations |
|
|
|
#### How to use |
|
|
|
```python

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")

model = AutoModel.from_pretrained("sarahlintang/IndoBERT")

# Encode an Indonesian sentence ("hi, I want to eat.") into token IDs.
print(tokenizer.encode("hai aku mau makan."))
# [2, 8078, 1785, 2318, 1946, 18, 4]

```
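Once loaded, the model can also be used to extract contextual embeddings. The snippet below is a minimal sketch using the standard `transformers` API; the expected shape assumes the bert-base hidden size of 768 and the seven-token encoding shown above.

```python

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")

# Tokenize the example sentence and return PyTorch tensors.
inputs = tokenizer("hai aku mau makan.", return_tensors="pt")

# Forward pass without gradient tracking; we only want the embeddings.
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size),
# i.e. (1, 7, 768) for this sentence with a bert-base configuration.
print(outputs.last_hidden_state.shape)

```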
|
|
|
|
|
## Training data |
|
|
|
This model was pre-trained on 16 GB of raw text (~2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).
|
|
|
The model uses the same configuration as bert-base, with a vocabulary size of 32,000.
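As a quick sanity check, the vocabulary size can be read directly off the loaded tokenizer (a minimal sketch; assumes the tokenizer loaded above):

```python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
print(tokenizer.vocab_size)  # expected: 32000, per the training setup above

```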
|
|
|
## Training procedure |
|
|
|
The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.

We used a Google Cloud Storage bucket for persistent storage of training data and models.
|
|
|
## Eval results |
|
|
|
We evaluated this model on three Indonesian NLP downstream tasks:

- extractive summarization

- sentiment analysis

- part-of-speech (POS) tagging

On all three tasks, this model outperformed multilingual BERT.
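For downstream use such as sentiment analysis, the checkpoint can be loaded with a classification head. The sketch below is illustrative only: the label count, example text, and setup are hypothetical and are not the configuration used in our evaluation.

```python

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")

# num_labels=2 is a hypothetical binary sentiment setup (positive/negative);
# the classification head is randomly initialized and must be fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "sarahlintang/IndoBERT", num_labels=2
)

inputs = tokenizer(
    ["makanan ini enak sekali"],  # "this food is delicious" (hypothetical example)
    return_tensors="pt", padding=True, truncation=True,
)

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # probabilities from the untrained head, for illustration only

```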
|
|