IndoBERT (Indonesian BERT Model)

Model description

IndoBERT is a pre-trained language model for the Indonesian language, based on the BERT architecture.

This is the base, uncased version of the model, which uses the bert-base configuration.

Intended uses & limitations

How to use

from transformers import AutoTokenizer, AutoModel
# Load the tokenizer and the pre-trained weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")
# Encode a sentence into token IDs
tokenizer.encode("hai aku mau makan.")
# Output: [2, 8078, 1785, 2318, 1946, 18, 4]

Training data

This model was pre-trained on 16 GB of raw text (about 2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).

The model has the same architecture as the bert-base model and uses a vocabulary of 32,000 tokens.
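
To give a feel for what a 32,000-token vocabulary means for model size, the sketch below estimates the number of encoder parameters. It assumes the standard bert-base hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 segment types); these values are not stated in this model card beyond "bert-base config", and the function name is hypothetical. The pre-training heads (masked LM and next-sentence prediction) are not counted.

```python
def bert_parameter_count(vocab_size, hidden=768, layers=12,
                         intermediate=3072, max_positions=512, type_vocab=2):
    """Estimate the parameter count of a BERT encoder (embeddings, layers, pooler).

    Default values are the usual bert-base hyperparameters (an assumption here).
    """
    layer_norm = 2 * hidden  # scale and bias vectors

    # Word, position, and segment embeddings, followed by one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: expand to `intermediate`, project back to `hidden`
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + layer_norm

    # Pooler: one dense layer applied to the first token's hidden state
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    total = bert_parameter_count(vocab_size=32000)
    print(f"{total:,} parameters")  # prints "110,617,344 parameters"
```

Under these assumptions the word-embedding matrix alone accounts for 32,000 × 768 = 24,576,000 parameters, a little over a fifth of the total.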

Training procedure

The model was trained using Google’s original TensorFlow code on an eight-core Google Cloud TPU v2. We used a Google Cloud Storage bucket for persistent storage of training data and models.

Eval results

We evaluated this model on three Indonesian NLP downstream tasks:

extractive summarization
sentiment analysis
part-of-speech tagging

In our evaluation, this model outperformed multilingual BERT on all three downstream tasks.
