IndoBERT (Indonesian BERT Model)

Model description

IndoBERT is a pre-trained language model for the Indonesian language, based on the BERT architecture.

This is the base, uncased version, which uses the bert-base configuration.

Intended uses & limitations

How to use

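from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")

# Encode a sentence into token ids (special tokens are added by default)
tokenizer.encode("hai aku mau makan.")
# [2, 8078, 1785, 2318, 1946, 18, 4]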

Training data

This model was pre-trained on 16 GB of raw text (about 2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).

The model follows the bert-base architecture and has a vocabulary of 32,000 tokens.
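
To give a sense of what a 32,000-token vocabulary means for model size, the sketch below counts the parameters of a BERT encoder plus pooler. It assumes the standard bert-base hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 segment types); the card only says "bert-base config", so these values are an assumption, and `bert_parameter_count` is a hypothetical helper, not part of any library.

```python
def bert_parameter_count(vocab_size, hidden=768, layers=12,
                         intermediate=3072, max_positions=512, type_vocab=2):
    """Count parameters of a BERT encoder with pooler (no pre-training heads).

    Defaults are the usual bert-base values, assumed here rather than
    taken from the model card.
    """
    layer_norm = 2 * hidden  # scale and bias vectors

    # Word, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value, and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers of the feed-forward block, then LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Dense layer applied to the first token's final hidden state.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


indobert_params = bert_parameter_count(32_000)
print(f"{indobert_params:,}")  # 110,617,344
```

Under these assumptions the word-embedding matrix alone accounts for 32,000 × 768 = 24,576,000 parameters, so the vocabulary size is the main source of any difference in size from other bert-base models.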

Training procedure

The model was trained with Google's original TensorFlow code on an eight-core Google Cloud TPU v2. We used a Google Cloud Storage bucket for persistent storage of training data and models.
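
As a rough illustration of how such a setup fits together, the sketch below assembles a command line for `run_pretraining.py` from the google-research/bert repository, with inputs and outputs stored in a Cloud Storage bucket. The bucket name, TPU name, and directory layout are hypothetical placeholders, and `build_pretraining_command` is a helper invented for this example; the actual paths and hyperparameters used for IndoBERT are not stated here, so none are shown.

```python
import shlex


def build_pretraining_command(bucket, tpu_name, num_tpu_cores=8):
    """Build an argument list for run_pretraining.py (google-research/bert).

    `bucket` and `tpu_name` are placeholders; the directory layout under
    the bucket is an assumption made for illustration only.
    """
    return [
        "python", "run_pretraining.py",
        # Pre-processed TFRecord shards stored in the bucket (assumed layout).
        f"--input_file={bucket}/tfrecords/*.tfrecord",
        # Checkpoints are written back to the same bucket.
        f"--output_dir={bucket}/checkpoints",
        f"--bert_config_file={bucket}/bert_config.json",
        "--do_train=True",
        "--use_tpu=True",
        f"--tpu_name={tpu_name}",
        f"--num_tpu_cores={num_tpu_cores}",
    ]


command = build_pretraining_command("gs://my-bucket", "my-tpu")
print(shlex.join(command))
```

Batch size, sequence length, learning rate, and step counts are deliberately left out because the card does not report them.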

Eval results

We evaluated this model on three Indonesian NLP downstream tasks:

  • extractive summarization
  • sentiment analysis
  • part-of-speech tagging

In these evaluations, the model outperformed multilingual BERT on all three downstream tasks.