IndoBERT-Lite Base Model (phase2 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. It is pretrained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.
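
To make the MLM objective more concrete, here is a minimal sketch of how input tokens might be corrupted before the model is asked to recover them. It assumes the masking recipe from the original BERT work (about 15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged). This card does not state IndoBERT's exact settings, and `mask_tokens` is a hypothetical helper, not part of any library.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for MLM training (illustrative only).

    Returns (inputs, labels). labels[i] holds the original token where
    the model must make a prediction, and None everywhere else.
    The 15% / 80-10-10 split is an assumption taken from the original
    BERT recipe, not a documented IndoBERT setting.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # usual case: mask it out
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # sometimes: random token
            else:
                inputs.append(tok)                # sometimes: keep as-is
        else:
            inputs.append(tok)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels
```

For example, `mask_tokens("aku adalah anak yang rajin".split(), vocab=["aku", "anak", "rajin"])` returns a corrupted copy of the sentence together with the positions the model would be scored on.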

All Pre-trained Models

Model                                 #params  Arch.  Training data
indobenchmark/indobert-base-p1        124.5M   Base   Indo4B (23.43 GB of text)
indobenchmark/indobert-base-p2        124.5M   Base   Indo4B (23.43 GB of text)
indobenchmark/indobert-large-p1       335.2M   Large  Indo4B (23.43 GB of text)
indobenchmark/indobert-large-p2       335.2M   Large  Indo4B (23.43 GB of text)
indobenchmark/indobert-lite-base-p1   11.7M    Base   Indo4B (23.43 GB of text)
indobenchmark/indobert-lite-base-p2   11.7M    Base   Indo4B (23.43 GB of text)
indobenchmark/indobert-lite-large-p1  17.7M    Large  Indo4B (23.43 GB of text)
indobenchmark/indobert-lite-large-p2  17.7M    Large  Indo4B (23.43 GB of text)

How to use

Load model and tokenizer

from transformers import BertTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-lite-base-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-lite-base-p2")

Extract contextual representation

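import torch
# Token IDs with a batch dimension, shape (1, seq_len); model(x)[0] is the last hidden state
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)
print(x, model(x)[0].sum())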
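
The snippet above yields one vector per token. If a single vector per sentence is needed, one common option is to average the token vectors while ignoring padding. The sketch below is one way this could be done and is not a method prescribed by the IndoBERT authors; `mean_pool` and `embed` are hypothetical helper names, and `embed` assumes a tokenizer and model loaded as shown earlier.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, skipping padded positions.

    last_hidden_state: float tensor of shape (batch, seq_len, hidden)
    attention_mask:    tensor of shape (batch, seq_len), 1 = real token
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)         # avoid division by zero
    return summed / counts


def embed(texts, tokenizer, model):
    """Return one vector per input text (hypothetical helper)."""
    # padding=True pads every sentence to the longest one in the batch
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        hidden = model(**batch)[0]
    return mean_pool(hidden, batch["attention_mask"])
```

With the tokenizer and model from the loading step, `embed(["aku adalah anak sekolah", "selamat pagi"], tokenizer, model)` would return a tensor with one row per sentence.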


IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti.


