Melayu BERT

Melayu BERT is a masked language model based on BERT. It was trained on the OSCAR dataset, specifically the unshuffled_original_ms subset. The model started from the pretrained English BERT model and was fine-tuned on the Malaysian dataset. It achieved a perplexity of 9.46 on a 20% validation split. Many of the techniques used are based on a Hugging Face tutorial notebook written by Sylvain Gugger and a fine-tuning tutorial notebook written by Pierre Guillou. The model is available for both PyTorch and TensorFlow.
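Perplexity is the exponential of the average cross-entropy loss (in nats) on the validation set. A minimal sketch of that relationship, where the loss value 2.247 is back-computed here from the reported perplexity for illustration, not taken from the training logs:

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats).
# A validation loss of about 2.247 (back-computed for this
# illustration) corresponds to the reported perplexity of ~9.46.
val_loss = 2.247
perplexity = math.exp(val_loss)
print(round(perplexity, 2))  # → 9.46
```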


The model was trained for 3 epochs with a learning rate of 2e-3 and achieved the training loss per step shown below.

Step Training loss
500 5.051300
1000 3.701700
1500 3.288600
2000 3.024000
2500 2.833500
3000 2.741600
3500 2.637900
4000 2.547900
4500 2.451500
5000 2.409600
5500 2.388300
6000 2.351600
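As a quick sanity check on the training run, the logged loss decreases at every reported step. A small sketch over the values copied from the table above:

```python
# Training losses reported in the table above, keyed by step.
steps = [500, 1000, 1500, 2000, 2500, 3000,
         3500, 4000, 4500, 5000, 5500, 6000]
losses = [5.0513, 3.7017, 3.2886, 3.0240, 2.8335, 2.7416,
          2.6379, 2.5479, 2.4515, 2.4096, 2.3883, 2.3516]

# The curve decreases monotonically across logging steps.
assert all(a > b for a, b in zip(losses, losses[1:]))

# Total improvement over the run.
print(f"loss dropped by {losses[0] - losses[-1]:.4f}")  # → loss dropped by 2.6997
```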

How to Use

As Masked Language Model

from transformers import pipeline

pretrained_name = "StevenLimcorn/MelayuBERT"

# Build a fill-mask pipeline from the pretrained checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name,
)

fill_mask("Saya [MASK] makan nasi hari ini.")

Import Tokenizer and Model

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("StevenLimcorn/MelayuBERT")
model = AutoModelForMaskedLM.from_pretrained("StevenLimcorn/MelayuBERT")


Melayu BERT was trained by Steven Limcorn and Wilson Wongso.


Dataset used to train StevenLimcorn/MelayuBERT: OSCAR (unshuffled_original_ms)