Melayu BERT is a masked language model based on BERT. It was trained on the OSCAR dataset, specifically the unshuffled_original_ms subset. The model was initialized from an English BERT model and fine-tuned on the Malay dataset, achieving a perplexity of 9.46 on a 20% validation split. Many of the techniques used are based on a Hugging Face tutorial notebook written by Sylvain Gugger and a fine-tuning tutorial notebook written by Pierre Guillou. The model is available for use with both PyTorch and TensorFlow.
The model was trained for 3 epochs with a learning rate of 2e-3.
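For reference, a setup like this can be reproduced with the standard Hugging Face masked-LM fine-tuning recipe from the tutorials cited above. The sketch below is illustrative, not the author's exact script: the `bert-base-uncased` checkpoint, sequence length, batch size, and output directory are assumptions, while the dataset name, the 20% validation split, the 3 epochs, and the 2e-3 learning rate come from the text.

```python
import math

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed starting checkpoint: an English BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# OSCAR Malay subset, with 20% held out for validation as described above
dataset = load_dataset("oscar", "unshuffled_original_ms")
splits = dataset["train"].train_test_split(test_size=0.2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = splits.map(tokenize, batched=True, remove_columns=["id", "text"])

# Randomly mask tokens for the masked-LM objective (15% is the BERT default)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="melayu-bert",        # assumption
    num_train_epochs=3,              # from the text
    learning_rate=2e-3,              # from the text
    per_device_train_batch_size=16,  # assumption
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
trainer.train()

# Perplexity is the exponential of the validation cross-entropy loss
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Perplexity: {math.exp(eval_loss):.2f}")
```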
The model can be used directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

pretrained_name = "StevenLimcorn/MelayuBERT"

# Load a fill-mask pipeline backed by MelayuBERT
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name,
)

# Predict the masked word in a Malay sentence
fill_mask("Saya [MASK] makan nasi hari ini.")
```
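Each call to the pipeline returns a list of top candidates for the masked position; the predictions and their scores can be inspected like this:

```python
# Each prediction is a dict with the completed sequence, the predicted
# token, and its probability score
for prediction in fill_mask("Saya [MASK] makan nasi hari ini."):
    print(f"{prediction['token_str']}: {prediction['score']:.4f}")
```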
The tokenizer and model can also be loaded directly:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("StevenLimcorn/MelayuBERT")
model = AutoModelForMaskedLM.from_pretrained("StevenLimcorn/MelayuBERT")
```
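With the model loaded this way, the masked token can be predicted without the pipeline; a minimal PyTorch sketch:

```python
import torch

# Tokenize a sentence containing the [MASK] token
inputs = tokenizer("Saya [MASK] makan nasi hari ini.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring token there
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```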