roberta-base-bahasa-cased

Pretrained RoBERTa base language model for Malay.

Pretraining Corpus

The roberta-base-bahasa-cased model was pretrained on ~400 million words. The training data consists of:

  1. IIUM confession, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  2. local Instagram, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  3. local news, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  4. local parliament hansards, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  5. local research papers related to kebudayaan (culture), keagamaan (religion) and etnik (ethnicity), https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  6. local Twitter, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  7. Malay Wattpad, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  8. Malay Wikipedia, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean

Pretraining details

Example using AutoModelForMaskedLM

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the pretrained masked-language model and its tokenizer from the Hub.
model = AutoModelForMaskedLM.from_pretrained('mesolitica/roberta-base-bahasa-cased')
tokenizer = AutoTokenizer.from_pretrained(
    'mesolitica/roberta-base-bahasa-cased',
    do_lower_case=False,  # the model is cased
)

# Build a fill-mask pipeline and predict the token behind <mask>.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill_mask('Permohonan Najib, anak untuk dengar isu perlembagaan <mask> .')

Output:

[{'score': 0.3368818759918213,
  'token': 746,
  'token_str': ' negara',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan negara.'},
 {'score': 0.09646568447351456,
  'token': 598,
  'token_str': ' Malaysia',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan Malaysia.'},
 {'score': 0.029483484104275703,
  'token': 3265,
  'token_str': ' UMNO',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan UMNO.'},
 {'score': 0.026470622047781944,
  'token': 2562,
  'token_str': ' parti',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan parti.'},
 {'score': 0.023237623274326324,
  'token': 391,
  'token_str': ' ini',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan ini.'}]
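
Each entry in the output above carries a `token_str` with a leading space, so downstream code will usually want to strip it before use. The following is a minimal sketch of such post-processing, assuming the result list has the shape shown above. `top_candidates` and `min_score` are hypothetical names introduced here for illustration, and `sample_output` is copied from the first three entries of the output above so the snippet runs without downloading the model.

```python
# First three entries copied from the output shown above.
sample_output = [
    {'score': 0.3368818759918213,
     'token': 746,
     'token_str': ' negara',
     'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan negara.'},
    {'score': 0.09646568447351456,
     'token': 598,
     'token_str': ' Malaysia',
     'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan Malaysia.'},
    {'score': 0.029483484104275703,
     'token': 3265,
     'token_str': ' UMNO',
     'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan UMNO.'},
]


def top_candidates(results, min_score=0.05):
    """Hypothetical helper: keep (token, score) pairs scoring at least min_score.

    The leading space in token_str is stripped and the score is rounded
    to four decimal places for display.
    """
    return [
        (r['token_str'].strip(), round(r['score'], 4))
        for r in results
        if r['score'] >= min_score
    ]


# Print the candidates that pass the default threshold, one per line.
for token, score in top_candidates(sample_output):
    print(f'{token}\t{score}')
```

In real use, the list returned by `fill_mask(...)` would be passed in place of `sample_output`. The 0.05 threshold is an arbitrary choice for this sketch, not a recommendation from the model authors.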