
xlnet-large-bahasa-cased

Pretrained XLNet large language model for Malay.

Pretraining Corpus

The xlnet-large-bahasa-cased model was pretrained on ~1.4 billion words. Below is the list of data we trained on:

  1. cleaned local texts.
  2. translated The Pile.

Pretraining details

Load Pretrained Model

You can use this model by installing PyTorch or TensorFlow together with the Hugging Face transformers library, and then loading it directly like this:

from transformers import XLNetModel, XLNetTokenizer

model = XLNetModel.from_pretrained('malay-huggingface/xlnet-large-bahasa-cased')
tokenizer = XLNetTokenizer.from_pretrained(
    'malay-huggingface/xlnet-large-bahasa-cased',
    do_lower_case = False,  # the model is cased, so keep the original casing
)
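
As a quick sanity check that the weights load correctly, here is a minimal sketch of encoding a sentence and inspecting the contextual embeddings. It assumes PyTorch as the backend; the Malay sentence is just an arbitrary example.

import torch
from transformers import XLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained(
    'malay-huggingface/xlnet-large-bahasa-cased',
    do_lower_case = False,
)
model = XLNetModel.from_pretrained('malay-huggingface/xlnet-large-bahasa-cased')

# Tokenize an example Malay sentence and run it through the model.
inputs = tokenizer('Saya suka makan nasi lemak.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings for every token: (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)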