Pre-trained Masked Language Model for Vietnamese Nôm

A masked language model for Nôm script is a language model specialized for text written in Chữ Nôm: given a passage in which some characters are hidden, it predicts the missing characters from the surrounding context. Chữ Nôm is a logographic writing system that was used in Vietnam from the 13th to the early 20th century, before the Latin-based Vietnamese alphabet came into general use.

Like other masked language models, such as BERT and RoBERTa, the Chữ Nôm masked language model is trained on a large corpus of Chữ Nôm texts. During training, some tokens in each passage are hidden and the model learns to recover them, which teaches it the statistical patterns, contextual relationships, and semantic meanings of characters and words in the Chữ Nôm script.
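The masking step behind this kind of training can be illustrated with a minimal sketch. It is not the author's training code: the helper `mask_characters`, the 15% masking rate (the usual BERT/RoBERTa convention, assumed here), and masking raw characters instead of tokenizer tokens are all simplifications made for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # mask string used by RoBERTa-style tokenizers


def mask_characters(text, mask_rate=0.15, seed=0):
    """Hide a random subset of characters (hypothetical helper).

    Returns (masked, labels): `masked` is the character list with some
    entries replaced by MASK_TOKEN, and `labels` holds the original
    character at each masked position and None everywhere else.
    """
    chars = list(text)
    if not chars:
        return [], []
    rng = random.Random(seed)
    # Mask at least one position so short lines still give a training signal
    n_to_mask = max(1, round(len(chars) * mask_rate))
    positions = set(rng.sample(range(len(chars)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else c for i, c in enumerate(chars)]
    labels = [c if i in positions else None for i, c in enumerate(chars)]
    return masked, labels


masked, labels = mask_characters("如㗂䳽𠖤戈")
print("".join(masked))  # the input line with one character hidden
print(labels)           # the hidden character, which the model must predict
```

A real training pipeline would operate on token ids and typically redraw the masked positions on every pass over the data; the sketch only shows the input/label relationship the model is trained on.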

The model was trained on literary works and poetry, including: Bai ca ran co bac, Buom hoa tan truyen, Chinh phu ngam, Gia huan ca, Ho Xuan Huong, Luc Van Tien, Tale of Kieu (1870), Tale of Kieu (1871), and Tale of Kieu (1902), among others.

How to use the model

from transformers import RobertaTokenizerFast, RobertaForMaskedLM
import torch
# Load the tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained('minhtoan/roberta-masked-lm-vietnamese-nom')
# Load the model
model = RobertaForMaskedLM.from_pretrained('minhtoan/roberta-masked-lm-vietnamese-nom')

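# Input line with one masked character
text = '<mask>如㗂䳽𠖤戈'
inputs = tokenizer(text, return_tensors="pt")
# Position of the <mask> token in the input sequence
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits
# Scores over the vocabulary at the masked position
mask_token_logits = logits[0, mask_token_index, :]
# Decode the highest-scoring token
print("Predicted word:", tokenizer.decode(mask_token_logits[0].argmax()))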
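The example above prints only the single best candidate. When several candidates are of interest, the scores at the masked position can be turned into probabilities and ranked. The sketch below uses plain Python and made-up scores so that it runs without downloading the model; `top_k_ids` is a hypothetical helper, not part of `transformers`.

```python
import math


def top_k_ids(logits, k=5):
    """Return the k highest-scoring (token_id, probability) pairs.

    `logits` is a flat list of scores for one masked position.
    """
    # Numerically stable softmax: subtract the maximum before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy scores over a four-entry vocabulary (illustrative, not real model output)
toy_logits = [2.0, 0.5, 3.0, -1.0]
for token_id, prob in top_k_ids(toy_logits, k=2):
    print(token_id, round(prob, 3))
```

With the model loaded as above, `mask_token_logits[0].tolist()` could be passed as `logits`, and each returned id could be turned back into text with `tokenizer.decode([token_id])`.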

Author

Phan Minh Toan