## Description

This is a Unigram tokenizer that supports both English and Vietnamese. In addition to tokenization, it also performs Vietnamese diacritics normalization, for example `hóa → hoá` and `hủy → huỷ`.
## Details

### Library used to train

https://github.com/google/sentencepiece

### Training Data

https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer
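If you want to regenerate the training corpus file yourself, a sketch like the following can export the dataset to plain text. This is not part of the original pipeline; the `train` split name and `text` column name are assumptions that may need adjusting to the actual dataset.

```python
# Sketch: export the HF dataset to the plain-text corpus expected by spm_train.
# Assumptions: the dataset has a "train" split with a "text" column.
from datasets import load_dataset

dataset = load_dataset("levuloihust/vien-corpus-for-tokenizer", split="train")

with open("vien-corpus.txt", "w", encoding="utf-8") as writer:
    for record in dataset:
        line = record["text"].strip()
        if line:
            writer.write(line + "\n")
```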
### Training script
```bash
./spm_train \
--input=vien-corpus.txt \
--model_prefix=vien \
--vocab_size=64000 \
--user_defined_symbols_file=user_defined_symbols.txt \
--required_chars_file=required_chars.txt \
--unk_surface="<unk>" \
--byte_fallback=false \
--split_by_unicode_script=true \
--split_by_number=true \
--split_digits=true \
--normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```
`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt`, and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo.

The training script should be run on a machine with 64GB of RAM. After training, we get two files: `vien.model` and `vien.vocab`.
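As an optional sanity check (not part of the original instructions), the trained model can be loaded with the `sentencepiece` Python bindings and applied to a sample sentence:

```python
# Sanity-check sketch: load the trained SPM model and tokenize a mixed
# English/Vietnamese sentence.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="vien.model")
print(sp.encode("How are you? Thời tiết hôm nay đẹp.", out_type=str))  # subword pieces
print(sp.encode("How are you? Thời tiết hôm nay đẹp.", out_type=int))  # token ids
```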
## Convert SPM model to HuggingFace tokenizer

Run the following Python script to convert the SPM model into a HuggingFace tokenizer.
```python
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>"
)
tokenizer.save_pretrained("assets/hf-tokenizer")
```
Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.
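To verify the conversion (a quick check, not from the original guide), the saved tokenizer can be reloaded from disk and applied to a sample sentence:

```python
# Verification sketch: reload the converted tokenizer and tokenize a sample.
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer.from_pretrained("assets/hf-tokenizer")
print(tokenizer.tokenize("Thời tiết hôm nay đẹp."))
```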
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```
Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform the normalization (note that in the usage example above, `wóa` was normalized to `woá`).
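For downstream use (an assumed example, not shown in the original card), the tokenizer can also encode text into input IDs and decode them back:

```python
# Encoding/decoding sketch with the slow (use_fast=False) tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
encoded = tokenizer("Thời tiết hôm nay đẹp wóa trời lun =))")
print(encoded["input_ids"])                    # token ids, including special tokens
print(tokenizer.decode(encoded["input_ids"]))  # decoded text (diacritics normalized)
```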
## Contact information
For personal communication related to this project, please contact Loi Le Vu (levuloihust@gmail.com).