## Description

This is a Unigram tokenizer that supports both English and Vietnamese. In addition to tokenization, it also performs Vietnamese diacritics normalization, for example `hóa → hoá` and `hủy → huỷ`.
## Details

### Library used to train

https://github.com/google/sentencepiece

### Training Data

https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer
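If you want to regenerate the training corpus file yourself, a sketch like the following can export the dataset to plain text. This is not part of the original pipeline; the `train` split name and `text` column name are assumptions that may need adjusting to the actual dataset.

```python
# Sketch: export the HF dataset to the plain-text corpus expected by spm_train.
# Assumptions: the dataset has a "train" split with a "text" column.
from datasets import load_dataset

dataset = load_dataset("levuloihust/vien-corpus-for-tokenizer", split="train")

with open("vien-corpus.txt", "w", encoding="utf-8") as writer:
    for record in dataset:
        line = record["text"].strip()
        if line:
            writer.write(line + "\n")
```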
### Training script
```bash
./spm_train \
--input=vien-corpus.txt \
--model_prefix=vien \
--vocab_size=64000 \
--user_defined_symbols_file=user_defined_symbols.txt \
--required_chars_file=required_chars.txt \
--unk_surface="<unk>" \
--byte_fallback=false \
--split_by_unicode_script=true \
--split_by_number=true \
--split_digits=true \
--normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```
`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt`, and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo.

The training script should be run on a machine with 64GB of RAM. After training, we get two files: `vien.model` and `vien.vocab`.
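As an optional sanity check (not part of the original instructions), the trained model can be loaded with the `sentencepiece` Python bindings and applied to a sample sentence:

```python
# Sanity-check sketch: load the trained SPM model and tokenize a mixed
# English/Vietnamese sentence.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="vien.model")
print(sp.encode("How are you? Thời tiết hôm nay đẹp.", out_type=str))  # subword pieces
print(sp.encode("How are you? Thời tiết hôm nay đẹp.", out_type=int))  # token ids
```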
## Convert SPM model to HuggingFace tokenizer

Run the following Python script to convert the SPM model into a HuggingFace tokenizer.
```python
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>"
)
tokenizer.save_pretrained("assets/hf-tokenizer")
```
Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.
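To verify the conversion (a quick check, not from the original guide), the saved tokenizer can be reloaded from disk and applied to a sample sentence:

```python
# Verification sketch: reload the converted tokenizer and tokenize a sample.
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer.from_pretrained("assets/hf-tokenizer")
print(tokenizer.tokenize("Thời tiết hôm nay đẹp."))
```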
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```
Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform the normalization (note that in the usage example above, `wóa` was normalized to `woá`).
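For downstream use (an assumed example, not shown in the original card), the tokenizer can also encode text into input IDs and decode them back:

```python
# Encoding/decoding sketch with the slow (use_fast=False) tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
encoded = tokenizer("Thời tiết hôm nay đẹp wóa trời lun =))")
print(encoded["input_ids"])                    # token ids, including special tokens
print(tokenizer.decode(encoded["input_ids"]))  # decoded text (diacritics normalized)
```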
## Contact information
For personal communication related to this project, please contact Loi Le Vu (levuloihust@gmail.com).