---
language:
- vi
- en
---

# Description

This is a [Unigram](https://arxiv.org/pdf/1804.10959.pdf) tokenizer that supports both English and Vietnamese.

In addition to tokenization, it also normalizes Vietnamese diacritics, for example `hóa → hoá` and `hủy → huỷ`.
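
As a quick, hedged illustration of the diacritics normalization (assuming the tokenizer has been uploaded as `levuloihust/vien-unigram-tokenizer`, the repo used in the Usage section below):
```python
from transformers import AutoTokenizer

# Slow (sentencepiece-backed) tokenizer; use_fast=False is required for the
# custom normalization rules to be applied (see the note in the Usage section).
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)

# The pieces in the output should carry the normalized diacritic placement,
# e.g. "hóa" becomes "hoá" and "hủy" becomes "huỷ".
print(tokenizer.tokenize("hóa chất bị hủy"))
```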

# Details

## Library used to train
https://github.com/google/sentencepiece

## Training data
https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer
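
One way to fetch the corpus locally is through `huggingface_hub` (a sketch; the file name `vien-corpus.txt` is an assumption based on the `--input` flag in the training command below, so check the dataset repo for the actual name):
```python
from huggingface_hub import hf_hub_download

# Download the training corpus from the dataset repo.
# NOTE: "vien-corpus.txt" is assumed from the --input flag below; the actual
# file name inside the dataset repo may differ.
corpus_path = hf_hub_download(
    repo_id="levuloihust/vien-corpus-for-tokenizer",
    repo_type="dataset",
    filename="vien-corpus.txt",
)
print(corpus_path)
```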

## Training script
```bash
./spm_train \
    --input=vien-corpus.txt \
    --model_prefix=vien \
    --vocab_size=64000 \
    --user_defined_symbols_file=user_defined_symbols.txt \
    --required_chars_file=required_chars.txt \
    --unk_surface="<unk>" \
    --byte_fallback=false \
    --split_by_unicode_script=true \
    --split_by_number=true \
    --split_digits=true \
    --normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```
`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt`, and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo.
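
If you would rather not build the SentencePiece binaries, the same flags can in principle be passed through the `sentencepiece` Python package. The sketch below mirrors the command above; it is not the exact command used to train this model:
```python
import sentencepiece as spm

# Mirrors the spm_train command above; keyword arguments are forwarded to the
# trainer as the corresponding command-line flags.
spm.SentencePieceTrainer.train(
    input="vien-corpus.txt",
    model_prefix="vien",
    vocab_size=64000,
    user_defined_symbols_file="user_defined_symbols.txt",
    required_chars_file="required_chars.txt",
    unk_surface="<unk>",
    byte_fallback=False,
    split_by_unicode_script=True,
    split_by_number=True,
    split_digits=True,
    normalization_rule_tsv="nmt_nfkc_vidiacritic.tsv",
)
```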

The training script should be run on a machine with at least 64 GB of RAM. After training, we get two files: `vien.model` and `vien.vocab`.
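
A quick sanity check of the trained model with the `sentencepiece` Python package (a sketch, assuming `vien.model` is in the current directory):
```python
import sentencepiece as spm

# Load the trained model and inspect it.
sp = spm.SentencePieceProcessor(model_file="vien.model")
print(sp.vocab_size())  # expected to match --vocab_size=64000 above
print(sp.encode("Thời tiết hôm nay đẹp quá", out_type=str))  # surface-form pieces
```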

## Convert the SPM model to a HuggingFace tokenizer

Run the following Python script to convert the SPM model into a HuggingFace tokenizer.
```python
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>"
)
tokenizer.save_pretrained("assets/hf-tokenizer")
```
Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.
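
To check the conversion, the saved tokenizer can be loaded back from the output directory (a sketch using the example `assets/hf-tokenizer` path from above):
```python
from transformers import AutoTokenizer

# Load the converted tokenizer from the local output directory and try it out.
tokenizer = AutoTokenizer.from_pretrained("assets/hf-tokenizer", use_fast=False)
print(tokenizer.tokenize("Hello, thời tiết hôm nay đẹp quá!"))
```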

## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```

Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform diacritics normalization (note how, in the example above, `wóa` was normalized to `woá`).
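
For comparison, here is a sketch of the same call with the default fast tokenizer; per the note above, the Vietnamese diacritics normalization is not applied in this case:
```python
from transformers import AutoTokenizer

# Default fast tokenizer: per the note above, it does not apply the custom
# diacritics normalization, so "wóa" stays as-is instead of becoming "woá".
fast_tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer")
print(fast_tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))"))
```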

# Contact information
For personal communication related to this project, please contact Loi Le Vu (levuloihust@gmail.com).