This model is a Hybrid CTC/Attention model with a pre-trained HuBERT encoder.
For evaluation, the metrics are CER and WER. Before the WER evaluation, transcriptions were re-tokenized using the newmm tokenizer in [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp).

In this repository, we also provide the vocabulary for building the newmm tokenizer using this script:
```python
from pythainlp import Tokenizer

def get_tokenizer(vocab):
    # Build a newmm tokenizer restricted to the provided custom vocabulary
    custom_vocab = set(vocab)
    custom_tokenizer = Tokenizer(custom_vocab, engine='newmm')
    return custom_tokenizer

# Read the vocabulary file (one word per line)
with open(<vocab_path>, 'r', encoding='utf-8') as f:
    vocab = []
    for line in f.readlines():
        vocab.append(line.strip())

custom_tokenizer = get_tokenizer(vocab)

tokenized_sentence_list = custom_tokenizer.word_tokenize(<your_sentence>)
```

|Micro CER|Macro CER|Survival CER|E-commerce CER|Micro WER|Macro WER|Survival WER|E-commerce WER|
|---|---|---|---|---|---|---|---|
|5.35|5.65|6.29|5.02|7.53|8.73|11.38|6.09|
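
The CER and WER figures above follow the usual definitions: edit distance normalized by reference length, counted over characters for CER and over newmm tokens for WER. The sketch below illustrates that computation; it is a minimal illustration, not the exact evaluation script used to produce the table:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two sequences (single-row DP)
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    # Character Error Rate: character-level edit distance / reference length
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref_tokens, hyp_tokens):
    # Word Error Rate over token lists, e.g. output of the newmm tokenizer above
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

print(cer("hello", "hallo"))       # 1 substitution over 5 chars -> 0.2
print(wer(["a", "b", "c"], ["a", "c"]))  # 1 deletion over 3 tokens
```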