Update README.md
README.md
CHANGED
@@ -8,6 +8,8 @@ The tokenizer is trained on Khmer/English only. The corpus was trained with approx

The model card has a vocab size of 7152, and its tokenizer type is Byte Pair Encoding (BPE).

+ Looking at the well-known tokenizers, it is clear that non-English words are barely represented in their pretrained vocabularies. Therefore, long-text translation from one language to another is close to impossible with them.
+
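As a rough illustration of that point, the sketch below runs a short Khmer string through GPT-2's byte-level BPE tokenizer via the Hugging Face transformers package; the sample sentences and the resulting token counts are only illustrative and are not taken from this model card.

```python
# Rough sketch: how an English-centric BPE vocabulary handles Khmer text.
# Assumes the `transformers` package is installed; the sample strings below
# are illustrative stand-ins, not the example sentence from this README.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")  # ~50k vocab, mostly English merges

english = "The tokenizer is trained on Khmer/English only."
khmer = "ខ្ញុំរៀនភាសាខ្មែរ"  # "I am learning the Khmer language"

# Khmer characters are three UTF-8 bytes each and have almost no merges in
# GPT-2's vocabulary, so the Khmer string typically splits into far more
# tokens per character than the English one.
print(len(gpt2_tok.encode(english)), "tokens for the English sentence")
print(len(gpt2_tok.encode(khmer)), "tokens for the Khmer sentence")
```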

text_example = "αααααααΆααααα»ααΆααΆαααααααααααααΆααα·α αα ααΆαααααααα·α αα αα ααααα·ααααα·ααΆα"

[970, 273, 298, 420, 1583, 397, 284, 343, 259, 453, 397, 418, 1904, 259, 317]
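For reference, here is a minimal sketch of how the encoding above could be reproduced with the Hugging Face tokenizers library, assuming the trained BPE tokenizer was exported to a tokenizer.json file; the file name is an assumption, not something stated in this README.

```python
# Minimal sketch of reproducing the encoding above with the `tokenizers`
# library. The file name "tokenizer.json" is an assumed export path for the
# trained BPE tokenizer; adjust it to the actual artifact.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
print(tokenizer.get_vocab_size())  # expected to be 7152 for this model card

text_example = "..."  # paste the Khmer sample sentence from the README here

encoding = tokenizer.encode(text_example)
print(encoding.tokens)  # the BPE pieces
print(encoding.ids)     # token ids, e.g. [970, 273, 298, ...] for the sample above
```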