Update README.md
README.md
CHANGED
@@ -8,6 +8,8 @@ The tokenizer is trained on Khmer/English only. The corpus was trained with approx

The model card has a vocab size of 7152, and its tokenizer type is Byte Pair Encoding (BPE).

+ Looking at the well-known tokenizers, it is clear that non-English words are barely represented in their pretrained vocabularies. Therefore, long-text translation from one language to another is close to impossible with them.
+
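As a rough illustration of that point, the sketch below runs a short Khmer string through GPT-2's byte-level BPE tokenizer via the Hugging Face transformers package; the sample sentences and the resulting token counts are only illustrative and are not taken from this model card.

```python
# Rough sketch: how an English-centric BPE vocabulary handles Khmer text.
# Assumes the `transformers` package is installed; the sample strings below
# are illustrative stand-ins, not the example sentence from this README.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")  # ~50k vocab, mostly English merges

english = "The tokenizer is trained on Khmer/English only."
khmer = "ខ្ញុំរៀនភាសាខ្មែរ"  # "I am learning the Khmer language"

# Khmer characters are three UTF-8 bytes each and have almost no merges in
# GPT-2's vocabulary, so the Khmer string typically splits into far more
# tokens per character than the English one.
print(len(gpt2_tok.encode(english)), "tokens for the English sentence")
print(len(gpt2_tok.encode(khmer)), "tokens for the Khmer sentence")
```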

text_example = "αααααααΆααααα»ααΆααΆαααααααααααααΆααα·α αα ααΆαααααααα·α αα αα ααααα·ααααα·ααΆα"

[970, 273, 298, 420, 1583, 397, 284, 343, 259, 453, 397, 418, 1904, 259, 317]
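For reference, here is a minimal sketch of how the encoding above could be reproduced with the Hugging Face tokenizers library, assuming the trained BPE tokenizer was exported to a tokenizer.json file; the file name is an assumption, not something stated in this README.

```python
# Minimal sketch of reproducing the encoding above with the `tokenizers`
# library. The file name "tokenizer.json" is an assumed export path for the
# trained BPE tokenizer; adjust it to the actual artifact.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
print(tokenizer.get_vocab_size())  # expected to be 7152 for this model card

text_example = "..."  # paste the Khmer sample sentence from the README here

encoding = tokenizer.encode(text_example)
print(encoding.tokens)  # the BPE pieces
print(encoding.ids)     # token ids, e.g. [970, 273, 298, ...] for the sample above
```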