orofido committed
Commit 0463eb2 • 1 Parent(s): fcf39dd

Update README.md

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -8,6 +8,8 @@ The tokenizer is trained with only Khmer/English. The corpus trained with approx

  The model card has a vocab size of 7152 and its tokenizer type is Byte Pair Encoding.

+ Based on well-known tokenizers, it is clear that non-English words are barely represented in the pretrained vocabulary. As a result, long-text translation from one language to another is close to impossible.
+
  text_example = "រដ្ឋសភាកម្ពុជាតាមរយៈគណៈកម្មការកិច្ចការបរទេសកិច្ចសហប្រតិបត្តិការ"

  [970, 273, 298, 420, 1583, 397, 284, 343, 259, 453, 397, 418, 1904, 259, 317]
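
For context, here is a minimal sketch of how the token IDs listed above could be reproduced with the Hugging Face `tokenizers` library. The repo id and filename below are placeholders, not taken from this model card, and the snippet assumes the tokenizer was exported as a standard `tokenizer.json`.

```python
# Minimal sketch: encode the README's text_example with the trained BPE tokenizer.
# Assumptions (not stated in the model card): the tokenizer is published as a
# standard tokenizer.json, and "orofido/khmer-english-bpe" is a placeholder
# repo id -- substitute the real repository name.
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

path = hf_hub_download(repo_id="orofido/khmer-english-bpe", filename="tokenizer.json")
tok = Tokenizer.from_file(path)

text_example = "រដ្ឋសភាកម្ពុជាតាមរយៈគណៈកម្មការកិច្ចការបរទេសកិច្ចសហប្រតិបត្តិការ"
enc = tok.encode(text_example)

print(tok.get_vocab_size())  # the model card states a vocab size of 7152
print(enc.ids)               # e.g. [970, 273, 298, 420, ...] as listed above
print(enc.tokens)            # the BPE sub-word pieces behind those IDs
```

For comparison, encoding the same sentence with a general-purpose English-centric tokenizer (for example GPT-2's byte-level BPE loaded via `transformers.AutoTokenizer`) yields far more, much shorter pieces, which is the fragmentation the added README line refers to.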