---
license: mit
language:
  - km
  - en
---

The tokenizer is trained on Khmer and English only. The training corpus was collected from approximately 3,000 links, and the tokenizer was trained with SentencePiece using a configuration similar to Llama 3's.

The tokenizer has a vocabulary size of 7,152 and uses Byte Pair Encoding (BPE).
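
Below is a minimal sketch of how such a tokenizer could be trained. The corpus file name and the extra training flags are assumptions; only the vocabulary size (7,152) and the BPE model type come from this card.

```python
# Minimal SentencePiece training sketch (assumptions noted inline).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="khmer_english_corpus.txt",  # hypothetical file built from the ~3,000 links
    model_prefix="tok7152",            # writes tok7152.model and tok7152.vocab
    vocab_size=7152,                   # vocabulary size stated above
    model_type="bpe",                  # Byte Pair Encoding
    character_coverage=1.0,            # assumed: keep full Khmer script coverage
)
```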

Looking at well-known tokenizers, it is clear that non-English words are barely represented in their pretrained vocabularies. As a result, long-text translation from one language to the other becomes close to impractical, because Khmer text gets split into many more tokens under those vocabularies, as the comparison below shows.

```
text_example = "αžšαžŠαŸ’αž‹αžŸαž—αžΆαž€αž˜αŸ’αž–αž»αž‡αžΆαžαžΆαž˜αžšαž™αŸˆαž‚αžŽαŸˆαž€αž˜αŸ’αž˜αž€αžΆαžšαž€αž·αž…αŸ’αž…αž€αžΆαžšαž”αžšαž‘αŸαžŸαž€αž·αž…αŸ’αž…αžŸαž αž”αŸ’αžšαžαž·αž”αžαŸ’αžαž·αž€αžΆαžš"
```

Tok7152 (15 tokens):

```
[970, 273, 298, 420, 1583, 397, 284, 343, 259, 453, 397, 418, 1904, 259, 317]
```

GPT4Tokenizer CL100K (100 tokens):

```
[21549, 248, 21549, 232, 73673, 233, 21549, 253, 21549, 245, 98629, 222, 21549, 246, 73673, 244, 21549, 119, 21549, 229, 98629, 237, 98629, 246, 21549, 248, 21549, 247, 45358, 230, 21549, 224, 21549, 236, 45358, 230, 21549, 222, 21549, 246, 73673, 246, 21549, 222, 98629, 248, 21549, 222, 21549, 115, 21549, 227, 73673, 227, 21549, 222, 98629, 248, 21549, 242, 21549, 248, 21549, 239, 45358, 223, 21549, 253, 21549, 222, 21549, 115, 21549, 227, 73673, 227, 21549, 253, 21549, 254, 21549, 242, 73673, 248, 21549, 237, 21549, 115, 21549, 242, 21549, 237, 73673, 237, 21549, 115, 21549, 222, 98629, 248]
```

Compression ratio: 100 / 15 β‰ˆ 6.67Γ—
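
The comparison can be reproduced with a short script. This is a sketch under assumptions: the SentencePiece model file is taken to be tok7152.model from this repo, and tiktoken's cl100k_base encoding stands in for the GPT-4 tokenizer.

```python
# Sketch: compare token counts of Tok7152 and GPT-4's cl100k_base on the example.
import sentencepiece as spm
import tiktoken

text_example = "αžšαžŠαŸ’αž‹αžŸαž—αžΆαž€αž˜αŸ’αž–αž»αž‡αžΆαžαžΆαž˜αžšαž™αŸˆαž‚αžŽαŸˆαž€αž˜αŸ’αž˜αž€αžΆαžšαž€αž·αž…αŸ’αž…αž€αžΆαžšαž”αžšαž‘αŸαžŸαž€αž·αž…αŸ’αž…αžŸαž αž”αŸ’αžšαžαž·αž”αžαŸ’αžαž·αž€αžΆαžš"

sp = spm.SentencePieceProcessor(model_file="tok7152.model")  # model file from this repo
tok7152_ids = sp.encode(text_example, out_type=int)

cl100k = tiktoken.get_encoding("cl100k_base")                # GPT-4 tokenizer
cl100k_ids = cl100k.encode(text_example)

print(len(tok7152_ids))                                      # 15 in the example above
print(len(cl100k_ids))                                       # 100 in the example above
print(f"{len(cl100k_ids) / len(tok7152_ids):.2f}x")          # ~6.67x compression
```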