Update README.md
README.md CHANGED
@@ -9,8 +9,11 @@ The tokenizer is trained with only Khmer/English. The corpus trained with approx
 The model card lists a vocab size of 7152, and the tokenizer type is Byte Pair Encoding.
 
 [970, 273, 298, 420, 1583, 397, 284, 343, 259, 453, 397, 418, 1904, 259, 317]
+
 Tok7152 - Length: 15 Tokens
-
+
 [21549, 248, 21549, 232, 73673, 233, 21549, 253, 21549, 245, 98629, 222, 21549, 246, 73673, 244, 21549, 119, 21549, 229, 98629, 237, 98629, 246, 21549, 248, 21549, 247, 45358, 230, 21549, 224, 21549, 236, 45358, 230, 21549, 222, 21549, 246, 73673, 246, 21549, 222, 98629, 248, 21549, 222, 21549, 115, 21549, 227, 73673, 227, 21549, 222, 98629, 248, 21549, 242, 21549, 248, 21549, 239, 45358, 223, 21549, 253, 21549, 222, 21549, 115, 21549, 227, 73673, 227, 21549, 253, 21549, 254, 21549, 242, 73673, 248, 21549, 237, 21549, 115, 21549, 242, 21549, 237, 73673, 237, 21549, 115, 21549, 222, 98629, 248]
+
 GPT4Tokenizer CL100K - Length: 100 Tokens
+
 Compression Ratio: 6.67x
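For reference, the 6.67x figure is simply the ratio of the two token counts: 100 / 15 ≈ 6.67. Below is a minimal sketch of how such a comparison can be reproduced, assuming the Tok7152 tokenizer is distributed as a Hugging Face `tokenizers` JSON file. The `tok7152/tokenizer.json` path and the Khmer sample sentence are illustrative assumptions, not the exact inputs behind the ID lists above.

```python
# Minimal sketch: compare a custom Khmer/English BPE tokenizer against
# GPT-4's cl100k_base encoding. Requires: pip install tokenizers tiktoken
import tiktoken
from tokenizers import Tokenizer

# Illustrative Khmer sample; not the exact text behind the ID lists above.
text = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"

# Hypothetical path to the 7152-entry BPE tokenizer file.
tok7152 = Tokenizer.from_file("tok7152/tokenizer.json")
custom_ids = tok7152.encode(text).ids

# GPT-4's tokenizer (cl100k_base) for comparison.
cl100k = tiktoken.get_encoding("cl100k_base")
gpt4_ids = cl100k.encode(text)

print(f"Tok7152 - Length: {len(custom_ids)} Tokens")
print(f"GPT4Tokenizer CL100K - Length: {len(gpt4_ids)} Tokens")
print(f"Compression Ratio: {len(gpt4_ids) / len(custom_ids):.2f}x")
```

A higher ratio means the custom tokenizer spends far fewer tokens on the same Khmer text, largely because cl100k_base falls back to byte-level pieces for most Khmer characters, which is visible in the long run of repeated IDs such as 21549 and 98629 in the CL100K list above.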