## token

space

```yml
# multi-space
{"id": 881, "token": "\r\n\r\n", "token_decode": "\r\n\r\n", "token_len": 4, "zh_count": 0, "space_count": 4, "digit_count": 0, "zh_symbol_count": 0}

# space + en
{"id": 862, "token": "\treturn", "token_decode": "\treturn", "token_len": 7, "zh_count": 0, "space_count": 1, "digit_count": 0, "zh_symbol_count": 0}

# space + zh
{"id": 40195, "token": " 下", "token_decode": " 下", "token_len": 2, "zh_count": 1, "space_count": 1, "digit_count": 0, "zh_symbol_count": 0}
```

special_token

```
{"id": 100257, "token": "<|endoftext|>", "token_decode": "<|endoftext|>", "token_len": 13, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
{"id": 100258, "token": "<|fim_prefix|>", "token_decode": "<|fim_prefix|>", "token_len": 14, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
{"id": 100259, "token": "<|fim_middle|>", "token_decode": "<|fim_middle|>", "token_len": 14, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
{"id": 100260, "token": "<|fim_suffix|>", "token_decode": "<|fim_suffix|>", "token_len": 14, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
{"id": 100276, "token": "<|endofprompt|>", "token_decode": "<|endofprompt|>", "token_len": 15, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
```
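
As a rough illustration of how these entries might be used, the sketch below separates special tokens from ordinary text before any further tokenization. The ids are copied from the records above; `split_special` is a hypothetical helper, not part of any tokenizer library, and real tokenizers may handle special tokens differently.

```python
import re

# Ids copied from the special_token records above.
SPECIAL_TOKENS = {
    "<|endoftext|>": 100257,
    "<|fim_prefix|>": 100258,
    "<|fim_middle|>": 100259,
    "<|fim_suffix|>": 100260,
    "<|endofprompt|>": 100276,
}

# The capturing group makes re.split keep the matched special tokens.
_SPECIAL_RE = re.compile(
    "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
)


def split_special(text: str) -> list:
    """Return (piece, id) pairs; id is None for ordinary text."""
    parts = []
    for piece in _SPECIAL_RE.split(text):
        if piece:  # skip empty strings produced at the boundaries
            parts.append((piece, SPECIAL_TOKENS.get(piece)))
    return parts


if __name__ == "__main__":
    print(split_special("def f():<|endoftext|>print(1)"))
```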

Chinese character + symbol

```
{"id": 39045, "token": ",请", "token_decode": ",请", "token_len": 2, "zh_count": 1, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
```

## Vocabulary file

```
IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
Jw== 6
KA== 7
```

What on earth is this?
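
One likely reading: the first column is Base64 and the second is an integer id. Decoding the eight sample lines gives the single ASCII characters `!` through `(`, which is consistent with a "Base64-encoded token bytes, then rank" layout such as the one used by `.tiktoken` vocabulary files. The sketch below checks that against the sample only; `parse_vocab` is a hypothetical helper and `SAMPLE` is just the lines shown above.

```python
import base64

# The eight lines quoted from the vocabulary file above.
SAMPLE = """\
IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
Jw== 6
KA== 7
"""


def parse_vocab(text: str) -> dict:
    """Map decoded token bytes to their integer id, one entry per line."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        b64_token, rank = line.split()
        ranks[base64.b64decode(b64_token)] = int(rank)
    return ranks


if __name__ == "__main__":
    for token_bytes, rank in parse_vocab(SAMPLE).items():
        print(rank, token_bytes)
```

Tokens are stored as bytes rather than text, presumably because a byte-level BPE token is not always valid UTF-8 on its own, so Base64 keeps the file plain ASCII.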