
token

space

# multi-space
{"id": 881, "token": "\r\n\r\n", "token_decode": "\r\n\r\n", "token_len": 4, "zh_count": 0, "space_count": 4, "digit_count": 0, "zh_symbol_count": 0}
# space + en
{"id": 862, "token": "\treturn", "token_decode": "\treturn", "token_len": 7, "zh_count": 0, "space_count": 1, "digit_count": 0, "zh_symbol_count": 0}
# space + zh
{"id": 40195, "token": " 下", "token_decode": " 下", "token_len": 2, "zh_count": 1, "space_count": 1, "digit_count": 0, "zh_symbol_count": 0}

special_token

{"id": 100257, "token": "<|endoftext|>", "token_decode": "<|endoftext|>", "token_len": 13, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
{"id": 100258, "token": "<|fim_prefix|>", "token_decode": "<|fim_prefix|>", "token_len": 14, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
{"id": 100259, "token": "<|fim_middle|>", "token_decode": "<|fim_middle|>", "token_len": 14, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
{"id": 100260, "token": "<|fim_suffix|>", "token_decode": "<|fim_suffix|>", "token_len": 14, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}
{"id": 100276, "token": "<|endofprompt|>", "token_decode": "<|endofprompt|>", "token_len": 15, "zh_count": 0, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}

Chinese character + symbol

{"id": 39045, "token": ",请", "token_decode": ",请", "token_len": 2, "zh_count": 1, "space_count": 0, "digit_count": 0, "zh_symbol_count": 0}

Vocabulary file

IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
Jw== 6
KA== 7

What is this?
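
One likely reading: each line is a token's raw bytes encoded as base64, followed by that token's rank (its id). This matches the layout of the `.tiktoken` vocabulary files used by tiktoken, although the notes above do not name the file, so treat that as an assumption. The sketch below decodes the eight sample lines using only the standard library; `parse_vocab_lines` is a hypothetical helper name.

```python
import base64


def parse_vocab_lines(lines):
    """Parse '<base64 token> <rank>' lines into a {token_bytes: rank} dict."""
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        b64_token, rank = line.split()
        vocab[base64.b64decode(b64_token)] = int(rank)
    return vocab


sample = [
    "IQ== 0", "Ig== 1", "Iw== 2", "JA== 3",
    "JQ== 4", "Jg== 5", "Jw== 6", "KA== 7",
]
vocab = parse_vocab_lines(sample)
for token_bytes, rank in vocab.items():
    print(rank, token_bytes)
```

Decoded this way, ranks 0 through 7 are the single-byte tokens `!`, `"`, `#`, `$`, `%`, `&`, `'`, and `(`, that is, the printable ASCII characters in order starting from `!`.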