---
title: Tokenizer Arena
emoji: 
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 3.41.2
app_file: app.py
pinned: false
---

## Compression Rate

On the cc-100 dataset, 10,000 samples are taken for each language to measure the compression rate of different tokenizers. The compression rate metric is g_bytes/b_tokens, i.e. how many gigabytes of raw text correspond to one billion tokens (t_bytes/t_tokens is the same ratio expressed in terabytes per trillion tokens, and b_tokens/g_bytes is its inverse).

You can reproduce the results with the following script:

```bash
python utils/compress_rate_util.py
```
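
For reference, here is a minimal sketch of how the metric can be computed with a Hugging Face tokenizer; the function name and the sample texts are illustrative, and the actual `utils/compress_rate_util.py` may differ:

```python
from transformers import AutoTokenizer

def compress_rate(tokenizer, texts):
    """Return (g_bytes/b_tokens, b_tokens/g_bytes) for a list of text samples."""
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    g_bytes = n_bytes / 1024 ** 3   # raw text size in gigabytes
    b_tokens = n_tokens / 1e9       # token count in billions
    return g_bytes / b_tokens, b_tokens / g_bytes

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    samples = ["今天天气不错", "大语言模型的分词器比较"]
    print(compress_rate(tok, samples))
```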
### Simplified Chinese

Compression rate measured on the Simplified Chinese dataset cc100-zh-Hans.
| tokenizer | vocab_size | g_bytes/b_tokens | t_bytes/t_tokens | b_tokens/g_bytes |
| --- | --- | --- | --- | --- |
| amber | 32000 | 1.84 | 1.8 | 0.54 |
| aya_101 | 250100 | 3.89 | 3.79 | 0.26 |
| baichuan | 64000 | 3.92 | 3.82 | 0.26 |
| baichuan2 | 125696 | 4.53 | 4.42 | 0.22 |
| bert_base_cased | 28996 | 2.73 | 2.66 | 0.37 |
| bert_base_chinese | 21128 | 2.74 | 2.67 | 0.37 |
| bert_base_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| bloom | 250680 | 4.28 | 4.18 | 0.23 |
| byt5_small | 256 | 0.93 | 0.91 | 1.08 |
| character_glm_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm2_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm3_6b | 64798 | 4.2 | 4.1 | 0.24 |
| chatglm_6b | 150344 | 4.65 | 4.54 | 0.22 |
| chatyuan_large_v2 | 32128 | 4.34 | 4.24 | 0.23 |
| chinese_llama | 49953 | 3.93 | 3.84 | 0.25 |
| chinese_llama2 | 55296 | 3.92 | 3.83 | 0.26 |
| code_davinci_002 | 50281 | 1.31 | 1.28 | 0.77 |
| crystal_coder | 32000 | 1.86 | 1.81 | 0.54 |
| deepseek_coder_33b_instruct | 32000 | 3.4 | 3.32 | 0.29 |
| deepseek_llm_7b_base | 100000 | 4.05 | 3.96 | 0.25 |
| falcon_180b | 65024 | 2.18 | 2.13 | 0.46 |
| falcon_7b | 65024 | 2.18 | 2.13 | 0.46 |
| fastchat_t5_3b | 32000 | 13.7 | 13.38 | 0.07 |
| flan_t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| gemma_7b | 256000 | 3.82 | 3.73 | 0.26 |
| gpt2 | 50257 | 1.31 | 1.28 | 0.77 |
| gpt2_chinese | 21128 | 2.73 | 2.66 | 0.37 |
| gpt_35_turbo | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_4 | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_nexo_20b | 50254 | 2.01 | 1.96 | 0.5 |
| internlm2_chat_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm2_math_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm_chat_7b | 103168 | 4.23 | 4.14 | 0.24 |
| internlm_xcomposer_7b | 103168 | 4.23 | 4.14 | 0.24 |
| kplug | 10261 | 2.72 | 2.65 | 0.37 |
| llama | 32000 | 1.84 | 1.8 | 0.54 |
| llama2 | 32000 | 1.84 | 1.8 | 0.54 |
| mistral_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mixtral_8_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mobilebert_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| moss | 106029 | 4.4 | 4.3 | 0.23 |
| mt5_large | 250100 | 3.89 | 3.79 | 0.26 |
| olmo_7b | 50280 | 2.01 | 1.96 | 0.5 |
| orion_14b_chat | 84608 | 4.63 | 4.52 | 0.22 |
| phi_1 | 50257 | 1.31 | 1.28 | 0.77 |
| phi_2 | 50257 | 1.31 | 1.28 | 0.77 |
| pko_t5_large | 50258 | 0.97 | 0.95 | 1.03 |
| prompt_clue | 32128 | 4.34 | 4.24 | 0.23 |
| qwen1_5_14b_chat | 151643 | 4.16 | 4.06 | 0.24 |
| qwen_1_8b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_72b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_7b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| roberta_chinese_clue | 8021 | 2.7 | 2.64 | 0.37 |
| skywork_13b_base | 65519 | 3.69 | 3.61 | 0.27 |
| skywork_13b_math | 65519 | 3.69 | 3.61 | 0.27 |
| solar_10_7b | 32000 | 2.36 | 2.3 | 0.42 |
| starchat_alpha | 49152 | 2.78 | 2.72 | 0.36 |
| switch_c_2048 | 32100 | 14.13 | 13.8 | 0.07 |
| t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| t5_large | 32100 | 14.13 | 13.8 | 0.07 |
| t5_small | 32100 | 14.13 | 13.8 | 0.07 |
| text_davinci_003 | 50281 | 1.31 | 1.28 | 0.77 |
| tigerbot_13b_chat_v2 | 60512 | 4.25 | 4.15 | 0.24 |
| tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 4.15 | 0.24 |
| wizardcoder_15b_v1 | 49152 | 2.78 | 2.72 | 0.36 |
| wizardcoder_python_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardlm_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardmath_70b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| xlm_roberta | 250002 | 3.96 | 3.86 | 0.25 |
| yi_34b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_6b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_vl34b | 64000 | 4.11 | 4.02 | 0.24 |
| zephyr_7b_beta | 32000 | 2.36 | 2.3 | 0.42 |

## Conclusion

Tokenizers with larger vocabularies (and better coverage of Chinese characters) generally achieve higher compression on Simplified Chinese text, encoding more bytes per token, while small-vocabulary tokenizers such as llama (32,000) need far more tokens for the same text.
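
As a quick illustration of this effect (not part of the benchmark above), tokenizing the same Chinese sentence with the large-vocabulary bloom tokenizer and the much smaller gpt2 tokenizer shows the gap in token counts:

```python
from transformers import AutoTokenizer

text = "今天天气不错,我们一起去公园散步吧。"

# bloom's ~250k vocabulary has dedicated Chinese tokens; gpt2's ~50k byte-level
# BPE splits most Chinese characters into several byte tokens.
bloom = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

print("bloom tokens:", len(bloom.encode(text)))
print("gpt2 tokens:", len(gpt2.encode(text)))
```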

## Reference