---
title: Tokenizer Arena
emoji: 
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 3.41.2
app_file: app.py
pinned: false
---

## Compression Rate

On the cc-100 dataset, 10,000 samples are taken for each language to measure the compression rate of different tokenizers. The compression rate metric is g_bytes/b_tokens, i.e. how many gigabytes of raw text correspond to one billion tokens (t_bytes/t_tokens is the same ratio expressed in terabytes per trillion tokens, and b_tokens/g_bytes is its inverse).

You can reproduce the results with the following script:

```bash
python utils/compress_rate_util.py
```
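
For reference, here is a minimal sketch of how the metric can be computed with a Hugging Face tokenizer; the function name and the sample texts are illustrative, and the actual `utils/compress_rate_util.py` may differ:

```python
from transformers import AutoTokenizer

def compress_rate(tokenizer, texts):
    """Return (g_bytes/b_tokens, b_tokens/g_bytes) for a list of text samples."""
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    g_bytes = n_bytes / 1024 ** 3   # raw text size in gigabytes
    b_tokens = n_tokens / 1e9       # token count in billions
    return g_bytes / b_tokens, b_tokens / g_bytes

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    samples = ["今天天气不错", "大语言模型的分词器比较"]
    print(compress_rate(tok, samples))
```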
### Simplified Chinese

Compression rate measured on the Simplified Chinese dataset cc100-zh-Hans.
| tokenizer | vocab_size | g_bytes/b_tokens | t_bytes/t_tokens | b_tokens/g_bytes |
| --- | --- | --- | --- | --- |
| amber | 32000 | 1.84 | 1.8 | 0.54 |
| aya_101 | 250100 | 3.89 | 3.79 | 0.26 |
| baichuan | 64000 | 3.92 | 3.82 | 0.26 |
| baichuan2 | 125696 | 4.53 | 4.42 | 0.22 |
| bert_base_cased | 28996 | 2.73 | 2.66 | 0.37 |
| bert_base_chinese | 21128 | 2.74 | 2.67 | 0.37 |
| bert_base_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| bloom | 250680 | 4.28 | 4.18 | 0.23 |
| byt5_small | 256 | 0.93 | 0.91 | 1.08 |
| character_glm_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm2_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm3_6b | 64798 | 4.2 | 4.1 | 0.24 |
| chatglm_6b | 150344 | 4.65 | 4.54 | 0.22 |
| chatyuan_large_v2 | 32128 | 4.34 | 4.24 | 0.23 |
| chinese_llama | 49953 | 3.93 | 3.84 | 0.25 |
| chinese_llama2 | 55296 | 3.92 | 3.83 | 0.26 |
| code_davinci_002 | 50281 | 1.31 | 1.28 | 0.77 |
| crystal_coder | 32000 | 1.86 | 1.81 | 0.54 |
| deepseek_coder_33b_instruct | 32000 | 3.4 | 3.32 | 0.29 |
| deepseek_llm_7b_base | 100000 | 4.05 | 3.96 | 0.25 |
| falcon_180b | 65024 | 2.18 | 2.13 | 0.46 |
| falcon_7b | 65024 | 2.18 | 2.13 | 0.46 |
| fastchat_t5_3b | 32000 | 13.7 | 13.38 | 0.07 |
| flan_t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| gemma_7b | 256000 | 3.82 | 3.73 | 0.26 |
| gpt2 | 50257 | 1.31 | 1.28 | 0.77 |
| gpt2_chinese | 21128 | 2.73 | 2.66 | 0.37 |
| gpt_35_turbo | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_4 | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_nexo_20b | 50254 | 2.01 | 1.96 | 0.5 |
| internlm2_chat_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm2_math_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm_chat_7b | 103168 | 4.23 | 4.14 | 0.24 |
| internlm_xcomposer_7b | 103168 | 4.23 | 4.14 | 0.24 |
| kplug | 10261 | 2.72 | 2.65 | 0.37 |
| llama | 32000 | 1.84 | 1.8 | 0.54 |
| llama2 | 32000 | 1.84 | 1.8 | 0.54 |
| mistral_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mixtral_8_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mobilebert_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| moss | 106029 | 4.4 | 4.3 | 0.23 |
| mt5_large | 250100 | 3.89 | 3.79 | 0.26 |
| olmo_7b | 50280 | 2.01 | 1.96 | 0.5 |
| orion_14b_chat | 84608 | 4.63 | 4.52 | 0.22 |
| phi_1 | 50257 | 1.31 | 1.28 | 0.77 |
| phi_2 | 50257 | 1.31 | 1.28 | 0.77 |
| pko_t5_large | 50258 | 0.97 | 0.95 | 1.03 |
| prompt_clue | 32128 | 4.34 | 4.24 | 0.23 |
| qwen1_5_14b_chat | 151643 | 4.16 | 4.06 | 0.24 |
| qwen_1_8b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_72b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_7b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| roberta_chinese_clue | 8021 | 2.7 | 2.64 | 0.37 |
| skywork_13b_base | 65519 | 3.69 | 3.61 | 0.27 |
| skywork_13b_math | 65519 | 3.69 | 3.61 | 0.27 |
| solar_10_7b | 32000 | 2.36 | 2.3 | 0.42 |
| starchat_alpha | 49152 | 2.78 | 2.72 | 0.36 |
| switch_c_2048 | 32100 | 14.13 | 13.8 | 0.07 |
| t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| t5_large | 32100 | 14.13 | 13.8 | 0.07 |
| t5_small | 32100 | 14.13 | 13.8 | 0.07 |
| text_davinci_003 | 50281 | 1.31 | 1.28 | 0.77 |
| tigerbot_13b_chat_v2 | 60512 | 4.25 | 4.15 | 0.24 |
| tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 4.15 | 0.24 |
| wizardcoder_15b_v1 | 49152 | 2.78 | 2.72 | 0.36 |
| wizardcoder_python_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardlm_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardmath_70b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| xlm_roberta | 250002 | 3.96 | 3.86 | 0.25 |
| yi_34b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_6b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_vl34b | 64000 | 4.11 | 4.02 | 0.24 |
| zephyr_7b_beta | 32000 | 2.36 | 2.3 | 0.42 |

## Conclusion

Tokenizers with larger vocabularies (and better coverage of Chinese characters) generally achieve higher compression on Simplified Chinese text, encoding more bytes per token, while small-vocabulary tokenizers such as llama (32,000) need far more tokens for the same text.
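
As a quick illustration of this effect (not part of the benchmark above), tokenizing the same Chinese sentence with the large-vocabulary bloom tokenizer and the much smaller gpt2 tokenizer shows the gap in token counts:

```python
from transformers import AutoTokenizer

text = "今天天气不错,我们一起去公园散步吧。"

# bloom's ~250k vocabulary has dedicated Chinese tokens; gpt2's ~50k byte-level
# BPE splits most Chinese characters into several byte tokens.
bloom = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

print("bloom tokens:", len(bloom.encode(text)))
print("gpt2 tokens:", len(gpt2.encode(text)))
```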

## Reference