Spaces:

eson
/

tokenizer-arena

Running

App Files Files Community

tokenizer-arena / README.md

eson

remove vocabs; update compression_app; add character_app;

2bd606a about 1 month ago

preview code

raw history blame

No virus

26.1 kB

	---
	title: Tokenizer Arena
	emoji: ⚡
	colorFrom: red
	colorTo: gray
	sdk: gradio
	sdk_version: 4.28.3
	app_file: app.py
	pinned: false
	datasets:
	- cc100
	---



	## 压缩率 Compress Rate


	在 [cc-100](https://huggingface.co/datasets/cc100) 数据集，每个语言取1万条数据，测试不同tokenizer的压缩率。

	> 压缩率示例：
	llama3扩充了词典，具有更高的压缩比。同样1T字节的简体中文语料，llama分词后是 0.56万亿个token，llama3只需要0.31万亿个token。

	\| tokenizer \| vocab_size \| t_bytes/t_tokens \| t_tokens/t_bytes \| n_chars/n_tokens \|
	\|:-----------------------------\|-------------:\|-------------------:\|-------------------:\|-------------------:\|
	\| llama \| 32000 \| 1.8 \| 0.56 \| 0.7 \|
	\| llama3 \| 128000 \| 3.2 \| 0.31 \| 1.24 \|

	可通过以下脚本进行复现
	```sh
	python utils/compress_rate_util.py
	```




	<details> <summary>英文压缩率</summary>
	在英文数据集 cc100-en 计算压缩率

	\| tokenizer \| vocab_size \| g_bytes/b_tokens \| b_tokens/g_bytes \| t_bytes/t_tokens \| t_tokens/t_bytes \| n_chars/n_tokens \|
	\|:----------------------------\|-------------:\|-------------------:\|-------------------:\|-------------------:\|-------------------:\|-------------------:\|
	\| amber \| 32000 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| aya_101 \| 250100 \| 3.3 \| 0.3 \| 3.22 \| 0.31 \| 3.53 \|
	\| baichuan \| 64000 \| 3.74 \| 0.27 \| 3.65 \| 0.27 \| 4 \|
	\| baichuan2 \| 125696 \| 3.89 \| 0.26 \| 3.8 \| 0.26 \| 4.17 \|
	\| bert_base_cased \| 28996 \| 3.64 \| 0.27 \| 3.55 \| 0.28 \| 3.89 \|
	\| bert_base_chinese \| 21128 \| 2.78 \| 0.36 \| 2.71 \| 0.37 \| 2.97 \|
	\| bert_base_uncased \| 30522 \| 3.73 \| 0.27 \| 3.65 \| 0.27 \| 4 \|
	\| bloom \| 250680 \| 4.07 \| 0.25 \| 3.97 \| 0.25 \| 4.36 \|
	\| byt5_small \| 256 \| 0.92 \| 1.08 \| 0.9 \| 1.11 \| 0.99 \|
	\| character_glm_6b \| 64794 \| 3.62 \| 0.28 \| 3.54 \| 0.28 \| 3.88 \|
	\| chatglm2_6b \| 64794 \| 3.62 \| 0.28 \| 3.54 \| 0.28 \| 3.88 \|
	\| chatglm3_6b \| 64798 \| 3.62 \| 0.28 \| 3.54 \| 0.28 \| 3.88 \|
	\| chatglm_6b \| 150344 \| 3.68 \| 0.27 \| 3.59 \| 0.28 \| 3.94 \|
	\| chatyuan_large_v2 \| 32128 \| 1.95 \| 0.51 \| 1.91 \| 0.52 \| 2.09 \|
	\| chinese_llama \| 49953 \| 3.59 \| 0.28 \| 3.51 \| 0.28 \| 3.85 \|
	\| chinese_llama2 \| 55296 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| code_davinci_002 \| 50281 \| 4.05 \| 0.25 \| 3.96 \| 0.25 \| 4.34 \|
	\| crystal_coder \| 32000 \| 3.68 \| 0.27 \| 3.59 \| 0.28 \| 3.94 \|
	\| dbrx_instruct \| 100277 \| 4.11 \| 0.24 \| 4.01 \| 0.25 \| 4.4 \|
	\| deepseek_coder_33b_instruct \| 32000 \| 3.64 \| 0.27 \| 3.56 \| 0.28 \| 3.9 \|
	\| deepseek_llm_7b_base \| 100000 \| 3.85 \| 0.26 \| 3.76 \| 0.27 \| 4.12 \|
	\| falcon_180b \| 65024 \| 3.99 \| 0.25 \| 3.9 \| 0.26 \| 4.27 \|
	\| falcon_7b \| 65024 \| 3.99 \| 0.25 \| 3.9 \| 0.26 \| 4.27 \|
	\| fastchat_t5_3b \| 32000 \| 2.16 \| 0.46 \| 2.11 \| 0.47 \| 2.31 \|
	\| flan_t5_base \| 32100 \| 3.61 \| 0.28 \| 3.53 \| 0.28 \| 3.87 \|
	\| gemma_7b \| 256000 \| 3.91 \| 0.26 \| 3.82 \| 0.26 \| 4.18 \|
	\| gpt2 \| 50257 \| 4.05 \| 0.25 \| 3.96 \| 0.25 \| 4.34 \|
	\| gpt2_chinese \| 21128 \| 2.67 \| 0.37 \| 2.61 \| 0.38 \| 2.86 \|
	\| gpt_35_turbo \| 100277 \| 4.11 \| 0.24 \| 4.01 \| 0.25 \| 4.4 \|
	\| gpt_4 \| 100277 \| 4.11 \| 0.24 \| 4.01 \| 0.25 \| 4.4 \|
	\| gpt_nexo_20b \| 50254 \| 4.04 \| 0.25 \| 3.94 \| 0.25 \| 4.32 \|
	\| grok_1 \| 131072 \| 4.06 \| 0.25 \| 3.96 \| 0.25 \| 4.35 \|
	\| internlm2_chat_7b \| 92544 \| 3.86 \| 0.26 \| 3.77 \| 0.27 \| 4.13 \|
	\| internlm2_math_7b \| 92544 \| 3.86 \| 0.26 \| 3.77 \| 0.27 \| 4.13 \|
	\| internlm_chat_7b \| 103168 \| 3.86 \| 0.26 \| 3.77 \| 0.27 \| 4.13 \|
	\| internlm_xcomposer_7b \| 103168 \| 3.86 \| 0.26 \| 3.77 \| 0.27 \| 4.13 \|
	\| jamba_v0_1 \| 65536 \| 3.82 \| 0.26 \| 3.73 \| 0.27 \| 4.09 \|
	\| kplug \| 10261 \| 2.66 \| 0.38 \| 2.6 \| 0.38 \| 2.85 \|
	\| llama \| 32000 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| llama2 \| 32000 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| llama3 \| 128000 \| 4.11 \| 0.24 \| 4.01 \| 0.25 \| 4.4 \|
	\| mistral_7b \| 32000 \| 3.67 \| 0.27 \| 3.58 \| 0.28 \| 3.92 \|
	\| mixtral_8_7b \| 32000 \| 3.67 \| 0.27 \| 3.58 \| 0.28 \| 3.92 \|
	\| mobilebert_uncased \| 30522 \| 3.73 \| 0.27 \| 3.65 \| 0.27 \| 4 \|
	\| moss \| 106029 \| 4.08 \| 0.25 \| 3.98 \| 0.25 \| 4.36 \|
	\| mt5_large \| 250100 \| 3.3 \| 0.3 \| 3.22 \| 0.31 \| 3.53 \|
	\| olmo_7b \| 50280 \| 4.04 \| 0.25 \| 3.94 \| 0.25 \| 4.32 \|
	\| orion_14b_chat \| 84608 \| 3.94 \| 0.25 \| 3.85 \| 0.26 \| 4.22 \|
	\| phi_1 \| 50257 \| 4.05 \| 0.25 \| 3.96 \| 0.25 \| 4.34 \|
	\| phi_2 \| 50257 \| 4.05 \| 0.25 \| 3.96 \| 0.25 \| 4.34 \|
	\| pko_t5_large \| 50258 \| 1.59 \| 0.63 \| 1.55 \| 0.64 \| 1.7 \|
	\| prompt_clue \| 32128 \| 1.95 \| 0.51 \| 1.91 \| 0.52 \| 2.09 \|
	\| qwen1_5_14b_chat \| 151643 \| 4.06 \| 0.25 \| 3.97 \| 0.25 \| 4.35 \|
	\| qwen_1_8b_chat \| 151851 \| 4.06 \| 0.25 \| 3.97 \| 0.25 \| 4.35 \|
	\| qwen_72b_chat \| 151851 \| 4.06 \| 0.25 \| 3.97 \| 0.25 \| 4.35 \|
	\| qwen_7b_chat \| 151851 \| 4.06 \| 0.25 \| 3.97 \| 0.25 \| 4.35 \|
	\| roberta_chinese_clue \| 8021 \| 1.8 \| 0.56 \| 1.75 \| 0.57 \| 1.92 \|
	\| skywork_13b_base \| 65519 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| skywork_13b_math \| 65519 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| solar_10_7b \| 32000 \| 3.67 \| 0.27 \| 3.58 \| 0.28 \| 3.92 \|
	\| starchat_alpha \| 49152 \| 3.63 \| 0.28 \| 3.54 \| 0.28 \| 3.88 \|
	\| switch_c_2048 \| 32100 \| 3.61 \| 0.28 \| 3.53 \| 0.28 \| 3.87 \|
	\| t5_base \| 32100 \| 3.61 \| 0.28 \| 3.53 \| 0.28 \| 3.87 \|
	\| t5_large \| 32100 \| 3.61 \| 0.28 \| 3.53 \| 0.28 \| 3.87 \|
	\| t5_small \| 32100 \| 3.61 \| 0.28 \| 3.53 \| 0.28 \| 3.87 \|
	\| text_davinci_003 \| 50281 \| 4.05 \| 0.25 \| 3.96 \| 0.25 \| 4.34 \|
	\| tigerbot_13b_chat_v2 \| 60512 \| 3.67 \| 0.27 \| 3.58 \| 0.28 \| 3.93 \|
	\| tigerbot_70b_chat_v4_4k \| 65107 \| 3.65 \| 0.27 \| 3.57 \| 0.28 \| 3.91 \|
	\| wizardcoder_15b_v1 \| 49152 \| 3.63 \| 0.28 \| 3.54 \| 0.28 \| 3.88 \|
	\| wizardcoder_python_7b_v1 \| 32000 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| wizardlm_7b_v1 \| 32000 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| wizardmath_70b_v1 \| 32000 \| 3.56 \| 0.28 \| 3.47 \| 0.29 \| 3.81 \|
	\| xlm_roberta \| 250002 \| 3.49 \| 0.29 \| 3.41 \| 0.29 \| 3.74 \|
	\| yi_34b \| 64000 \| 3.87 \| 0.26 \| 3.78 \| 0.26 \| 4.15 \|
	\| yi_6b \| 64000 \| 3.87 \| 0.26 \| 3.78 \| 0.26 \| 4.15 \|
	\| yi_vl34b \| 64000 \| 3.88 \| 0.26 \| 3.79 \| 0.26 \| 4.16 \|
	\| zephyr_7b_beta \| 32000 \| 3.67 \| 0.27 \| 3.58 \| 0.28 \| 3.92 \|

	</details>


	<details> <summary>简体中文压缩率</summary>
	在简体中文数据集 cc100-zh-Hans 计算压缩率

	\| tokenizer \| vocab_size \| g_bytes/b_tokens \| b_tokens/g_bytes \| t_bytes/t_tokens \| t_tokens/t_bytes \| n_chars/n_tokens \|
	\|:----------------------------\|-------------:\|-------------------:\|-------------------:\|-------------------:\|-------------------:\|-------------------:\|
	\| amber \| 32000 \| 1.84 \| 0.54 \| 1.8 \| 0.56 \| 0.7 \|
	\| aya_101 \| 250100 \| 3.89 \| 0.26 \| 3.79 \| 0.26 \| 1.47 \|
	\| baichuan \| 64000 \| 3.92 \| 0.26 \| 3.82 \| 0.26 \| 1.48 \|
	\| baichuan2 \| 125696 \| 4.53 \| 0.22 \| 4.42 \| 0.23 \| 1.71 \|
	\| bert_base_cased \| 28996 \| 2.73 \| 0.37 \| 2.66 \| 0.38 \| 1.03 \|
	\| bert_base_chinese \| 21128 \| 2.74 \| 0.37 \| 2.67 \| 0.37 \| 1.03 \|
	\| bert_base_uncased \| 30522 \| 2.73 \| 0.37 \| 2.67 \| 0.38 \| 1.03 \|
	\| bloom \| 250680 \| 4.28 \| 0.23 \| 4.18 \| 0.24 \| 1.62 \|
	\| byt5_small \| 256 \| 0.93 \| 1.08 \| 0.91 \| 1.1 \| 0.35 \|
	\| character_glm_6b \| 64794 \| 4.2 \| 0.24 \| 4.1 \| 0.24 \| 1.59 \|
	\| chatglm2_6b \| 64794 \| 4.2 \| 0.24 \| 4.1 \| 0.24 \| 1.59 \|
	\| chatglm3_6b \| 64798 \| 4.2 \| 0.24 \| 4.1 \| 0.24 \| 1.59 \|
	\| chatglm_6b \| 150344 \| 4.65 \| 0.22 \| 4.54 \| 0.22 \| 1.76 \|
	\| chatyuan_large_v2 \| 32128 \| 4.34 \| 0.23 \| 4.24 \| 0.24 \| 1.64 \|
	\| chinese_llama \| 49953 \| 3.93 \| 0.25 \| 3.84 \| 0.26 \| 1.49 \|
	\| chinese_llama2 \| 55296 \| 3.92 \| 0.26 \| 3.83 \| 0.26 \| 1.48 \|
	\| code_davinci_002 \| 50281 \| 1.31 \| 0.77 \| 1.28 \| 0.78 \| 0.49 \|
	\| crystal_coder \| 32000 \| 1.86 \| 0.54 \| 1.81 \| 0.55 \| 0.7 \|
	\| dbrx_instruct \| 100277 \| 2.26 \| 0.44 \| 2.21 \| 0.45 \| 0.85 \|
	\| deepseek_coder_33b_instruct \| 32000 \| 3.4 \| 0.29 \| 3.32 \| 0.3 \| 1.29 \|
	\| deepseek_llm_7b_base \| 100000 \| 4.05 \| 0.25 \| 3.96 \| 0.25 \| 1.53 \|
	\| falcon_180b \| 65024 \| 2.18 \| 0.46 \| 2.13 \| 0.47 \| 0.82 \|
	\| falcon_7b \| 65024 \| 2.18 \| 0.46 \| 2.13 \| 0.47 \| 0.82 \|
	\| fastchat_t5_3b \| 32000 \| 13.7 \| 0.07 \| 13.38 \| 0.07 \| 5.18 \|
	\| flan_t5_base \| 32100 \| 14.13 \| 0.07 \| 13.8 \| 0.07 \| 5.34 \|
	\| gemma_7b \| 256000 \| 3.82 \| 0.26 \| 3.73 \| 0.27 \| 1.44 \|
	\| gpt2 \| 50257 \| 1.31 \| 0.77 \| 1.28 \| 0.78 \| 0.49 \|
	\| gpt2_chinese \| 21128 \| 2.73 \| 0.37 \| 2.66 \| 0.38 \| 1.03 \|
	\| gpt_35_turbo \| 100277 \| 2.26 \| 0.44 \| 2.21 \| 0.45 \| 0.85 \|
	\| gpt_4 \| 100277 \| 2.26 \| 0.44 \| 2.21 \| 0.45 \| 0.85 \|
	\| gpt_nexo_20b \| 50254 \| 2.01 \| 0.5 \| 1.96 \| 0.51 \| 0.76 \|
	\| grok_1 \| 131072 \| 1.73 \| 0.58 \| 1.69 \| 0.59 \| 0.66 \|
	\| internlm2_chat_7b \| 92544 \| 4.23 \| 0.24 \| 4.13 \| 0.24 \| 1.6 \|
	\| internlm2_math_7b \| 92544 \| 4.23 \| 0.24 \| 4.13 \| 0.24 \| 1.6 \|
	\| internlm_chat_7b \| 103168 \| 4.23 \| 0.24 \| 4.14 \| 0.24 \| 1.6 \|
	\| internlm_xcomposer_7b \| 103168 \| 4.23 \| 0.24 \| 4.14 \| 0.24 \| 1.6 \|
	\| jamba_v0_1 \| 65536 \| 2.3 \| 0.44 \| 2.24 \| 0.45 \| 0.87 \|
	\| kplug \| 10261 \| 2.72 \| 0.37 \| 2.65 \| 0.38 \| 1.03 \|
	\| llama \| 32000 \| 1.84 \| 0.54 \| 1.8 \| 0.56 \| 0.7 \|
	\| llama2 \| 32000 \| 1.84 \| 0.54 \| 1.8 \| 0.56 \| 0.7 \|
	\| llama3 \| 128000 \| 3.28 \| 0.3 \| 3.2 \| 0.31 \| 1.24 \|
	\| mistral_7b \| 32000 \| 2.36 \| 0.42 \| 2.3 \| 0.43 \| 0.89 \|
	\| mixtral_8_7b \| 32000 \| 2.36 \| 0.42 \| 2.3 \| 0.43 \| 0.89 \|
	\| mobilebert_uncased \| 30522 \| 2.73 \| 0.37 \| 2.67 \| 0.38 \| 1.03 \|
	\| moss \| 106029 \| 4.4 \| 0.23 \| 4.3 \| 0.23 \| 1.66 \|
	\| mt5_large \| 250100 \| 3.89 \| 0.26 \| 3.79 \| 0.26 \| 1.47 \|
	\| olmo_7b \| 50280 \| 2.01 \| 0.5 \| 1.96 \| 0.51 \| 0.76 \|
	\| orion_14b_chat \| 84608 \| 4.63 \| 0.22 \| 4.52 \| 0.22 \| 1.75 \|
	\| phi_1 \| 50257 \| 1.31 \| 0.77 \| 1.28 \| 0.78 \| 0.49 \|
	\| phi_2 \| 50257 \| 1.31 \| 0.77 \| 1.28 \| 0.78 \| 0.49 \|
	\| pko_t5_large \| 50258 \| 0.97 \| 1.03 \| 0.95 \| 1.06 \| 0.37 \|
	\| prompt_clue \| 32128 \| 4.34 \| 0.23 \| 4.24 \| 0.24 \| 1.64 \|
	\| qwen1_5_14b_chat \| 151643 \| 4.16 \| 0.24 \| 4.06 \| 0.25 \| 1.57 \|
	\| qwen_1_8b_chat \| 151851 \| 4.16 \| 0.24 \| 4.06 \| 0.25 \| 1.57 \|
	\| qwen_72b_chat \| 151851 \| 4.16 \| 0.24 \| 4.06 \| 0.25 \| 1.57 \|
	\| qwen_7b_chat \| 151851 \| 4.16 \| 0.24 \| 4.06 \| 0.25 \| 1.57 \|
	\| roberta_chinese_clue \| 8021 \| 2.7 \| 0.37 \| 2.64 \| 0.38 \| 1.02 \|
	\| skywork_13b_base \| 65519 \| 3.69 \| 0.27 \| 3.61 \| 0.28 \| 1.4 \|
	\| skywork_13b_math \| 65519 \| 3.69 \| 0.27 \| 3.61 \| 0.28 \| 1.4 \|
	\| solar_10_7b \| 32000 \| 2.36 \| 0.42 \| 2.3 \| 0.43 \| 0.89 \|
	\| starchat_alpha \| 49152 \| 2.78 \| 0.36 \| 2.72 \| 0.37 \| 1.05 \|
	\| switch_c_2048 \| 32100 \| 14.13 \| 0.07 \| 13.8 \| 0.07 \| 5.34 \|
	\| t5_base \| 32100 \| 14.13 \| 0.07 \| 13.8 \| 0.07 \| 5.34 \|
	\| t5_large \| 32100 \| 14.13 \| 0.07 \| 13.8 \| 0.07 \| 5.34 \|
	\| t5_small \| 32100 \| 14.13 \| 0.07 \| 13.8 \| 0.07 \| 5.34 \|
	\| text_davinci_003 \| 50281 \| 1.31 \| 0.77 \| 1.28 \| 0.78 \| 0.49 \|
	\| tigerbot_13b_chat_v2 \| 60512 \| 4.25 \| 0.24 \| 4.15 \| 0.24 \| 1.61 \|
	\| tigerbot_70b_chat_v4_4k \| 65107 \| 4.25 \| 0.24 \| 4.15 \| 0.24 \| 1.61 \|
	\| wizardcoder_15b_v1 \| 49152 \| 2.78 \| 0.36 \| 2.72 \| 0.37 \| 1.05 \|
	\| wizardcoder_python_7b_v1 \| 32000 \| 1.84 \| 0.54 \| 1.8 \| 0.56 \| 0.7 \|
	\| wizardlm_7b_v1 \| 32000 \| 1.84 \| 0.54 \| 1.8 \| 0.56 \| 0.7 \|
	\| wizardmath_70b_v1 \| 32000 \| 1.84 \| 0.54 \| 1.8 \| 0.56 \| 0.7 \|
	\| xlm_roberta \| 250002 \| 3.96 \| 0.25 \| 3.86 \| 0.26 \| 1.5 \|
	\| yi_34b \| 64000 \| 4.17 \| 0.24 \| 4.07 \| 0.25 \| 1.58 \|
	\| yi_6b \| 64000 \| 4.17 \| 0.24 \| 4.07 \| 0.25 \| 1.58 \|
	\| yi_vl34b \| 64000 \| 4.11 \| 0.24 \| 4.02 \| 0.25 \| 1.56 \|
	\| zephyr_7b_beta \| 32000 \| 2.36 \| 0.42 \| 2.3 \| 0.43 \| 0.89 \|

	</details>




	## Reference

	- Getting the most out of your tokenizer for pre-training and domain adaptation
	- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
	- blog
	- https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
	- https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece
	- https://www.huaxiaozhuan.com/%E5%B7%A5%E5%85%B7/huggingface_transformer/chapters/1_tokenizer.html
	- https://zhuanlan.zhihu.com/p/652520262
	- https://github.com/QwenLM/Qwen/blob/main/tokenization_note_zh.md
	- demo
	- https://huggingface.co/spaces/Xenova/the-tokenizer-playground
	- https://github.com/dqbd/tiktokenizer
	- https://chat.lmsys.org/?leaderboard
	- https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
	- paper
	- ss
	-