---
title: Tokenizer Arena
emoji: 
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
datasets:
  - cc100
---



## Compression Rate


On the [cc-100](https://huggingface.co/datasets/cc100) dataset, we sample 10,000 documents per language and measure the compression rate of each tokenizer.

> Compression rate example:
> llama3 enlarged its vocabulary and achieves a higher compression ratio. For the same 1 TB of Simplified Chinese text, llama produces 0.56 trillion tokens, while llama3 needs only 0.31 trillion.

| tokenizer                    |   vocab_size |    t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
|:-----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
| llama                        |        32000 |               1.8  |               0.56 |               0.7  |
| llama3                       |       128000 |               3.2  |               0.31 |               1.24 |

In the column names, `g_bytes`/`t_bytes` are gigabytes/terabytes of text, `b_tokens`/`t_tokens` are billions/trillions of tokens, and `n_chars`/`n_tokens` are character and token counts, so the byte-ratio columns read as roughly bytes per token, and higher values mean better compression. These numbers can be reproduced with:
```sh
python utils/compress_rate_util.py 
```
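
The metric itself is just UTF-8 bytes (or characters) divided by token count over the sampled documents. Below is a minimal sketch of the idea, assuming the `transformers` and `datasets` libraries; the tokenizer name and language are placeholders, and the authoritative implementation is `utils/compress_rate_util.py`:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer; any Hugging Face tokenizer name works here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Stream cc100 so the full corpus is not downloaded; "en" is illustrative.
dataset = load_dataset("cc100", lang="en", split="train", streaming=True)

n_bytes = n_chars = n_tokens = 0
for i, example in enumerate(dataset):
    if i >= 10_000:  # 10k documents per language, as in the tables above
        break
    text = example["text"]
    n_bytes += len(text.encode("utf-8"))
    n_chars += len(text)
    n_tokens += len(tokenizer.encode(text, add_special_tokens=False))

print(f"bytes/token: {n_bytes / n_tokens:.2f}")  # ~ t_bytes/t_tokens
print(f"tokens/byte: {n_tokens / n_bytes:.2f}")  # ~ t_tokens/t_bytes
print(f"chars/token: {n_chars / n_tokens:.2f}")  # ~ n_chars/n_tokens
```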




<details> <summary>English compression rate</summary>
Compression rate computed on the English dataset cc100-en:

| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   b_tokens/g_bytes |   t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| amber                       |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| aya_101                     |       250100 |               3.3  |               0.3  |               3.22 |               0.31 |               3.53 |
| baichuan                    |        64000 |               3.74 |               0.27 |               3.65 |               0.27 |               4    |
| baichuan2                   |       125696 |               3.89 |               0.26 |               3.8  |               0.26 |               4.17 |
| bert_base_cased             |        28996 |               3.64 |               0.27 |               3.55 |               0.28 |               3.89 |
| bert_base_chinese           |        21128 |               2.78 |               0.36 |               2.71 |               0.37 |               2.97 |
| bert_base_uncased           |        30522 |               3.73 |               0.27 |               3.65 |               0.27 |               4    |
| bloom                       |       250680 |               4.07 |               0.25 |               3.97 |               0.25 |               4.36 |
| byt5_small                  |          256 |               0.92 |               1.08 |               0.9  |               1.11 |               0.99 |
| character_glm_6b            |        64794 |               3.62 |               0.28 |               3.54 |               0.28 |               3.88 |
| chatglm2_6b                 |        64794 |               3.62 |               0.28 |               3.54 |               0.28 |               3.88 |
| chatglm3_6b                 |        64798 |               3.62 |               0.28 |               3.54 |               0.28 |               3.88 |
| chatglm_6b                  |       150344 |               3.68 |               0.27 |               3.59 |               0.28 |               3.94 |
| chatyuan_large_v2           |        32128 |               1.95 |               0.51 |               1.91 |               0.52 |               2.09 |
| chinese_llama               |        49953 |               3.59 |               0.28 |               3.51 |               0.28 |               3.85 |
| chinese_llama2              |        55296 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| code_davinci_002            |        50281 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
| crystal_coder               |        32000 |               3.68 |               0.27 |               3.59 |               0.28 |               3.94 |
| dbrx_instruct               |       100277 |               4.11 |               0.24 |               4.01 |               0.25 |               4.4  |
| deepseek_coder_33b_instruct |        32000 |               3.64 |               0.27 |               3.56 |               0.28 |               3.9  |
| deepseek_llm_7b_base        |       100000 |               3.85 |               0.26 |               3.76 |               0.27 |               4.12 |
| falcon_180b                 |        65024 |               3.99 |               0.25 |               3.9  |               0.26 |               4.27 |
| falcon_7b                   |        65024 |               3.99 |               0.25 |               3.9  |               0.26 |               4.27 |
| fastchat_t5_3b              |        32000 |               2.16 |               0.46 |               2.11 |               0.47 |               2.31 |
| flan_t5_base                |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
| gemma_7b                    |       256000 |               3.91 |               0.26 |               3.82 |               0.26 |               4.18 |
| gpt2                        |        50257 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
| gpt2_chinese                |        21128 |               2.67 |               0.37 |               2.61 |               0.38 |               2.86 |
| gpt_35_turbo                |       100277 |               4.11 |               0.24 |               4.01 |               0.25 |               4.4  |
| gpt_4                       |       100277 |               4.11 |               0.24 |               4.01 |               0.25 |               4.4  |
| gpt_nexo_20b                |        50254 |               4.04 |               0.25 |               3.94 |               0.25 |               4.32 |
| grok_1                      |       131072 |               4.06 |               0.25 |               3.96 |               0.25 |               4.35 |
| internlm2_chat_7b           |        92544 |               3.86 |               0.26 |               3.77 |               0.27 |               4.13 |
| internlm2_math_7b           |        92544 |               3.86 |               0.26 |               3.77 |               0.27 |               4.13 |
| internlm_chat_7b            |       103168 |               3.86 |               0.26 |               3.77 |               0.27 |               4.13 |
| internlm_xcomposer_7b       |       103168 |               3.86 |               0.26 |               3.77 |               0.27 |               4.13 |
| jamba_v0_1                  |        65536 |               3.82 |               0.26 |               3.73 |               0.27 |               4.09 |
| kplug                       |        10261 |               2.66 |               0.38 |               2.6  |               0.38 |               2.85 |
| llama                       |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| llama2                      |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| llama3                      |       128000 |               4.11 |               0.24 |               4.01 |               0.25 |               4.4  |
| mistral_7b                  |        32000 |               3.67 |               0.27 |               3.58 |               0.28 |               3.92 |
| mixtral_8_7b                |        32000 |               3.67 |               0.27 |               3.58 |               0.28 |               3.92 |
| mobilebert_uncased          |        30522 |               3.73 |               0.27 |               3.65 |               0.27 |               4    |
| moss                        |       106029 |               4.08 |               0.25 |               3.98 |               0.25 |               4.36 |
| mt5_large                   |       250100 |               3.3  |               0.3  |               3.22 |               0.31 |               3.53 |
| olmo_7b                     |        50280 |               4.04 |               0.25 |               3.94 |               0.25 |               4.32 |
| orion_14b_chat              |        84608 |               3.94 |               0.25 |               3.85 |               0.26 |               4.22 |
| phi_1                       |        50257 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
| phi_2                       |        50257 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
| pko_t5_large                |        50258 |               1.59 |               0.63 |               1.55 |               0.64 |               1.7  |
| prompt_clue                 |        32128 |               1.95 |               0.51 |               1.91 |               0.52 |               2.09 |
| qwen1_5_14b_chat            |       151643 |               4.06 |               0.25 |               3.97 |               0.25 |               4.35 |
| qwen_1_8b_chat              |       151851 |               4.06 |               0.25 |               3.97 |               0.25 |               4.35 |
| qwen_72b_chat               |       151851 |               4.06 |               0.25 |               3.97 |               0.25 |               4.35 |
| qwen_7b_chat                |       151851 |               4.06 |               0.25 |               3.97 |               0.25 |               4.35 |
| roberta_chinese_clue        |         8021 |               1.8  |               0.56 |               1.75 |               0.57 |               1.92 |
| skywork_13b_base            |        65519 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| skywork_13b_math            |        65519 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| solar_10_7b                 |        32000 |               3.67 |               0.27 |               3.58 |               0.28 |               3.92 |
| starchat_alpha              |        49152 |               3.63 |               0.28 |               3.54 |               0.28 |               3.88 |
| switch_c_2048               |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
| t5_base                     |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
| t5_large                    |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
| t5_small                    |        32100 |               3.61 |               0.28 |               3.53 |               0.28 |               3.87 |
| text_davinci_003            |        50281 |               4.05 |               0.25 |               3.96 |               0.25 |               4.34 |
| tigerbot_13b_chat_v2        |        60512 |               3.67 |               0.27 |               3.58 |               0.28 |               3.93 |
| tigerbot_70b_chat_v4_4k     |        65107 |               3.65 |               0.27 |               3.57 |               0.28 |               3.91 |
| wizardcoder_15b_v1          |        49152 |               3.63 |               0.28 |               3.54 |               0.28 |               3.88 |
| wizardcoder_python_7b_v1    |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| wizardlm_7b_v1              |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| wizardmath_70b_v1           |        32000 |               3.56 |               0.28 |               3.47 |               0.29 |               3.81 |
| xlm_roberta                 |       250002 |               3.49 |               0.29 |               3.41 |               0.29 |               3.74 |
| yi_34b                      |        64000 |               3.87 |               0.26 |               3.78 |               0.26 |               4.15 |
| yi_6b                       |        64000 |               3.87 |               0.26 |               3.78 |               0.26 |               4.15 |
| yi_vl34b                    |        64000 |               3.88 |               0.26 |               3.79 |               0.26 |               4.16 |
| zephyr_7b_beta              |        32000 |               3.67 |               0.27 |               3.58 |               0.28 |               3.92 |

</details>


<details> <summary>Simplified Chinese compression rate</summary>
Compression rate computed on the Simplified Chinese dataset cc100-zh-Hans:

| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   b_tokens/g_bytes |   t_bytes/t_tokens |   t_tokens/t_bytes |   n_chars/n_tokens |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| amber                       |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
| aya_101                     |       250100 |               3.89 |               0.26 |               3.79 |               0.26 |               1.47 |
| baichuan                    |        64000 |               3.92 |               0.26 |               3.82 |               0.26 |               1.48 |
| baichuan2                   |       125696 |               4.53 |               0.22 |               4.42 |               0.23 |               1.71 |
| bert_base_cased             |        28996 |               2.73 |               0.37 |               2.66 |               0.38 |               1.03 |
| bert_base_chinese           |        21128 |               2.74 |               0.37 |               2.67 |               0.37 |               1.03 |
| bert_base_uncased           |        30522 |               2.73 |               0.37 |               2.67 |               0.38 |               1.03 |
| bloom                       |       250680 |               4.28 |               0.23 |               4.18 |               0.24 |               1.62 |
| byt5_small                  |          256 |               0.93 |               1.08 |               0.91 |               1.1  |               0.35 |
| character_glm_6b            |        64794 |               4.2  |               0.24 |               4.1  |               0.24 |               1.59 |
| chatglm2_6b                 |        64794 |               4.2  |               0.24 |               4.1  |               0.24 |               1.59 |
| chatglm3_6b                 |        64798 |               4.2  |               0.24 |               4.1  |               0.24 |               1.59 |
| chatglm_6b                  |       150344 |               4.65 |               0.22 |               4.54 |               0.22 |               1.76 |
| chatyuan_large_v2           |        32128 |               4.34 |               0.23 |               4.24 |               0.24 |               1.64 |
| chinese_llama               |        49953 |               3.93 |               0.25 |               3.84 |               0.26 |               1.49 |
| chinese_llama2              |        55296 |               3.92 |               0.26 |               3.83 |               0.26 |               1.48 |
| code_davinci_002            |        50281 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
| crystal_coder               |        32000 |               1.86 |               0.54 |               1.81 |               0.55 |               0.7  |
| dbrx_instruct               |       100277 |               2.26 |               0.44 |               2.21 |               0.45 |               0.85 |
| deepseek_coder_33b_instruct |        32000 |               3.4  |               0.29 |               3.32 |               0.3  |               1.29 |
| deepseek_llm_7b_base        |       100000 |               4.05 |               0.25 |               3.96 |               0.25 |               1.53 |
| falcon_180b                 |        65024 |               2.18 |               0.46 |               2.13 |               0.47 |               0.82 |
| falcon_7b                   |        65024 |               2.18 |               0.46 |               2.13 |               0.47 |               0.82 |
| fastchat_t5_3b              |        32000 |              13.7  |               0.07 |              13.38 |               0.07 |               5.18 |
| flan_t5_base                |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
| gemma_7b                    |       256000 |               3.82 |               0.26 |               3.73 |               0.27 |               1.44 |
| gpt2                        |        50257 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
| gpt2_chinese                |        21128 |               2.73 |               0.37 |               2.66 |               0.38 |               1.03 |
| gpt_35_turbo                |       100277 |               2.26 |               0.44 |               2.21 |               0.45 |               0.85 |
| gpt_4                       |       100277 |               2.26 |               0.44 |               2.21 |               0.45 |               0.85 |
| gpt_nexo_20b                |        50254 |               2.01 |               0.5  |               1.96 |               0.51 |               0.76 |
| grok_1                      |       131072 |               1.73 |               0.58 |               1.69 |               0.59 |               0.66 |
| internlm2_chat_7b           |        92544 |               4.23 |               0.24 |               4.13 |               0.24 |               1.6  |
| internlm2_math_7b           |        92544 |               4.23 |               0.24 |               4.13 |               0.24 |               1.6  |
| internlm_chat_7b            |       103168 |               4.23 |               0.24 |               4.14 |               0.24 |               1.6  |
| internlm_xcomposer_7b       |       103168 |               4.23 |               0.24 |               4.14 |               0.24 |               1.6  |
| jamba_v0_1                  |        65536 |               2.3  |               0.44 |               2.24 |               0.45 |               0.87 |
| kplug                       |        10261 |               2.72 |               0.37 |               2.65 |               0.38 |               1.03 |
| llama                       |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
| llama2                      |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
| llama3                      |       128000 |               3.28 |               0.3  |               3.2  |               0.31 |               1.24 |
| mistral_7b                  |        32000 |               2.36 |               0.42 |               2.3  |               0.43 |               0.89 |
| mixtral_8_7b                |        32000 |               2.36 |               0.42 |               2.3  |               0.43 |               0.89 |
| mobilebert_uncased          |        30522 |               2.73 |               0.37 |               2.67 |               0.38 |               1.03 |
| moss                        |       106029 |               4.4  |               0.23 |               4.3  |               0.23 |               1.66 |
| mt5_large                   |       250100 |               3.89 |               0.26 |               3.79 |               0.26 |               1.47 |
| olmo_7b                     |        50280 |               2.01 |               0.5  |               1.96 |               0.51 |               0.76 |
| orion_14b_chat              |        84608 |               4.63 |               0.22 |               4.52 |               0.22 |               1.75 |
| phi_1                       |        50257 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
| phi_2                       |        50257 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
| pko_t5_large                |        50258 |               0.97 |               1.03 |               0.95 |               1.06 |               0.37 |
| prompt_clue                 |        32128 |               4.34 |               0.23 |               4.24 |               0.24 |               1.64 |
| qwen1_5_14b_chat            |       151643 |               4.16 |               0.24 |               4.06 |               0.25 |               1.57 |
| qwen_1_8b_chat              |       151851 |               4.16 |               0.24 |               4.06 |               0.25 |               1.57 |
| qwen_72b_chat               |       151851 |               4.16 |               0.24 |               4.06 |               0.25 |               1.57 |
| qwen_7b_chat                |       151851 |               4.16 |               0.24 |               4.06 |               0.25 |               1.57 |
| roberta_chinese_clue        |         8021 |               2.7  |               0.37 |               2.64 |               0.38 |               1.02 |
| skywork_13b_base            |        65519 |               3.69 |               0.27 |               3.61 |               0.28 |               1.4  |
| skywork_13b_math            |        65519 |               3.69 |               0.27 |               3.61 |               0.28 |               1.4  |
| solar_10_7b                 |        32000 |               2.36 |               0.42 |               2.3  |               0.43 |               0.89 |
| starchat_alpha              |        49152 |               2.78 |               0.36 |               2.72 |               0.37 |               1.05 |
| switch_c_2048               |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
| t5_base                     |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
| t5_large                    |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
| t5_small                    |        32100 |              14.13 |               0.07 |              13.8  |               0.07 |               5.34 |
| text_davinci_003            |        50281 |               1.31 |               0.77 |               1.28 |               0.78 |               0.49 |
| tigerbot_13b_chat_v2        |        60512 |               4.25 |               0.24 |               4.15 |               0.24 |               1.61 |
| tigerbot_70b_chat_v4_4k     |        65107 |               4.25 |               0.24 |               4.15 |               0.24 |               1.61 |
| wizardcoder_15b_v1          |        49152 |               2.78 |               0.36 |               2.72 |               0.37 |               1.05 |
| wizardcoder_python_7b_v1    |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
| wizardlm_7b_v1              |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
| wizardmath_70b_v1           |        32000 |               1.84 |               0.54 |               1.8  |               0.56 |               0.7  |
| xlm_roberta                 |       250002 |               3.96 |               0.25 |               3.86 |               0.26 |               1.5  |
| yi_34b                      |        64000 |               4.17 |               0.24 |               4.07 |               0.25 |               1.58 |
| yi_6b                       |        64000 |               4.17 |               0.24 |               4.07 |               0.25 |               1.58 |
| yi_vl34b                    |        64000 |               4.11 |               0.24 |               4.02 |               0.25 |               1.56 |
| zephyr_7b_beta              |        32000 |               2.36 |               0.42 |               2.3  |               0.43 |               0.89 |

</details>




## References

- paper
  - [Getting the most out of your tokenizer for pre-training and domain adaptation](https://arxiv.org/abs/2402.01035)
  - [Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca](https://arxiv.org/abs/2304.08177)
- blog
  - https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
  - https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece
  - https://www.huaxiaozhuan.com/%E5%B7%A5%E5%85%B7/huggingface_transformer/chapters/1_tokenizer.html
  - https://zhuanlan.zhihu.com/p/652520262
  - https://github.com/QwenLM/Qwen/blob/main/tokenization_note_zh.md
- demo
  - https://huggingface.co/spaces/Xenova/the-tokenizer-playground
  - https://github.com/dqbd/tiktokenizer
  - https://chat.lmsys.org/?leaderboard
  - https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard