---
title: Tokenizer Arena
emoji: 
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 3.41.2
app_file: app.py
pinned: false
---



## Compression Rate


On the [cc-100](https://huggingface.co/datasets/cc100) dataset, we sample 10,000 documents per language and measure the compression rate of each tokenizer. The compression-rate metric is `g_bytes/b_tokens` (gigabytes of raw text per billion tokens, i.e. bytes per token).

You can reproduce the results with the following script:
```sh
python utils/compress_rate_util.py 
```
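
Since both `g_bytes` and `b_tokens` carry the same 10^9 scale factor, the metric reduces to plain bytes per token. The snippet below is a minimal sketch of how such a number can be computed with a Hugging Face tokenizer; it is not the actual `utils/compress_rate_util.py`, and the `gpt2` checkpoint and sample sentences are placeholder assumptions.

```python
# Minimal sketch of a bytes-per-token compression rate (g_bytes/b_tokens).
# Not the repo's utils/compress_rate_util.py; checkpoint and texts are placeholders.
from transformers import AutoTokenizer

def compress_rate(texts, checkpoint="gpt2"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)  # g_bytes * 1e9
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )  # b_tokens * 1e9
    # Both counts share the 1e9 scale, so the ratio is simply bytes per token.
    return total_bytes / total_tokens

if __name__ == "__main__":
    samples = ["The quick brown fox jumps over the lazy dog.", "今天天气很好,我们去公园散步吧。"]
    print(f"bytes per token: {compress_rate(samples):.2f}")
```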





<details> <summary>Simplified Chinese compression rate</summary>
Compression rate computed on the Simplified Chinese subset cc100-zh-Hans.


| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   t_bytes/t_tokens |   b_tokens/g_bytes |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
| amber                       |        32000 |               1.84 |               1.8  |               0.54 |
| aya_101                     |       250100 |               3.89 |               3.79 |               0.26 |
| baichuan                    |        64000 |               3.92 |               3.82 |               0.26 |
| baichuan2                   |       125696 |               4.53 |               4.42 |               0.22 |
| bert_base_cased             |        28996 |               2.73 |               2.66 |               0.37 |
| bert_base_chinese           |        21128 |               2.74 |               2.67 |               0.37 |
| bert_base_uncased           |        30522 |               2.73 |               2.67 |               0.37 |
| bloom                       |       250680 |               4.28 |               4.18 |               0.23 |
| byt5_small                  |          256 |               0.93 |               0.91 |               1.08 |
| character_glm_6b            |        64794 |               4.2  |               4.1  |               0.24 |
| chatglm2_6b                 |        64794 |               4.2  |               4.1  |               0.24 |
| chatglm3_6b                 |        64798 |               4.2  |               4.1  |               0.24 |
| chatglm_6b                  |       150344 |               4.65 |               4.54 |               0.22 |
| chatyuan_large_v2           |        32128 |               4.34 |               4.24 |               0.23 |
| chinese_llama               |        49953 |               3.93 |               3.84 |               0.25 |
| chinese_llama2              |        55296 |               3.92 |               3.83 |               0.26 |
| code_davinci_002            |        50281 |               1.31 |               1.28 |               0.77 |
| crystal_coder               |        32000 |               1.86 |               1.81 |               0.54 |
| deepseek_coder_33b_instruct |        32000 |               3.4  |               3.32 |               0.29 |
| deepseek_llm_7b_base        |       100000 |               4.05 |               3.96 |               0.25 |
| falcon_180b                 |        65024 |               2.18 |               2.13 |               0.46 |
| falcon_7b                   |        65024 |               2.18 |               2.13 |               0.46 |
| fastchat_t5_3b              |        32000 |              13.7  |              13.38 |               0.07 |
| flan_t5_base                |        32100 |              14.13 |              13.8  |               0.07 |
| gemma_7b                    |       256000 |               3.82 |               3.73 |               0.26 |
| gpt2                        |        50257 |               1.31 |               1.28 |               0.77 |
| gpt2_chinese                |        21128 |               2.73 |               2.66 |               0.37 |
| gpt_35_turbo                |       100277 |               2.26 |               2.21 |               0.44 |
| gpt_4                       |       100277 |               2.26 |               2.21 |               0.44 |
| gpt_nexo_20b                |        50254 |               2.01 |               1.96 |               0.5  |
| internlm2_chat_7b           |        92544 |               4.23 |               4.13 |               0.24 |
| internlm2_math_7b           |        92544 |               4.23 |               4.13 |               0.24 |
| internlm_chat_7b            |       103168 |               4.23 |               4.14 |               0.24 |
| internlm_xcomposer_7b       |       103168 |               4.23 |               4.14 |               0.24 |
| kplug                       |        10261 |               2.72 |               2.65 |               0.37 |
| llama                       |        32000 |               1.84 |               1.8  |               0.54 |
| llama2                      |        32000 |               1.84 |               1.8  |               0.54 |
| mistral_7b                  |        32000 |               2.36 |               2.3  |               0.42 |
| mixtral_8_7b                |        32000 |               2.36 |               2.3  |               0.42 |
| mobilebert_uncased          |        30522 |               2.73 |               2.67 |               0.37 |
| moss                        |       106029 |               4.4  |               4.3  |               0.23 |
| mt5_large                   |       250100 |               3.89 |               3.79 |               0.26 |
| olmo_7b                     |        50280 |               2.01 |               1.96 |               0.5  |
| orion_14b_chat              |        84608 |               4.63 |               4.52 |               0.22 |
| phi_1                       |        50257 |               1.31 |               1.28 |               0.77 |
| phi_2                       |        50257 |               1.31 |               1.28 |               0.77 |
| pko_t5_large                |        50258 |               0.97 |               0.95 |               1.03 |
| prompt_clue                 |        32128 |               4.34 |               4.24 |               0.23 |
| qwen1_5_14b_chat            |       151643 |               4.16 |               4.06 |               0.24 |
| qwen_1_8b_chat              |       151851 |               4.16 |               4.06 |               0.24 |
| qwen_72b_chat               |       151851 |               4.16 |               4.06 |               0.24 |
| qwen_7b_chat                |       151851 |               4.16 |               4.06 |               0.24 |
| roberta_chinese_clue        |         8021 |               2.7  |               2.64 |               0.37 |
| skywork_13b_base            |        65519 |               3.69 |               3.61 |               0.27 |
| skywork_13b_math            |        65519 |               3.69 |               3.61 |               0.27 |
| solar_10_7b                 |        32000 |               2.36 |               2.3  |               0.42 |
| starchat_alpha              |        49152 |               2.78 |               2.72 |               0.36 |
| switch_c_2048               |        32100 |              14.13 |              13.8  |               0.07 |
| t5_base                     |        32100 |              14.13 |              13.8  |               0.07 |
| t5_large                    |        32100 |              14.13 |              13.8  |               0.07 |
| t5_small                    |        32100 |              14.13 |              13.8  |               0.07 |
| text_davinci_003            |        50281 |               1.31 |               1.28 |               0.77 |
| tigerbot_13b_chat_v2        |        60512 |               4.25 |               4.15 |               0.24 |
| tigerbot_70b_chat_v4_4k     |        65107 |               4.25 |               4.15 |               0.24 |
| wizardcoder_15b_v1          |        49152 |               2.78 |               2.72 |               0.36 |
| wizardcoder_python_7b_v1    |        32000 |               1.84 |               1.8  |               0.54 |
| wizardlm_7b_v1              |        32000 |               1.84 |               1.8  |               0.54 |
| wizardmath_70b_v1           |        32000 |               1.84 |               1.8  |               0.54 |
| xlm_roberta                 |       250002 |               3.96 |               3.86 |               0.25 |
| yi_34b                      |        64000 |               4.17 |               4.07 |               0.24 |
| yi_6b                       |        64000 |               4.17 |               4.07 |               0.24 |
| yi_vl34b                    |        64000 |               4.11 |               4.02 |               0.24 |
| zephyr_7b_beta              |        32000 |               2.36 |               2.3  |               0.42 |
</details>


**Conclusion**

Tokenizers with larger vocabularies generally achieve higher compression on Chinese text: the 32k-vocabulary llama tokenizer yields 1.84 bytes per token, while the ~150k-vocabulary qwen and chatglm tokenizers yield over 4 bytes per token. Byte-level tokenizers such as byt5 sit at the other extreme, at roughly one byte per token.
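
As a quick check of this trend, the sketch below compares a small English-centric vocabulary with a large multilingual one on the same Chinese sentence; the checkpoint names are public Hugging Face IDs assumed to be accessible and are not part of this repo.

```python
# Compare a small-vocabulary and a large-vocabulary tokenizer on one Chinese
# sentence; fewer tokens for the same text means more bytes per token.
from transformers import AutoTokenizer

text = "今天天气很好,我们去公园散步吧。"
for name in ["gpt2", "Qwen/Qwen1.5-14B-Chat"]:  # assumed accessible checkpoints
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    print(f"{name}: vocab={tok.vocab_size}, tokens={n_tokens}, "
          f"bytes/token={len(text.encode('utf-8')) / n_tokens:.2f}")
```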



## Reference

- Getting the most out of your tokenizer for pre-training and domain adaptation
- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- https://huggingface.co/spaces/Xenova/the-tokenizer-playground