update compress rate
This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full change set.
- README.md +174 -81
- app.py +2 -2
- stats/README.md +0 -0
- stats/compress_rate/amber.en.json +1 -0
- stats/compress_rate/amber.zh-Hans.json +1 -0
- stats/compress_rate/aya_101.en.json +1 -0
- stats/compress_rate/aya_101.zh-Hans.json +1 -0
- stats/compress_rate/baichuan.en.json +1 -0
- stats/compress_rate/baichuan.zh-Hans.json +1 -0
- stats/compress_rate/baichuan2.en.json +1 -0
- stats/compress_rate/baichuan2.zh-Hans.json +1 -0
- stats/compress_rate/bert_base_cased.en.json +1 -0
- stats/compress_rate/bert_base_cased.zh-Hans.json +1 -0
- stats/compress_rate/bert_base_chinese.en.json +1 -0
- stats/compress_rate/bert_base_chinese.zh-Hans.json +1 -0
- stats/compress_rate/bert_base_uncased.en.json +1 -0
- stats/compress_rate/bert_base_uncased.zh-Hans.json +1 -0
- stats/compress_rate/bloom.en.json +1 -0
- stats/compress_rate/bloom.zh-Hans.json +1 -0
- stats/compress_rate/byt5_small.en.json +1 -0
- stats/compress_rate/byt5_small.zh-Hans.json +1 -0
- stats/compress_rate/character_glm_6b.en.json +1 -0
- stats/compress_rate/character_glm_6b.zh-Hans.json +1 -0
- stats/compress_rate/chatglm2_6b.en.json +1 -0
- stats/compress_rate/chatglm2_6b.zh-Hans.json +1 -0
- stats/compress_rate/chatglm3_6b.en.json +1 -0
- stats/compress_rate/chatglm3_6b.zh-Hans.json +1 -0
- stats/compress_rate/chatglm_6b.en.json +1 -0
- stats/compress_rate/chatglm_6b.zh-Hans.json +1 -0
- stats/compress_rate/chatyuan_large_v2.en.json +1 -0
- stats/compress_rate/chatyuan_large_v2.zh-Hans.json +1 -0
- stats/compress_rate/chinese_llama.en.json +1 -0
- stats/compress_rate/chinese_llama.zh-Hans.json +1 -0
- stats/compress_rate/chinese_llama2.en.json +1 -0
- stats/compress_rate/chinese_llama2.zh-Hans.json +1 -0
- stats/compress_rate/code_davinci_002.en.json +1 -0
- stats/compress_rate/code_davinci_002.zh-Hans.json +1 -0
- stats/compress_rate/crystal_coder.en.json +1 -0
- stats/compress_rate/crystal_coder.zh-Hans.json +1 -0
- stats/compress_rate/dbrx_instruct.en.json +1 -0
- stats/compress_rate/dbrx_instruct.zh-Hans.json +1 -0
- stats/compress_rate/deepseek_coder_33b_instruct.en.json +1 -0
- stats/compress_rate/deepseek_coder_33b_instruct.zh-Hans.json +1 -0
- stats/compress_rate/deepseek_llm_7b_base.en.json +1 -0
- stats/compress_rate/deepseek_llm_7b_base.zh-Hans.json +1 -0
- stats/compress_rate/falcon_180b.en.json +1 -0
- stats/compress_rate/falcon_180b.zh-Hans.json +1 -0
- stats/compress_rate/falcon_7b.en.json +1 -0
- stats/compress_rate/falcon_7b.zh-Hans.json +1 -0
- stats/compress_rate/fastchat_t5_3b.en.json +1 -0
README.md
CHANGED
````diff
@@ -14,9 +14,17 @@ pinned: false
 ## Compress Rate
 
 
-On the [cc-100](https://huggingface.co/datasets/cc100) dataset, 10,000 documents are sampled per language to test different tokenizers
+On the [cc-100](https://huggingface.co/datasets/cc100) dataset, 10,000 documents are sampled per language to measure each tokenizer's compression rate.
 
-
+> Compression-rate example:
+llama3 enlarges the vocabulary and achieves a higher compression ratio: for the same 1 TB of Simplified Chinese text, llama tokenizes it into 0.56 trillion tokens, while llama3 needs only 0.31 trillion.
+
+| tokenizer | vocab_size | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
+|:----------|-----------:|-----------------:|-----------------:|-----------------:|
+| llama | 32000 | 1.8 | 0.56 | 0.7 |
+| llama3 | 128000 | 3.2 | 0.31 | 1.24 |
+
+The results can be reproduced with the following script:
 ```sh
 python utils/compress_rate_util.py
 ```
@@ -24,92 +32,177 @@ python utils/compress_rate_util.py
 
 
 
+<details> <summary>English compression rate</summary>
+Compression rate computed on the English dataset cc100-en
+
+| tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
+|:----------------------------|-----------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|
+| amber | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| aya_101 | 250100 | 3.3 | 0.3 | 3.22 | 0.31 | 3.53 |
+| baichuan | 64000 | 3.74 | 0.27 | 3.65 | 0.27 | 4 |
+| baichuan2 | 125696 | 3.89 | 0.26 | 3.8 | 0.26 | 4.17 |
+| bert_base_cased | 28996 | 3.64 | 0.27 | 3.55 | 0.28 | 3.89 |
+| bert_base_chinese | 21128 | 2.78 | 0.36 | 2.71 | 0.37 | 2.97 |
+| bert_base_uncased | 30522 | 3.73 | 0.27 | 3.65 | 0.27 | 4 |
+| bloom | 250680 | 4.07 | 0.25 | 3.97 | 0.25 | 4.36 |
+| byt5_small | 256 | 0.92 | 1.08 | 0.9 | 1.11 | 0.99 |
+| character_glm_6b | 64794 | 3.62 | 0.28 | 3.54 | 0.28 | 3.88 |
+| chatglm2_6b | 64794 | 3.62 | 0.28 | 3.54 | 0.28 | 3.88 |
+| chatglm3_6b | 64798 | 3.62 | 0.28 | 3.54 | 0.28 | 3.88 |
+| chatglm_6b | 150344 | 3.68 | 0.27 | 3.59 | 0.28 | 3.94 |
+| chatyuan_large_v2 | 32128 | 1.95 | 0.51 | 1.91 | 0.52 | 2.09 |
+| chinese_llama | 49953 | 3.59 | 0.28 | 3.51 | 0.28 | 3.85 |
+| chinese_llama2 | 55296 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| code_davinci_002 | 50281 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+| crystal_coder | 32000 | 3.68 | 0.27 | 3.59 | 0.28 | 3.94 |
+| dbrx_instruct | 100277 | 4.11 | 0.24 | 4.01 | 0.25 | 4.4 |
+| deepseek_coder_33b_instruct | 32000 | 3.64 | 0.27 | 3.56 | 0.28 | 3.9 |
+| deepseek_llm_7b_base | 100000 | 3.85 | 0.26 | 3.76 | 0.27 | 4.12 |
+| falcon_180b | 65024 | 3.99 | 0.25 | 3.9 | 0.26 | 4.27 |
+| falcon_7b | 65024 | 3.99 | 0.25 | 3.9 | 0.26 | 4.27 |
+| fastchat_t5_3b | 32000 | 2.16 | 0.46 | 2.11 | 0.47 | 2.31 |
+| flan_t5_base | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+| gemma_7b | 256000 | 3.91 | 0.26 | 3.82 | 0.26 | 4.18 |
+| gpt2 | 50257 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+| gpt2_chinese | 21128 | 2.67 | 0.37 | 2.61 | 0.38 | 2.86 |
+| gpt_35_turbo | 100277 | 4.11 | 0.24 | 4.01 | 0.25 | 4.4 |
+| gpt_4 | 100277 | 4.11 | 0.24 | 4.01 | 0.25 | 4.4 |
+| gpt_nexo_20b | 50254 | 4.04 | 0.25 | 3.94 | 0.25 | 4.32 |
+| grok_1 | 131072 | 4.06 | 0.25 | 3.96 | 0.25 | 4.35 |
+| internlm2_chat_7b | 92544 | 3.86 | 0.26 | 3.77 | 0.27 | 4.13 |
+| internlm2_math_7b | 92544 | 3.86 | 0.26 | 3.77 | 0.27 | 4.13 |
+| internlm_chat_7b | 103168 | 3.86 | 0.26 | 3.77 | 0.27 | 4.13 |
+| internlm_xcomposer_7b | 103168 | 3.86 | 0.26 | 3.77 | 0.27 | 4.13 |
+| jamba_v0_1 | 65536 | 3.82 | 0.26 | 3.73 | 0.27 | 4.09 |
+| kplug | 10261 | 2.66 | 0.38 | 2.6 | 0.38 | 2.85 |
+| llama | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| llama2 | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| llama3 | 128000 | 4.11 | 0.24 | 4.01 | 0.25 | 4.4 |
+| mistral_7b | 32000 | 3.67 | 0.27 | 3.58 | 0.28 | 3.92 |
+| mixtral_8_7b | 32000 | 3.67 | 0.27 | 3.58 | 0.28 | 3.92 |
+| mobilebert_uncased | 30522 | 3.73 | 0.27 | 3.65 | 0.27 | 4 |
+| moss | 106029 | 4.08 | 0.25 | 3.98 | 0.25 | 4.36 |
+| mt5_large | 250100 | 3.3 | 0.3 | 3.22 | 0.31 | 3.53 |
+| olmo_7b | 50280 | 4.04 | 0.25 | 3.94 | 0.25 | 4.32 |
+| orion_14b_chat | 84608 | 3.94 | 0.25 | 3.85 | 0.26 | 4.22 |
+| phi_1 | 50257 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+| phi_2 | 50257 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+| pko_t5_large | 50258 | 1.59 | 0.63 | 1.55 | 0.64 | 1.7 |
+| prompt_clue | 32128 | 1.95 | 0.51 | 1.91 | 0.52 | 2.09 |
+| qwen1_5_14b_chat | 151643 | 4.06 | 0.25 | 3.97 | 0.25 | 4.35 |
+| qwen_1_8b_chat | 151851 | 4.06 | 0.25 | 3.97 | 0.25 | 4.35 |
+| qwen_72b_chat | 151851 | 4.06 | 0.25 | 3.97 | 0.25 | 4.35 |
+| qwen_7b_chat | 151851 | 4.06 | 0.25 | 3.97 | 0.25 | 4.35 |
+| roberta_chinese_clue | 8021 | 1.8 | 0.56 | 1.75 | 0.57 | 1.92 |
+| skywork_13b_base | 65519 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| skywork_13b_math | 65519 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| solar_10_7b | 32000 | 3.67 | 0.27 | 3.58 | 0.28 | 3.92 |
+| starchat_alpha | 49152 | 3.63 | 0.28 | 3.54 | 0.28 | 3.88 |
+| switch_c_2048 | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+| t5_base | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+| t5_large | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+| t5_small | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+| text_davinci_003 | 50281 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+| tigerbot_13b_chat_v2 | 60512 | 3.67 | 0.27 | 3.58 | 0.28 | 3.93 |
+| tigerbot_70b_chat_v4_4k | 65107 | 3.65 | 0.27 | 3.57 | 0.28 | 3.91 |
+| wizardcoder_15b_v1 | 49152 | 3.63 | 0.28 | 3.54 | 0.28 | 3.88 |
+| wizardcoder_python_7b_v1 | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| wizardlm_7b_v1 | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| wizardmath_70b_v1 | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+| xlm_roberta | 250002 | 3.49 | 0.29 | 3.41 | 0.29 | 3.74 |
+| yi_34b | 64000 | 3.87 | 0.26 | 3.78 | 0.26 | 4.15 |
+| yi_6b | 64000 | 3.87 | 0.26 | 3.78 | 0.26 | 4.15 |
+| yi_vl34b | 64000 | 3.88 | 0.26 | 3.79 | 0.26 | 4.16 |
+| zephyr_7b_beta | 32000 | 3.67 | 0.27 | 3.58 | 0.28 | 3.92 |
+
+</details>
+
 
 <details> <summary>Simplified Chinese compression rate</summary>
 Compression rate computed on the Simplified Chinese dataset cc100-zh-Hans
 
+| tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
+|:----------------------------|-----------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|
+| amber | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+| aya_101 | 250100 | 3.89 | 0.26 | 3.79 | 0.26 | 1.47 |
+| baichuan | 64000 | 3.92 | 0.26 | 3.82 | 0.26 | 1.48 |
+| baichuan2 | 125696 | 4.53 | 0.22 | 4.42 | 0.23 | 1.71 |
+| bert_base_cased | 28996 | 2.73 | 0.37 | 2.66 | 0.38 | 1.03 |
+| bert_base_chinese | 21128 | 2.74 | 0.37 | 2.67 | 0.37 | 1.03 |
+| bert_base_uncased | 30522 | 2.73 | 0.37 | 2.67 | 0.38 | 1.03 |
+| bloom | 250680 | 4.28 | 0.23 | 4.18 | 0.24 | 1.62 |
+| byt5_small | 256 | 0.93 | 1.08 | 0.91 | 1.1 | 0.35 |
+| character_glm_6b | 64794 | 4.2 | 0.24 | 4.1 | 0.24 | 1.59 |
+| chatglm2_6b | 64794 | 4.2 | 0.24 | 4.1 | 0.24 | 1.59 |
+| chatglm3_6b | 64798 | 4.2 | 0.24 | 4.1 | 0.24 | 1.59 |
+| chatglm_6b | 150344 | 4.65 | 0.22 | 4.54 | 0.22 | 1.76 |
+| chatyuan_large_v2 | 32128 | 4.34 | 0.23 | 4.24 | 0.24 | 1.64 |
+| chinese_llama | 49953 | 3.93 | 0.25 | 3.84 | 0.26 | 1.49 |
+| chinese_llama2 | 55296 | 3.92 | 0.26 | 3.83 | 0.26 | 1.48 |
+| code_davinci_002 | 50281 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+| crystal_coder | 32000 | 1.86 | 0.54 | 1.81 | 0.55 | 0.7 |
+| dbrx_instruct | 100277 | 2.26 | 0.44 | 2.21 | 0.45 | 0.85 |
+| deepseek_coder_33b_instruct | 32000 | 3.4 | 0.29 | 3.32 | 0.3 | 1.29 |
+| deepseek_llm_7b_base | 100000 | 4.05 | 0.25 | 3.96 | 0.25 | 1.53 |
+| falcon_180b | 65024 | 2.18 | 0.46 | 2.13 | 0.47 | 0.82 |
+| falcon_7b | 65024 | 2.18 | 0.46 | 2.13 | 0.47 | 0.82 |
+| fastchat_t5_3b | 32000 | 13.7 | 0.07 | 13.38 | 0.07 | 5.18 |
+| flan_t5_base | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+| gemma_7b | 256000 | 3.82 | 0.26 | 3.73 | 0.27 | 1.44 |
+| gpt2 | 50257 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+| gpt2_chinese | 21128 | 2.73 | 0.37 | 2.66 | 0.38 | 1.03 |
+| gpt_35_turbo | 100277 | 2.26 | 0.44 | 2.21 | 0.45 | 0.85 |
+| gpt_4 | 100277 | 2.26 | 0.44 | 2.21 | 0.45 | 0.85 |
+| gpt_nexo_20b | 50254 | 2.01 | 0.5 | 1.96 | 0.51 | 0.76 |
+| grok_1 | 131072 | 1.73 | 0.58 | 1.69 | 0.59 | 0.66 |
+| internlm2_chat_7b | 92544 | 4.23 | 0.24 | 4.13 | 0.24 | 1.6 |
+| internlm2_math_7b | 92544 | 4.23 | 0.24 | 4.13 | 0.24 | 1.6 |
+| internlm_chat_7b | 103168 | 4.23 | 0.24 | 4.14 | 0.24 | 1.6 |
+| internlm_xcomposer_7b | 103168 | 4.23 | 0.24 | 4.14 | 0.24 | 1.6 |
+| jamba_v0_1 | 65536 | 2.3 | 0.44 | 2.24 | 0.45 | 0.87 |
+| kplug | 10261 | 2.72 | 0.37 | 2.65 | 0.38 | 1.03 |
+| llama | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+| llama2 | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+| llama3 | 128000 | 3.28 | 0.3 | 3.2 | 0.31 | 1.24 |
+| mistral_7b | 32000 | 2.36 | 0.42 | 2.3 | 0.43 | 0.89 |
+| mixtral_8_7b | 32000 | 2.36 | 0.42 | 2.3 | 0.43 | 0.89 |
+| mobilebert_uncased | 30522 | 2.73 | 0.37 | 2.67 | 0.38 | 1.03 |
+| moss | 106029 | 4.4 | 0.23 | 4.3 | 0.23 | 1.66 |
+| mt5_large | 250100 | 3.89 | 0.26 | 3.79 | 0.26 | 1.47 |
+| olmo_7b | 50280 | 2.01 | 0.5 | 1.96 | 0.51 | 0.76 |
+| orion_14b_chat | 84608 | 4.63 | 0.22 | 4.52 | 0.22 | 1.75 |
+| phi_1 | 50257 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+| phi_2 | 50257 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+| pko_t5_large | 50258 | 0.97 | 1.03 | 0.95 | 1.06 | 0.37 |
+| prompt_clue | 32128 | 4.34 | 0.23 | 4.24 | 0.24 | 1.64 |
+| qwen1_5_14b_chat | 151643 | 4.16 | 0.24 | 4.06 | 0.25 | 1.57 |
+| qwen_1_8b_chat | 151851 | 4.16 | 0.24 | 4.06 | 0.25 | 1.57 |
+| qwen_72b_chat | 151851 | 4.16 | 0.24 | 4.06 | 0.25 | 1.57 |
+| qwen_7b_chat | 151851 | 4.16 | 0.24 | 4.06 | 0.25 | 1.57 |
+| roberta_chinese_clue | 8021 | 2.7 | 0.37 | 2.64 | 0.38 | 1.02 |
+| skywork_13b_base | 65519 | 3.69 | 0.27 | 3.61 | 0.28 | 1.4 |
+| skywork_13b_math | 65519 | 3.69 | 0.27 | 3.61 | 0.28 | 1.4 |
+| solar_10_7b | 32000 | 2.36 | 0.42 | 2.3 | 0.43 | 0.89 |
+| starchat_alpha | 49152 | 2.78 | 0.36 | 2.72 | 0.37 | 1.05 |
+| switch_c_2048 | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+| t5_base | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+| t5_large | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+| t5_small | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+| text_davinci_003 | 50281 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+| tigerbot_13b_chat_v2 | 60512 | 4.25 | 0.24 | 4.15 | 0.24 | 1.61 |
+| tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 0.24 | 4.15 | 0.24 | 1.61 |
+| wizardcoder_15b_v1 | 49152 | 2.78 | 0.36 | 2.72 | 0.37 | 1.05 |
+| wizardcoder_python_7b_v1 | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+| wizardlm_7b_v1 | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+| wizardmath_70b_v1 | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+| xlm_roberta | 250002 | 3.96 | 0.25 | 3.86 | 0.26 | 1.5 |
+| yi_34b | 64000 | 4.17 | 0.24 | 4.07 | 0.25 | 1.58 |
+| yi_6b | 64000 | 4.17 | 0.24 | 4.07 | 0.25 | 1.58 |
+| yi_vl34b | 64000 | 4.11 | 0.24 | 4.02 | 0.25 | 1.56 |
+| zephyr_7b_beta | 32000 | 2.36 | 0.42 | 2.3 | 0.43 | 0.89 |
 
 </details>
 
-| tokenizer | vocab_size | g_bytes/b_tokens | t_bytes/t_tokens | b_tokens/g_bytes |
-|:----------------------------|-----------:|-----------------:|-----------------:|-----------------:|
-| amber | 32000 | 1.84 | 1.8 | 0.54 |
-| aya_101 | 250100 | 3.89 | 3.79 | 0.26 |
-| baichuan | 64000 | 3.92 | 3.82 | 0.26 |
-| baichuan2 | 125696 | 4.53 | 4.42 | 0.22 |
-| bert_base_cased | 28996 | 2.73 | 2.66 | 0.37 |
-| bert_base_chinese | 21128 | 2.74 | 2.67 | 0.37 |
-| bert_base_uncased | 30522 | 2.73 | 2.67 | 0.37 |
-| bloom | 250680 | 4.28 | 4.18 | 0.23 |
-| byt5_small | 256 | 0.93 | 0.91 | 1.08 |
-| character_glm_6b | 64794 | 4.2 | 4.1 | 0.24 |
-| chatglm2_6b | 64794 | 4.2 | 4.1 | 0.24 |
-| chatglm3_6b | 64798 | 4.2 | 4.1 | 0.24 |
-| chatglm_6b | 150344 | 4.65 | 4.54 | 0.22 |
-| chatyuan_large_v2 | 32128 | 4.34 | 4.24 | 0.23 |
-| chinese_llama | 49953 | 3.93 | 3.84 | 0.25 |
-| chinese_llama2 | 55296 | 3.92 | 3.83 | 0.26 |
-| code_davinci_002 | 50281 | 1.31 | 1.28 | 0.77 |
-| crystal_coder | 32000 | 1.86 | 1.81 | 0.54 |
-| deepseek_coder_33b_instruct | 32000 | 3.4 | 3.32 | 0.29 |
-| deepseek_llm_7b_base | 100000 | 4.05 | 3.96 | 0.25 |
-| falcon_180b | 65024 | 2.18 | 2.13 | 0.46 |
-| falcon_7b | 65024 | 2.18 | 2.13 | 0.46 |
-| fastchat_t5_3b | 32000 | 13.7 | 13.38 | 0.07 |
-| flan_t5_base | 32100 | 14.13 | 13.8 | 0.07 |
-| gemma_7b | 256000 | 3.82 | 3.73 | 0.26 |
-| gpt2 | 50257 | 1.31 | 1.28 | 0.77 |
-| gpt2_chinese | 21128 | 2.73 | 2.66 | 0.37 |
-| gpt_35_turbo | 100277 | 2.26 | 2.21 | 0.44 |
-| gpt_4 | 100277 | 2.26 | 2.21 | 0.44 |
-| gpt_nexo_20b | 50254 | 2.01 | 1.96 | 0.5 |
-| internlm2_chat_7b | 92544 | 4.23 | 4.13 | 0.24 |
-| internlm2_math_7b | 92544 | 4.23 | 4.13 | 0.24 |
-| internlm_chat_7b | 103168 | 4.23 | 4.14 | 0.24 |
-| internlm_xcomposer_7b | 103168 | 4.23 | 4.14 | 0.24 |
-| kplug | 10261 | 2.72 | 2.65 | 0.37 |
-| llama | 32000 | 1.84 | 1.8 | 0.54 |
-| llama2 | 32000 | 1.84 | 1.8 | 0.54 |
-| mistral_7b | 32000 | 2.36 | 2.3 | 0.42 |
-| mixtral_8_7b | 32000 | 2.36 | 2.3 | 0.42 |
-| mobilebert_uncased | 30522 | 2.73 | 2.67 | 0.37 |
-| moss | 106029 | 4.4 | 4.3 | 0.23 |
-| mt5_large | 250100 | 3.89 | 3.79 | 0.26 |
-| olmo_7b | 50280 | 2.01 | 1.96 | 0.5 |
-| orion_14b_chat | 84608 | 4.63 | 4.52 | 0.22 |
-| phi_1 | 50257 | 1.31 | 1.28 | 0.77 |
-| phi_2 | 50257 | 1.31 | 1.28 | 0.77 |
-| pko_t5_large | 50258 | 0.97 | 0.95 | 1.03 |
-| prompt_clue | 32128 | 4.34 | 4.24 | 0.23 |
-| qwen1_5_14b_chat | 151643 | 4.16 | 4.06 | 0.24 |
-| qwen_1_8b_chat | 151851 | 4.16 | 4.06 | 0.24 |
-| qwen_72b_chat | 151851 | 4.16 | 4.06 | 0.24 |
-| qwen_7b_chat | 151851 | 4.16 | 4.06 | 0.24 |
-| roberta_chinese_clue | 8021 | 2.7 | 2.64 | 0.37 |
-| skywork_13b_base | 65519 | 3.69 | 3.61 | 0.27 |
-| skywork_13b_math | 65519 | 3.69 | 3.61 | 0.27 |
-| solar_10_7b | 32000 | 2.36 | 2.3 | 0.42 |
-| starchat_alpha | 49152 | 2.78 | 2.72 | 0.36 |
-| switch_c_2048 | 32100 | 14.13 | 13.8 | 0.07 |
-| t5_base | 32100 | 14.13 | 13.8 | 0.07 |
-| t5_large | 32100 | 14.13 | 13.8 | 0.07 |
-| t5_small | 32100 | 14.13 | 13.8 | 0.07 |
-| text_davinci_003 | 50281 | 1.31 | 1.28 | 0.77 |
-| tigerbot_13b_chat_v2 | 60512 | 4.25 | 4.15 | 0.24 |
-| tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 4.15 | 0.24 |
-| wizardcoder_15b_v1 | 49152 | 2.78 | 2.72 | 0.36 |
-| wizardcoder_python_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
-| wizardlm_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
-| wizardmath_70b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
-| xlm_roberta | 250002 | 3.96 | 3.86 | 0.25 |
-| yi_34b | 64000 | 4.17 | 4.07 | 0.24 |
-| yi_6b | 64000 | 4.17 | 4.07 | 0.24 |
-| yi_vl34b | 64000 | 4.11 | 4.02 | 0.24 |
-| zephyr_7b_beta | 32000 | 2.36 | 2.3 | 0.42 |
-
-
-**Conclusion**
-larger vocabulary sizes
 
 
````
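The table units mix binary byte prefixes with decimal token counts: the values are consistent with `g_bytes` meaning GiB (1024³ bytes), `t_bytes` meaning TiB (1024⁴ bytes), and `b_tokens`/`t_tokens` meaning 10⁹/10¹² tokens. A minimal sketch that derives the table columns from the raw counts stored in `stats/compress_rate/*.json` (the helper name `compress_rate` is illustrative; the repo's actual implementation lives in `utils/compress_rate_util.py` and may differ):

```python
def compress_rate(stats: dict) -> dict:
    """Derive the README table columns from raw corpus counts.

    `stats` has the shape of the stats/compress_rate/*.json files:
    n_bytes (corpus size in bytes), n_tokens (tokens after tokenization),
    n_chars (character count).
    """
    n_bytes, n_tokens, n_chars = stats["n_bytes"], stats["n_tokens"], stats["n_chars"]
    g_bytes, t_bytes = n_bytes / 1024**3, n_bytes / 1024**4   # GiB, TiB
    b_tokens, t_tokens = n_tokens / 1e9, n_tokens / 1e12      # billions, trillions
    return {
        "g_bytes/b_tokens": round(g_bytes / b_tokens, 2),
        "b_tokens/g_bytes": round(b_tokens / g_bytes, 2),
        "t_bytes/t_tokens": round(t_bytes / t_tokens, 2),
        "t_tokens/t_bytes": round(t_tokens / t_bytes, 2),
        "n_chars/n_tokens": round(n_chars / n_tokens, 2),
    }

# Counts from stats/compress_rate/amber.zh-Hans.json in this commit:
amber_zh = {"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 1330093, "n_chars": 927311}
print(compress_rate(amber_zh))
# matches the amber row of the zh-Hans table: 1.84, 0.54, 1.8, 0.56, 0.7
```

Note that `g_bytes/b_tokens` and `t_bytes/t_tokens` are both "bytes per token" up to a constant (10⁹/1024³ ≈ 0.93 vs 10¹²/1024⁴ ≈ 0.91), which is why the two columns track each other closely.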
app.py
CHANGED
````diff
@@ -78,13 +78,13 @@ with gr.Blocks(css="css/style.css", title="Tokenizer Arena") as demo:
     gr.Markdown("Please select corpus and unit of compress rate, get more details at [github](https://github.com/xu-song/tokenizer-arena/). ")
     with gr.Row():
         compress_rate_corpus = gr.CheckboxGroup(
-            ["cc100-en", "cc100-zh-Hans", "cc100-es", "code"],
+            ["cc100-en", "cc100-zh-Hans", "cc100-es"],  # , "code"
            value=["cc100-en", "cc100-zh-Hans"],
            label="corpus",
            # info=""
        )
        compress_rate_unit = gr.Radio(
-            ["b_tokens/g_bytes", "g_bytes/b_tokens", "t_tokens/t_bytes", "t_bytes/t_tokens"],
+            ["b_tokens/g_bytes", "g_bytes/b_tokens", "t_tokens/t_bytes", "t_bytes/t_tokens", "n_chars/n_tokens"],
            value="b_tokens/g_bytes",
            label="unit",
        )
````
stats/README.md
ADDED
File without changes
stats/compress_rate/amber.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 294627, "n_chars": 1121360}

stats/compress_rate/amber.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 1330093, "n_chars": 927311}

stats/compress_rate/aya_101.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 250100, "n_bytes": 1124813, "n_tokens": 317881, "n_chars": 1121360}

stats/compress_rate/aya_101.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 250100, "n_bytes": 2633047, "n_tokens": 631182, "n_chars": 927311}

stats/compress_rate/baichuan.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 64000, "n_bytes": 1124813, "n_tokens": 280108, "n_chars": 1121360}

stats/compress_rate/baichuan.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 64000, "n_bytes": 2633047, "n_tokens": 626117, "n_chars": 927311}

stats/compress_rate/baichuan2.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 125696, "n_bytes": 1124813, "n_tokens": 269011, "n_chars": 1121360}

stats/compress_rate/baichuan2.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 125696, "n_bytes": 2633047, "n_tokens": 541464, "n_chars": 927311}

stats/compress_rate/bert_base_cased.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 28996, "n_bytes": 1124813, "n_tokens": 288022, "n_chars": 1121360}

stats/compress_rate/bert_base_cased.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 28996, "n_bytes": 2633047, "n_tokens": 899709, "n_chars": 927311}

stats/compress_rate/bert_base_chinese.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 21128, "n_bytes": 1124813, "n_tokens": 377068, "n_chars": 1121360}

stats/compress_rate/bert_base_chinese.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 21128, "n_bytes": 2633047, "n_tokens": 896599, "n_chars": 927311}

stats/compress_rate/bert_base_uncased.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 30522, "n_bytes": 1124813, "n_tokens": 280575, "n_chars": 1121360}

stats/compress_rate/bert_base_uncased.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 30522, "n_bytes": 2633047, "n_tokens": 898554, "n_chars": 927311}

stats/compress_rate/bloom.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 250680, "n_bytes": 1124813, "n_tokens": 257405, "n_chars": 1121360}

stats/compress_rate/bloom.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 250680, "n_bytes": 2633047, "n_tokens": 573008, "n_chars": 927311}

stats/compress_rate/byt5_small.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 256, "n_bytes": 1124813, "n_tokens": 1134813, "n_chars": 1121360}

stats/compress_rate/byt5_small.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 256, "n_bytes": 2633047, "n_tokens": 2643047, "n_chars": 927311}

stats/compress_rate/character_glm_6b.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 64794, "n_bytes": 1124813, "n_tokens": 289347, "n_chars": 1121360}

stats/compress_rate/character_glm_6b.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 64794, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}

stats/compress_rate/chatglm2_6b.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 64794, "n_bytes": 1124813, "n_tokens": 289329, "n_chars": 1121360}

stats/compress_rate/chatglm2_6b.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 64794, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}

stats/compress_rate/chatglm3_6b.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 64798, "n_bytes": 1124813, "n_tokens": 289347, "n_chars": 1121360}

stats/compress_rate/chatglm3_6b.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 64798, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}

stats/compress_rate/chatglm_6b.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 150344, "n_bytes": 1124813, "n_tokens": 284761, "n_chars": 1121360}

stats/compress_rate/chatglm_6b.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 150344, "n_bytes": 2633047, "n_tokens": 527384, "n_chars": 927311}

stats/compress_rate/chatyuan_large_v2.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32128, "n_bytes": 1124813, "n_tokens": 536033, "n_chars": 1121360}

stats/compress_rate/chatyuan_large_v2.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32128, "n_bytes": 2633047, "n_tokens": 564905, "n_chars": 927311}

stats/compress_rate/chinese_llama.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 49953, "n_bytes": 1124813, "n_tokens": 291514, "n_chars": 1121360}

stats/compress_rate/chinese_llama.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 49953, "n_bytes": 2633047, "n_tokens": 623219, "n_chars": 927311}

stats/compress_rate/chinese_llama2.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 55296, "n_bytes": 1124813, "n_tokens": 294627, "n_chars": 1121360}

stats/compress_rate/chinese_llama2.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 55296, "n_bytes": 2633047, "n_tokens": 625766, "n_chars": 927311}

stats/compress_rate/code_davinci_002.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 50281, "n_bytes": 1124813, "n_tokens": 258403, "n_chars": 1121360}

stats/compress_rate/code_davinci_002.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 50281, "n_bytes": 2633047, "n_tokens": 1876809, "n_chars": 927311}

stats/compress_rate/crystal_coder.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 284627, "n_chars": 1121360}

stats/compress_rate/crystal_coder.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 1320093, "n_chars": 927311}

stats/compress_rate/dbrx_instruct.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 100277, "n_bytes": 1124813, "n_tokens": 254985, "n_chars": 1121360}

stats/compress_rate/dbrx_instruct.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 100277, "n_bytes": 2633047, "n_tokens": 1084939, "n_chars": 927311}

stats/compress_rate/deepseek_coder_33b_instruct.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 287408, "n_chars": 1121360}

stats/compress_rate/deepseek_coder_33b_instruct.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 720577, "n_chars": 927311}

stats/compress_rate/deepseek_llm_7b_base.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 100000, "n_bytes": 1124813, "n_tokens": 272324, "n_chars": 1121360}

stats/compress_rate/deepseek_llm_7b_base.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 100000, "n_bytes": 2633047, "n_tokens": 605081, "n_chars": 927311}

stats/compress_rate/falcon_180b.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 65024, "n_bytes": 1124813, "n_tokens": 262509, "n_chars": 1121360}

stats/compress_rate/falcon_180b.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 65024, "n_bytes": 2633047, "n_tokens": 1124681, "n_chars": 927311}

stats/compress_rate/falcon_7b.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 65024, "n_bytes": 1124813, "n_tokens": 262509, "n_chars": 1121360}

stats/compress_rate/falcon_7b.zh-Hans.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 65024, "n_bytes": 2633047, "n_tokens": 1124681, "n_chars": 927311}

stats/compress_rate/fastchat_t5_3b.en.json
ADDED
@@ -0,0 +1 @@
+{"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 484941, "n_chars": 1121360}
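The added stats files all share a one-line schema, and the corpus counts are fixed per language (`n_bytes` is 1124813 for cc100-en and 2633047 for cc100-zh-Hans; only `n_tokens` varies by tokenizer). A sketch for collecting the files into a table keyed by tokenizer and language (the `stats/compress_rate/<tokenizer>.<lang>.json` layout is taken from this diff; the `load_stats` loader itself is hypothetical, not the repo's API):

```python
import glob
import json
import os

def load_stats(stats_dir="stats/compress_rate"):
    """Group stats/compress_rate/<tokenizer>.<lang>.json files by tokenizer and language."""
    table = {}
    for path in sorted(glob.glob(os.path.join(stats_dir, "*.json"))):
        stem = os.path.basename(path)[: -len(".json")]   # e.g. "amber.zh-Hans"
        tokenizer, lang = stem.rsplit(".", 1)            # ("amber", "zh-Hans")
        with open(path) as f:
            table.setdefault(tokenizer, {})[lang] = json.load(f)
    return table
```

With the files above, `load_stats()["amber"]["zh-Hans"]["n_tokens"]` would give 1330093, ready for the per-unit conversion the app offers.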