tokenizer-arena / README.md

xu-song

update

97354e0 5 months ago


	---
	title: Tokenizer Arena
	emoji: ⚡
	colorFrom: red
	colorTo: gray
	sdk: gradio
	sdk_version: 4.31.4
	app_file: app.py
	pinned: false
	datasets:
	- cc100
	---



	## Compression Rate


	Using the [cc-100](https://huggingface.co/datasets/cc100) dataset, 10,000 samples are taken per language to measure the compression rate of different tokenizers.

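	> Compression rate example:
	> llama3 expanded its vocabulary and therefore achieves a higher compression ratio. For the same 1 TB of Simplified Chinese text, llama produces 0.56 trillion tokens after tokenization, while llama3 needs only 0.31 trillion.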

	| tokenizer | vocab_size | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
	|:----------|-----------:|-----------------:|-----------------:|-----------------:|
	| llama     |      32000 |              1.8 |             0.56 |              0.7 |
	| llama3    |     128000 |              3.2 |             0.31 |             1.24 |
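The two `t_` columns are reciprocals of each other, which is where the token counts in the example come from. A minimal sketch of that arithmetic follows; the helper name `tokens_for_corpus` is hypothetical, and the ratios are copied from the table above.

```python
def tokens_for_corpus(corpus_t_bytes: float, t_bytes_per_t_tokens: float) -> float:
    """Return the token count (in trillions) for a corpus of the given size.

    corpus_t_bytes: corpus size in the same byte unit used by the table.
    t_bytes_per_t_tokens: the t_bytes/t_tokens ratio from the table.
    """
    return corpus_t_bytes / t_bytes_per_t_tokens


# Ratios taken from the Simplified Chinese example table.
ratios = {"llama": 1.8, "llama3": 3.2}

for name, ratio in ratios.items():
    t_tokens = tokens_for_corpus(1.0, ratio)
    print(f"{name}: {t_tokens:.2f} trillion tokens per 1T bytes")

# Relative saving of llama3 over llama on the same corpus.
saving = 1 - ratios["llama"] / ratios["llama3"]
print(f"llama3 needs {saving:.0%} fewer tokens than llama")
```

Rounded to two decimals, the results match the 0.56 and 0.31 trillion tokens quoted in the example.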

	The results can be reproduced with the following script:
	```sh
	python utils/compress_rate_util.py
	```
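The script's contents are not shown in this README. As a rough illustration of what such a measurement involves, the sketch below computes the table's columns from raw byte, character, and token counts. The function `compression_metrics` and the whitespace-based `toy_encode` are hypothetical stand-ins, not the repository's code. The unit convention (binary units for bytes, decimal units for tokens) is an assumption: it is consistent with the ratio between the `g_` and `t_` columns in the tables but is not stated in the source.

```python
from typing import Callable, Dict, Iterable, List


def compression_metrics(texts: Iterable[str],
                        encode: Callable[[str], List]) -> Dict[str, float]:
    """Aggregate counts over a corpus and derive the ratios used in the tables."""
    n_bytes = n_chars = n_tokens = 0
    for text in texts:
        n_bytes += len(text.encode("utf-8"))  # size in UTF-8 bytes
        n_chars += len(text)                  # Unicode code points
        n_tokens += len(encode(text))         # tokens produced by the tokenizer
    if n_bytes == 0 or n_tokens == 0:
        raise ValueError("corpus is empty or produced no tokens")

    # Assumed convention: bytes in binary units, tokens in decimal units.
    g_bytes, b_tokens = n_bytes / 1024**3, n_tokens / 1000**3
    t_bytes, t_tokens = n_bytes / 1024**4, n_tokens / 1000**4
    return {
        "g_bytes/b_tokens": g_bytes / b_tokens,
        "b_tokens/g_bytes": b_tokens / g_bytes,
        "t_bytes/t_tokens": t_bytes / t_tokens,
        "t_tokens/t_bytes": t_tokens / t_bytes,
        "n_chars/n_tokens": n_chars / n_tokens,
    }


def toy_encode(text: str) -> List[str]:
    """Hypothetical tokenizer: splits on whitespace, for illustration only."""
    return text.split()


sample = ["hello world", "你好 世界"]
metrics = compression_metrics(sample, toy_encode)
for key, value in metrics.items():
    print(f"{key}: {value:.2f}")
```

With a real tokenizer, `encode` could be any callable that maps a string to a list of token ids, for example the `encode` method of a tokenizer loaded through the `transformers` library. Whether special tokens are counted affects the result, so that choice should be made explicitly.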




	<details> <summary>English compression rate</summary>
	Compression rate computed on the English dataset cc100-en.

	| tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
	|:----------|-----------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|
	| amber     |      32000 |             3.56 |             0.28 |             3.47 |             0.29 |             3.81 |
	| aya_101   |     250100 |              3.3 |              0.3 |             3.22 |             0.31 |             3.53 |
	| baichuan  |      64000 |             3.74 |             0.27 |             3.65 |             0.27 |                4 |
	| baichuan2 |     125696 |             3.89 |             0.26 |              3.8 |             0.26 |             4.17 |

	</details>


	<details> <summary>Simplified Chinese compression rate</summary>
	Compression rate computed on the Simplified Chinese dataset cc100-zh-Hans.

	| tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
	|:----------|-----------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|
	| amber     |      32000 |             1.84 |             0.54 |              1.8 |             0.56 |              0.7 |
	| aya_101   |     250100 |             3.89 |             0.26 |             3.79 |             0.26 |             1.47 |
	| baichuan  |      64000 |             3.92 |             0.26 |             3.82 |             0.26 |             1.48 |

	</details>




	## Reference

	- Getting the most out of your tokenizer for pre-training and domain adaptation
	- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
	- blog
	  - https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
	  - https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece
	  - https://www.huaxiaozhuan.com/%E5%B7%A5%E5%85%B7/huggingface_transformer/chapters/1_tokenizer.html
	  - https://zhuanlan.zhihu.com/p/652520262
	  - https://github.com/QwenLM/Qwen/blob/main/tokenization_note_zh.md
	  - https://tonybaloney.github.io/posts/cjk-chinese-japanese-korean-llm-ai-best-practices.html
	- demo
	  - https://huggingface.co/spaces/Xenova/the-tokenizer-playground
	  - https://github.com/dqbd/tiktokenizer
	  - https://chat.lmsys.org/?leaderboard
	  - https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard