
tokenizer-arena/vocab/gpt_nexo_20b/test_zh_coding_len.py


	"""
	1. Chinese entries in jd_vocab_tokens:
	Encoded-length counts: Counter({2: 4190, 3: 1295, 1: 285})
	Average encoded length: 2.1750433275563257


	2. Chinese punctuation:
	Encoded-length counts: Counter({2: 55, 1: 23, 3: 3})
	Average encoded length: 1.7530864197530864

	3. All Chinese (single characters), Unicode:
	Encoded-length counts: Counter({2: 13342, 3: 7257, 1: 302})
	Average encoded length: 2.3327591981244917


	4. All Chinese ():
	Chinese characters: 313, Chinese punctuation marks: 86
	"""
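
The file records only the results, not the code that produced them. A minimal sketch of how such figures could be computed follows, assuming the tokenizer is available as a callable that maps a string to a list of token ids. The names `encoding_length_stats`, `mean_length`, and `utf8_byte_encode` are hypothetical and do not come from the original repository; `utf8_byte_encode` is only a stand-in so the example runs without the real gpt_neox_20b tokenizer.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def mean_length(counts: Counter) -> float:
    """Weighted mean of an {encoded_length: occurrences} histogram."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(length * n for length, n in counts.items()) / total


def encoding_length_stats(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
) -> Tuple[Counter, float]:
    """Histogram of token counts per text, plus the average token count."""
    counts = Counter(len(encode(text)) for text in texts)
    return counts, mean_length(counts)


def utf8_byte_encode(text: str) -> List[int]:
    # Stand-in encoder (an assumption, not the real tokenizer):
    # one "token" per UTF-8 byte.
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    counts, average = encoding_length_stats(["中", "，", "a"], utf8_byte_encode)
    print("Encoded-length counts:", counts)
    print("Average encoded length:", average)
```

Passing one of the recorded histograms to `mean_length`, for example `Counter({2: 4190, 3: 1295, 1: 285})`, gives a quick consistency check against the average listed beside it.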