
	https://arxiv.org/abs/2308.16692 SpeechTokenizer

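	For OpenAI models, English is 8-12 times more token-efficient than Chinese.
	Previously, with Chinese prompts longer than about 300 characters, Turbo 3.5 16k would get its logic reversed; after switching the prompt to English, the problem has not recurred.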

	## Vocabulary construction

	bert vocabulary
	gpt vocabulary
	gpt-neox vocabulary

	## encode


	## decode

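	The bert vocabulary has a special marker, `##`, which prefixes a piece that continues the previous token.

	What about the gpt-neox vocabulary?
	- A leading `_` denotes a space or the start of a sentence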


	## On tokenization granularity


	## ss



	bert-chinese vocab_size: 21128
	bert-en
	clue
	glm
	chatglm
	bloom


	## Smallest vocabulary

	mobilenet


	## ss


	## bert

	```
	[PAD]
	...
	[unused99]
	[UNK]
	[CLS]
	[SEP]
	[MASK]
	<S>
	<T>
	!
	...

	big
	##ut
	ftp
	carol
	##vi
	```
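The `##` entries in the listing above are continuation pieces. As a minimal sketch of how such a vocabulary is used, assuming greedy longest-match-first WordPiece splitting (the usual description of BERT's tokenizer, not code taken from this repo), the function names and the toy vocabulary below are illustrative only. The input words are nonsense strings built from the listed entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" marker.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def wordpiece_join(pieces):
    """Glue '##' pieces onto the previous piece; separate other pieces with a space."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        else:
            text += (" " if text else "") + piece
    return text


# Toy vocabulary assembled from entries shown in the listing above.
toy_vocab = {"big", "##ut", "ftp", "carol", "##vi", "[UNK]"}

print(wordpiece_tokenize("carolvi", toy_vocab))
print(wordpiece_join(["big", "##ut", "ftp"]))
```

Joining restores word boundaries only; anything the tokenizer normalized away before splitting (casing, repeated whitespace) is not recovered by this step.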


	## @@

	https://github.com/pytorch/fairseq/blob/master/tests/test_noising.py#L37

	```
	"he@@", "llo", "n@@", "ew", "y@@", "or@@", "k"
	```

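	Similar to BERT, except that BERT marks the word suffix (the piece that continues a word), while here the marker sits on the word prefix (the piece that is continued).

	This convention probably comes from https://github.com/rsennrich/subword-nmt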
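A minimal sketch of undoing the `@@` convention, assuming a trailing `@@` means "the next token belongs to the same word". The helper name `bpe_join` is hypothetical and is not part of fairseq or subword-nmt.

```python
def bpe_join(tokens, marker="@@"):
    """Merge tokens ending in the marker with the token that follows them."""
    words = []
    current = ""
    for token in tokens:
        if token.endswith(marker):
            # Word continues: keep accumulating without the marker.
            current += token[: -len(marker)]
        else:
            # Word ends here.
            words.append(current + token)
            current = ""
    if current:
        # Dangling continuation at the end of the sequence.
        words.append(current)
    return " ".join(words)


tokens = ["he@@", "llo", "n@@", "ew", "y@@", "or@@", "k"]
print(bpe_join(tokens))  # hello new york
```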


	## GPT2

	Vocabulary: https://huggingface.co/gpt2/raw/main/vocab.json


	```
	['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
	```
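	Unlike BERT, which uses a special symbol to mark "joining", GPT2 uses a special symbol to mark a "space".

	See gpt2/README.md for details.

	- Special token: `<\|endoftext\|>` marks the end of a text, not a newline. How are tab and space encoded?
	- Many numbers are encoded as individual tokens, close to a thousand of them.

	- Similar: moss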


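	### What is Ġ

	It is a feature of byte-level BPE: an encoded space character.
	Ġ denotes a space. Some versions show Ä in place of Ġ, most likely because the UTF-8 bytes of Ġ were read as Latin-1.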
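A small sketch of what the Ġ marker amounts to in practice, using the token list from the GPT2 section. It only relies on Python string operations. The reading of Ä as mis-decoded Ġ is an assumption that the last lines make plausible, not something confirmed by the linked issues.

```python
tokens = ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

# Concatenate the pieces and turn every Ġ back into a plain space.
text = "".join(tokens).replace("Ġ", " ")
print(text)

# Ġ is U+0120, i.e. the space byte 0x20 shifted up by 256.
print(hex(ord("Ġ")), ord("Ġ") == 256 + ord(" "))

# Reading the UTF-8 bytes of Ġ as Latin-1 yields "Ä" plus a no-break space.
mojibake = "Ġ".encode("utf-8").decode("latin-1")
print(repr(mojibake))
```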


	```sh
	What's up with the tokenizer?
	# after BPE
	['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
	# after encoding with vocab.json
	[ 2061, 338, 510, 351, 262, 11241, 7509, 30]
	# after encoding with dict.txt (fairseq-specific)
	[ other numbers ]
	```
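	Question: `up` gets a Ġ, so why doesn't `What`? Because Ġ stands for the space in front of a word, and `What` starts the text with no space (no prefix) before it.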

	- https://github.com/pytorch/fairseq/issues/1716
	- https://github.com/huggingface/transformers/issues/1083
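To illustrate why `What` carries no Ġ, here is a sketch of the splitting step that runs before BPE merges. It assumes a simplified, ASCII-only variant of GPT-2's splitting regex (the real pattern uses Unicode letter and number classes, which the standard `re` module lacks). The names `ASCII_PRETOKENIZE` and `pretokenize` are made up for this example.

```python
import re

# Simplified, ASCII-only approximation of GPT-2's pre-tokenization pattern.
# An optional leading space stays attached to the word that follows it.
ASCII_PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)


def pretokenize(text):
    """Split text into chunks and show each space as the visible marker Ġ."""
    return [piece.replace(" ", "Ġ") for piece in ASCII_PRETOKENIZE.findall(text)]


print(pretokenize("What's up with the tokenizer?"))
print(pretokenize(" What"))
```

In this sketch the first word has no preceding space, so nothing is attached to it; feeding the same word with a leading space produces the Ġ form. BPE merges then operate inside each chunk, which is where `Ġtokenizer` would later be split into `Ġtoken` and `izer`.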


	## Spaces, tabs, newlines





	## reversible and lossless

	Byte-level BPE is reversible and lossless, so you can convert tokens back into the original text.
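The round trip can be sketched at the byte level. The code below is a re-implementation, written for illustration, of the byte-to-character table idea used by GPT-2's byte-level BPE: printable bytes map to themselves and every other byte is shifted to a code point at or above 256. The function names are hypothetical, and BPE merges are left out because they only group these characters without changing them.

```python
def build_byte_tables():
    """Build a one-to-one mapping between the 256 byte values and visible characters."""
    printable = set(range(ord("!"), ord("~") + 1))
    printable |= set(range(ord("¡"), ord("¬") + 1))
    printable |= set(range(ord("®"), ord("ÿ") + 1))
    byte_to_char = {}
    shift = 0
    for b in range(256):
        if b in printable:
            byte_to_char[b] = chr(b)
        else:
            # Whitespace and control bytes get a visible stand-in above U+00FF.
            byte_to_char[b] = chr(256 + shift)
            shift += 1
    char_to_byte = {c: b for b, c in byte_to_char.items()}
    return byte_to_char, char_to_byte


BYTE_TO_CHAR, CHAR_TO_BYTE = build_byte_tables()


def to_visible(text):
    """UTF-8 encode the text and map each byte to its visible character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_visible(visible):
    """Invert to_visible: map characters back to bytes and UTF-8 decode."""
    return bytes(CHAR_TO_BYTE[c] for c in visible).decode("utf-8")


sample = "up\twith 中文\n"
print(to_visible(" "), to_visible("\t"), to_visible("\n"))
print(from_visible(to_visible(sample)) == sample)
```

Because the table covers all 256 byte values one-to-one, any UTF-8 text, including spaces, tabs, newlines and Chinese characters, survives the round trip unchanged.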


	## diff