	## Vocabulary construction

	- BERT vocabulary
	- GPT vocabulary
	- GPT-NeoX vocabulary

	## encode


	## decode

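	The BERT vocabulary has a special marker, `##`, which prefixes a piece that continues the previous piece.

	What about the GPT-NeoX vocabulary?
	- A leading `_` marks a space or the start of a sentence.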


	## On tokenization granularity


	## ss



	bert-chinese vocab_size: 21128
	bert-en
	clue
	glm
	chatglm
	bloom


	## bert

	```
	[PAD]
	...
	[unused99]
	[UNK]
	[CLS]
	[SEP]
	[MASK]
	<S>
	<T>
	!
	...

	big
	##ut
	ftp
	carol
	##vi
	```


	## fairseq

	https://github.com/pytorch/fairseq/blob/master/tests/test_noising.py#L37

	```
	"he@@", "llo", "n@@", "ew", "y@@", "or@@", "k"
	```

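	This is similar to BERT, except that BERT marks the continuation pieces (word suffixes) with a leading `##`, while here the non-final pieces (word prefixes) carry a trailing `@@`.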
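The two marker conventions can be compared by writing the merge step for each. This is a minimal sketch, not code from BERT or fairseq: the function names `join_wordpiece` and `join_bpe_at_at` are hypothetical, and the WordPiece input `["token", "##izer", "arena"]` is an invented example.

```python
def join_wordpiece(tokens):
    """Merge BERT-style pieces: a leading '##' glues a piece to the previous one."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the marker, append to the last word
        else:
            words.append(tok)
    return " ".join(words)


def join_bpe_at_at(tokens):
    """Merge '@@'-style pieces: a trailing '@@' glues a piece to the next one."""
    words = []
    glue_next = False  # True when the previous piece ended with '@@'
    for tok in tokens:
        continues = tok.endswith("@@")
        piece = tok[:-2] if continues else tok
        if glue_next and words:
            words[-1] += piece
        else:
            words.append(piece)
        glue_next = continues
    return " ".join(words)


# The token list from the fairseq test referenced above.
print(join_bpe_at_at(["he@@", "llo", "n@@", "ew", "y@@", "or@@", "k"]))  # hello new york
# Hypothetical WordPiece pieces, for illustration only.
print(join_wordpiece(["token", "##izer", "arena"]))  # tokenizer arena
```

In both cases the marker carries the same information (where a word boundary is absent); only the side of the piece it is attached to differs.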


	## GPT2

	Vocabulary: https://huggingface.co/gpt2/raw/main/vocab.json


	```
	['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
	```
	Unlike BERT, which uses a special symbol to mark a join between pieces, GPT2 uses a special symbol (`Ġ`) to mark a space.

	See gpt2/README.md for details.

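	- Control symbol: `<|endoftext|>` marks the end of a document, not a line break. In the byte-level vocabulary a line break appears as `Ċ`, a tab as `ĉ`, and a space as `Ġ`.
	- Many numbers are encoded as standalone tokens, close to a thousand of them.

	- Similar tokenizers: moss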
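Where `Ġ` comes from is easier to see in code. The sketch below re-implements the byte-to-character table used by GPT-2-style byte-level BPE vocabularies and uses it to turn the token strings shown above back into text. It is written from memory of the published GPT-2 encoder, so treat it as an approximation; `decode_tokens` is a hypothetical helper name, not a library function.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible unicode character.

    Printable bytes map to themselves; all remaining bytes (including
    space, tab and newline) are shifted to code points from 256 upward.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_tokens(tokens):
    """Concatenate token strings and map every character back to its byte."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8")


tokens = ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
print(decode_tokens(tokens))  # What's up with the tokenizer?

# How whitespace bytes look inside the vocabulary file.
for name, ch in [("space", " "), ("tab", "\t"), ("newline", "\n")]:
    print(name, "->", BYTE_TO_CHAR[ord(ch)])
```

Because every byte has its own character, whitespace is never dropped: it is simply renamed to a visible symbol inside the vocabulary.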

	## Spaces, tabs, and line breaks



	## reversible and lossless

	It's reversible and lossless, so you can convert tokens back into the original text.
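A round-trip check makes the property concrete. The sketch below uses two toy tokenizers that are assumptions made for illustration, not any library's API: a byte-level one that keeps every byte, and a normalizing one that lowercases and splits on whitespace.

```python
def encode_bytes(text):
    """Byte-level toy tokenizer: one token id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    return bytes(ids).decode("utf-8")


def encode_lossy(text):
    """Normalizing toy tokenizer: lowercases and splits on whitespace."""
    return text.lower().split()


def decode_lossy(tokens):
    return " ".join(tokens)


sample = "Hello\tWorld\n  你好"

# Byte-level: decoding returns exactly the original string.
print(decode_bytes(encode_bytes(sample)) == sample)

# Normalizing: case, tab, newline and repeated spaces cannot be recovered.
print(repr(decode_lossy(encode_lossy(sample))))
```

The byte-level version is reversible because no information is discarded at encode time; any tokenizer that normalizes case or whitespace before splitting gives up that guarantee.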