## The vocab_size inconsistency problem
- `tokenizer.vocab_size`
  - Size of the base vocabulary (without the added tokens)
  - From https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html
- `len(tokenizer)`
  - Size of the full vocabulary, including the added tokens
  - https://github.com/huggingface/transformers/issues/12632
- `max(tokenizer.get_vocab().values())`
  - The largest token id; it can exceed `len(tokenizer) - 1` because token ids may be non-contiguous
  - https://github.com/huggingface/transformers/issues/4875
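The three measures above can all disagree. A minimal pure-Python sketch (no `transformers` dependency; the vocab contents are hypothetical) of how added tokens with a non-contiguous id make them diverge:

```python
# Hypothetical base vocab (ids 0-4) plus one added token whose id
# skips 5, mimicking a tokenizer with non-contiguous token ids.
base_vocab = {"<pad>": 0, "hello": 1, "world": 2, "foo": 3, "bar": 4}
added_tokens = {"<my_special>": 6}  # id 5 was never assigned

full_vocab = {**base_vocab, **added_tokens}

vocab_size = len(base_vocab)        # analogous to tokenizer.vocab_size -> 5
full_size = len(full_vocab)         # analogous to len(tokenizer)       -> 6
max_id = max(full_vocab.values())   # analogous to max(tokenizer.get_vocab().values()) -> 6

# An embedding matrix sized by len(tokenizer) (6 rows) could not index
# token id 6; it must have max_id + 1 rows to be safe.
print(vocab_size, full_size, max_id + 1)
```

This is why sizing a model's embedding layer by `vocab_size` or `len(tokenizer)` alone can cause out-of-range index errors; `max(...) + 1` is the only bound that accounts for gaps in the id space.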