## vocab_size inconsistency problem

- `.vocab_size` - size of the base vocabulary (without the added tokens) - from https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html
- `len(tokenizer)` - size of the full vocabulary, including the added tokens - https://github.com/huggingface/transformers/issues/12632
- `max(tokenizer.get_vocab().values())` - highest token id, which also covers non-contiguous token_ids - https://github.com/huggingface/transformers/issues/4875
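
A minimal sketch of how the three values can diverge, assuming a standard Hugging Face checkpoint such as `bert-base-uncased` and a hypothetical added token `<my_new_token>` (not part of the original notes):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Add a new special-purpose token so the counts diverge (hypothetical token).
tokenizer.add_tokens(["<my_new_token>"])

print(tokenizer.vocab_size)                     # base vocabulary only, added tokens excluded
print(len(tokenizer))                           # full vocabulary, added tokens included
print(max(tokenizer.get_vocab().values()) + 1)  # highest token id + 1; robust if ids are non-contiguous
```

When resizing a model's embedding matrix, `len(tokenizer)` (or the max id + 1, if ids are non-contiguous) is the relevant count, not `vocab_size`.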