

## Vocab size inconsistency


- tokenizer.vocab_size
  - Size of the base vocabulary (without the added tokens)
  - From https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html
- len(tokenizer)
  - Size of the full vocabulary with the added tokens.
  - https://github.com/huggingface/transformers/issues/12632
- max(tokenizer.get_vocab().values())
  - Includes non-contiguous token IDs: when there are gaps, the maximum ID can exceed the vocabulary count
  - https://github.com/huggingface/transformers/issues/4875
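
A toy sketch of why the three measures can disagree. It mimics the real calls (`tokenizer.vocab_size`, `len(tokenizer)`, `max(tokenizer.get_vocab().values())`) with plain dicts rather than an actual tokenizer, so the token names and IDs below are made up for illustration:

```python
# Toy illustration of the three vocab-size measures; plain dicts stand in
# for a HuggingFace tokenizer. The vocab contents here are hypothetical.

base_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}  # base vocabulary
added_tokens = {"<new_tok>": 5}  # added token whose ID leaves a gap (id 4 unused)

vocab_size = len(base_vocab)                     # like tokenizer.vocab_size  -> 4
full_size = len(base_vocab) + len(added_tokens)  # like len(tokenizer)        -> 5

full_vocab = {**base_vocab, **added_tokens}
max_id = max(full_vocab.values())                # like max(get_vocab().values()) -> 5

# Because of the gap, max_id + 1 (= 6) exceeds full_size (= 5):
# an embedding matrix sized by len(tokenizer) would be too small
# to index every token ID in this vocabulary.
print(vocab_size, full_size, max_id)
```

The practical upshot: when resizing an embedding layer after adding tokens, size it to cover the maximum token ID, not just the vocabulary count, if the IDs may be non-contiguous.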