## GPT-NeoX-20B vs GPT-2

![](/images/gptNeoX20B-VS-gpt2.jpg)

## 20B

[configs/20B.yml](https://github.com/EleutherAI/gpt-neox/blob/main/configs/20B.yml#L7)

```
"vocab-file": "./20B_checkpoints/20B_tokenizer.json",
```

Vocab size: 50277, padded to `self.padded_vocab_size = 50304`. The training log reads `padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)`: the vocabulary is rounded up so it partitions evenly under model parallelism (see the padding sketch at the end of this page).

## vocabulary

See convert_vocab_to_txt.py.

```sh
{"id": 13609, "token": "\u00e4\u00b8\u0143", "token_decode": "\u4e2d"}   中

# several symbols concatenated into a single token:
{"id": 13663, "token": ".*]{}", "token_decode": ".*]{}"}   .*]{}

# basic bytes: (\u0021-\u007E) + (\u00A1-\u0143)
```

## special_tokens

https://huggingface.co/EleutherAI/gpt-neox-20b/blob/main/special_tokens_map.json

```
{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>"}
```

https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py

```
unk_token="<|endoftext|>",
bos_token="<|endoftext|>",
eos_token="<|endoftext|>",
```

## Chinese support

Essentially no OOV. GPT-NeoX-20B was trained on an 800GB English dataset (The Pile), so why does the vocabulary cover Chinese? Because the tokenizer is a byte-level BPE: any UTF-8 string can be encoded at the byte level, so nothing is truly out-of-vocabulary. Most Chinese characters simply split into two or three byte-level tokens:

```
丁 [3218, 212]
七 [3218, 214]
万 [3218, 218]
诀 [11894, 211]
证 [11894, 212]
```

Encoding-length statistics over 5,770 characters: Counter({2: 4190, 3: 1295, 1: 285})

Average encoding length: 2.1750433275563257 (see the reproduction sketch at the end of this page)

## completeness

```
```

## build tokenizer

## merge

An entry from the merges list, e.g. `"ard less",`

## HF format

https://huggingface.co/EleutherAI/gpt-neox-20b/tree/main
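## sketch: vocab padding

A minimal sketch of the padding arithmetic behind the training log in the 20B section above. The function name `pad_vocab_size` is hypothetical; in GPT-NeoX/Megatron the multiple is `make-vocab-size-divisible-by` × model-parallel size, and `multiple=128` here is an assumption chosen because it reproduces the logged numbers.

```python
def pad_vocab_size(orig_vocab_size: int, multiple: int = 128) -> int:
    # Round the vocabulary size up to the nearest multiple, the way
    # Megatron-style training pads it for GPU-friendly partitioning.
    padded = orig_vocab_size
    while padded % multiple != 0:
        padded += 1
    return padded

padded = pad_vocab_size(50277)
print(padded, padded - 50277)  # 50304 27 -- matches the log line
```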
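## sketch: byte-to-unicode mapping

The "basic bytes" ranges noted in the vocabulary section come from GPT-2's `bytes_to_unicode` table, which the NeoX tokenizer reuses: printable bytes map to themselves, and the remaining bytes are shifted to U+0100 and above (the highest being U+0143). The sketch below reproduces the vocab entry for 中 (id 13609).

```python
def bytes_to_unicode():
    # GPT-2's scheme: printable bytes map to themselves; every other
    # byte b is assigned chr(256 + n) so it has a visible stand-in.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte2char = bytes_to_unicode()
print("".join(byte2char[b] for b in "中".encode("utf-8")))
# ä¸Ń  (\u00e4\u00b8\u0143): UTF-8 bytes e4 b8 ad; 0xad is unprintable,
# so it is shifted to chr(256 + 67) == "\u0143".
```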
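## sketch: encoding-length statistics

A sketch of how the Chinese encoding-length statistics above could be reproduced with the Hugging Face tokenizer. The five-character `chars` list is a placeholder; the original numbers were computed over 5,770 characters.

```python
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Placeholder sample: substitute a real list of common Chinese characters.
chars = ["丁", "七", "万", "诀", "证"]

lengths = Counter(len(tok.encode(c)) for c in chars)
avg = sum(k * v for k, v in lengths.items()) / sum(lengths.values())
print(lengths)  # e.g. Counter({2: 5}) for this placeholder sample
print(avg)
```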