Sources:

- https://github.com/THUDM/GLM/tree/main/chinese_sentencepiece
- https://huggingface.co/THUDM/glm-10b-chinese/

## HF

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b-chinese", trust_remote_code=True)
```

## Tokenizer

In tokenizer_config.json:

```
"AutoTokenizer": [
  "tokenization_glm.GLMChineseTokenizer",
  null
]
```

where GLMChineseTokenizer is defined in:

```
https://huggingface.co/THUDM/glm-10b-chinese/blob/main/tokenization_glm.py
```

## Vocabulary

From
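The `"AutoTokenizer"` entry above is the repo's `auto_map`: a two-element list of (slow tokenizer class, fast tokenizer class), where `null` means no fast tokenizer is provided. A minimal sketch of how that mapping can be read (the JSON fragment below is a hypothetical stand-in mirroring the repo's tokenizer_config.json, not the full file):

```python
import json

# Hypothetical fragment mirroring the auto_map entry in tokenizer_config.json.
config_text = """
{
  "auto_map": {
    "AutoTokenizer": ["tokenization_glm.GLMChineseTokenizer", null]
  }
}
"""

config = json.loads(config_text)
slow_cls, fast_cls = config["auto_map"]["AutoTokenizer"]
print(slow_cls)  # → tokenization_glm.GLMChineseTokenizer
print(fast_cls)  # → None: no fast (Rust-backed) tokenizer in this repo
```

With `trust_remote_code=True`, `AutoTokenizer` resolves `tokenization_glm.GLMChineseTokenizer` by downloading and importing tokenization_glm.py from the model repo, which is why that flag is required here.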