## Are both tokenizer.json and tokenizer.model needed?

## Completeness

The following 256 byte-fallback tokens guarantee completeness of the vocabulary: any byte sequence that cannot be matched by a regular token can still be encoded byte by byte.

```json
"vocab": {
    "<0x00>": 3,
    "<0x01>": 4,
    ...
    "<0xFE>": 257,
    "<0xFF>": 258,
```

## Normalizer, post-processor, and decoder

```json
"normalizer": {
    "type": "Sequence",
    "normalizers": [
        { "type": "Prepend", "prepend": "▁" },
        { "type": "Replace", "pattern": { "String": " " }, "content": "▁" }
    ]
},
"post_processor": {
    "type": "TemplateProcessing",
    "single": [
        { "SpecialToken": { "id": "<s>", "type_id": 0 } },
        { "Sequence": { "id": "A", "type_id": 0 } }
    ],
    "pair": [
        { "SpecialToken": { "id": "<s>", "type_id": 0 } },
        { "Sequence": { "id": "A", "type_id": 0 } },
        { "Sequence": { "id": "B", "type_id": 0 } }
    ],
    "special_tokens": {
        "<s>": { "id": "<s>", "ids": [ 1 ], "tokens": [ "<s>" ] }
    }
},
"decoder": {
    "type": "Sequence",
    "decoders": [
        { "type": "Replace", "pattern": { "String": "▁" }, "content": " " },
        { "type": "ByteFallback" },
        { "type": "Fuse" },
        { "type": "Strip", "content": " ", "start": 1, "stop": 0 }
    ]
},
```

## Issues

1. https://github.com/LianjiaTech/BELLE/issues/45
   LLaMA only explicitly supports around 700 Chinese characters, but the number of Chinese characters it supports implicitly through Unicode byte fallback is far larger; you can test this with any BERT vocabulary. The annoying part is that such a character gets encoded into 4 or 5 byte tokens, so sequence length shoots up immediately. The Chinese vocabulary extension from HIT (Harbin Institute of Technology) is therefore the more reliable route.
2. https://github.com/LianjiaTech/BELLE/issues/43
   Q: Does using LLaMA on Chinese require any extra work on the vocabulary?
   A: It should. I checked the intersection between the LLaMA vocabulary and the 3,500 most common Chinese characters; only 600-odd are covered. For extending the vocabulary, see https://github.com/ymcui/Chinese-LLaMA-Alpaca. A rough sketch of both checks follows below.
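As a minimal sketch of the two points above — byte fallback inflating sequence length, and limited explicit coverage of common Chinese characters — the snippet below loads a LLaMA tokenizer and inspects how characters are split. The checkpoint name `huggyllama/llama-7b` and the file `common_3500.txt` are illustrative assumptions; substitute your own checkpoint and character list.

```python
# Minimal sketch, assuming a local or hub copy of the original LLaMA tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint

# 1) Byte fallback: a Chinese character missing from the 32k vocab is split into
#    several <0xNN> byte tokens, so sequence length grows quickly.
text = "魑魅魍魉"  # uncommon characters, likely not covered by dedicated tokens
ids = tok.encode(text, add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))
# expected shape of output: ['▁', '<0xE9>', '<0xAD>', '<0x91>', ...]

# 2) Coverage: count how many characters from a list of common Chinese characters
#    get their own token, i.e. do not fall back to byte tokens.
common_chars = open("common_3500.txt", encoding="utf-8").read().strip()  # assumed file
covered = sum(
    1
    for ch in common_chars
    if all(not piece.startswith("<0x") for piece in tok.tokenize(ch))
)
print(f"{covered} / {len(common_chars)} common characters have a dedicated token")
```

If the coverage count comes out in the low hundreds, that matches the observation in issue #43 and is the usual motivation for the Chinese-LLaMA-Alpaca vocabulary extension.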