```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(
    'ocisd4/openllama_tokenizer_ext_zh',
    add_bos_token=True,
    add_eos_token=False,
    use_auth_token=True,
)

print('vocab size:', tokenizer.vocab_size)
# vocab size: 52928

text = '今天天氣真好！'
print(tokenizer.tokenize(text))
# ['▁', '今天', '天氣', '真', '好', '<0xEF>', '<0xBC>', '<0x81>']

print(tokenizer.encode(text))
# [1, 31822, 32101, 32927, 45489, 45301, 242, 191, 132]

print(tokenizer.decode(tokenizer.encode(text)))
# 今天天氣真好！
```

**Note:**
- The first token may be a lone whitespace piece (`▁`) in LlamaTokenizer.
- OpenLLaMA's tokenizer is incompatible with the original LLaMA tokenizer.
- This tokenizer encodes consecutive spaces as a single space (both behaviours are demonstrated in the sketch below).

### Updated
#### 2023-06-02
- Added special tokens: `<|pad|>`, `<|output|>`, `<|input|>`, `<|sep|>`, `<|emb|>`, `<|rwd|>`, `<|ctx|>`
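
The two tokenizer quirks noted above can be checked directly. The snippet below is a minimal sketch, assuming the same checkpoint is reachable; the exact token splits and ids may differ across tokenizer versions.

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(
    'ocisd4/openllama_tokenizer_ext_zh',
    add_bos_token=True,
    add_eos_token=False,
    use_auth_token=True,
)

# Leading whitespace: SentencePiece prepends a '▁' piece to the first token.
print(tokenizer.tokenize('今天'))  # expect a leading '▁' entry

# Space collapsing: inputs that differ only in the run length of spaces
# should decode to the same string if consecutive spaces collapse to one.
single = tokenizer.decode(tokenizer.encode('a b'), skip_special_tokens=True)
multi = tokenizer.decode(tokenizer.encode('a    b'), skip_special_tokens=True)
print(single == multi)  # expect True
```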
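
To confirm that the special tokens from the 2023-06-02 update are registered, you can look up their ids. This is a hedged sketch that assumes the tokens ship in the checkpoint's tokenizer config.

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(
    'ocisd4/openllama_tokenizer_ext_zh',
    use_auth_token=True,
)

special = ['<|pad|>', '<|output|>', '<|input|>', '<|sep|>', '<|emb|>', '<|rwd|>', '<|ctx|>']
for tok in special:
    tid = tokenizer.convert_tokens_to_ids(tok)
    # Getting the unknown-token id back would mean the token is not registered.
    print(tok, tid, tid != tokenizer.unk_token_id)
```

Note that in transformers, added tokens are typically not counted in `tokenizer.vocab_size` (use `len(tokenizer)` for the full size), so their ids may sit above the 52928 reported earlier.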