```python
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(
'ocisd4/openllama_tokenizer_ext_zh',
add_bos_token=True,
add_eos_token=False,
    use_auth_token=True,  # pass a boolean, not the string 'True'
)
print('vocab size:', tokenizer.vocab_size)
#vocab size: 52928
text = '今天天氣真好!'
print(tokenizer.tokenize(text))
#['▁', '今天', '天氣', '真', '好', '<0xEF>', '<0xBC>', '<0x81>']
print(tokenizer.encode(text))
#[1, 31822, 32101, 32927, 45489, 45301, 242, 191, 132]
print(tokenizer.decode(tokenizer.encode(text)))
# 今天天氣真好!</s>
```
**Note:**
- The first token may be a whitespace token (`▁`) in `LlamaTokenizer`.
- The OpenLLaMA tokenizer is incompatible with the original LLaMA tokenizer.
- This tokenizer encodes consecutive spaces as a single space.
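In the sample output above, the fullwidth exclamation mark (！) is not in the vocabulary, so the tokenizer falls back to raw UTF-8 bytes, emitting the byte tokens `<0xEF>`, `<0xBC>`, `<0x81>`. This can be checked without downloading the tokenizer, since the byte tokens are just the character's UTF-8 encoding:

```python
# U+FF01 (fullwidth '！') encodes to three UTF-8 bytes, which correspond
# one-to-one to the <0xEF>, <0xBC>, <0x81> byte-fallback tokens above.
utf8_bytes = '！'.encode('utf-8')
print([f'<0x{b:02X}>' for b in utf8_bytes])
# ['<0xEF>', '<0xBC>', '<0x81>']
```

Decoding stitches these byte tokens back together, which is why the round-trip in the example recovers the original `！`.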
### Updates
#### 2023-06-02
- Added special tokens: `<|pad|>`, `<|output|>`, `<|input|>`, `<|sep|>`, `<|emb|>`, `<|rwd|>`, `<|ctx|>`