```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(
    'ocisd4/openllama_tokenizer_ext_zh',
    add_bos_token=True,
    add_eos_token=False,
    use_auth_token=True,
)

print('vocab size:', tokenizer.vocab_size)
# vocab size: 52928

text = '今天天氣真好!'

print(tokenizer.tokenize(text))
# ['▁', '今天', '天氣', '真', '好', '<0xEF>', '<0xBC>', '<0x81>']

print(tokenizer.encode(text))
# [1, 31822, 32101, 32927, 45489, 45301, 242, 191, 132]

print(tokenizer.decode(tokenizer.encode(text)))
# 今天天氣真好!</s>
```
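The trailing `<0xEF>`, `<0xBC>`, `<0x81>` tokens in the example above are SentencePiece byte-fallback tokens: the full-width exclamation mark `!` (U+FF01) is not in the vocabulary as a single piece, so it is split into its three raw UTF-8 bytes. A minimal illustration of how those bytes reassemble into the character (plain Python, no tokenizer required):

```python
# '!' (U+FF01) is emitted as one byte-fallback token per UTF-8 byte.
fallback_bytes = bytes([0xEF, 0xBC, 0x81])

# Decoding the byte sequence recovers the original character.
print(fallback_bytes.decode('utf-8'))  # !
```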

**Note:**
 - `LlamaTokenizer` may prepend a whitespace token (`▁`) as the first token.
 - Open LLaMA's tokenizer is incompatible with the original LLaMA tokenizer.
 - This tokenizer encodes runs of consecutive spaces as a single space.


### Updates
#### 2023-06-02
  - Added special tokens: `<|pad|>`, `<|output|>`, `<|input|>`, `<|sep|>`, `<|emb|>`, `<|rwd|>`, `<|ctx|>`