---
library_name: transformers
license: apache-2.0
---

# claude3 tokenizer


A tokenizer for autoregressive/causal language modeling, loadable via `transformers` as a `GPT2TokenizerFast` (65,000-token vocabulary, 200,000-token model max length).


```python
from transformers import AutoTokenizer

# load the tokenizer from the hub
tk = AutoTokenizer.from_pretrained("BEE-spoke-data/claude-tokenizer")
tk
```

```
GPT2TokenizerFast(name_or_path='BEE-spoke-data/claude-tokenizer', vocab_size=65000, model_max_length=200000, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<EOT>', 'eos_token': '<EOT>', 'unk_token': '<EOT>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
        0: AddedToken("<EOT>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("<META>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("<META_START>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("<META_END>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        4: AddedToken("<SOS>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


In [4]: tk.eos_token_id
Out[4]: 0

In [5]: tk.pad_token_id

In [6]: tk.unk_token_id
Out[6]: 0

In [7]: tk.bos_token_id
Out[7]: 0
```
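
Below is a minimal usage sketch (the example strings are arbitrary). Since `pad_token_id` is unset out of the box (see the output above), one common approach is to reuse `<EOT>` as the pad token before any batched, padded encoding.

```python
from transformers import AutoTokenizer

tk = AutoTokenizer.from_pretrained("BEE-spoke-data/claude-tokenizer")

# pad_token_id is None by default; reuse the <EOT> token for padding
tk.pad_token = tk.eos_token

# round-trip a single string: text -> token ids -> text
ids = tk("hello from the claude tokenizer")["input_ids"]
print(ids)
print(tk.decode(ids))

# batched encoding, padded to the longest sequence in the batch
batch = tk(["short text", "a somewhat longer piece of text"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```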