---
license: mit
---

## GPT-2 Tokenizer with unmerged digits

A fork of the GPT-2 tokenizer, which **removes multi-digit tokens**:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokenizer('123.45')['input_ids']       # [16, 17, 18, 13, 19, 20]
gpt2_tokenizer('123.45')['input_ids']  # [10163, 13, 2231]
```

Backward-compatible:
```python
tokenizer.decode([10163, 46387])  # '<unused123> pigeon'
gpt2_tokenizer.decode([10163, 46387])  # '123 pigeon'
```
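
Because the fork keeps GPT-2's vocabulary size and only repurposes the removed positions as placeholders, old token ids still round-trip through `decode`. Below is a minimal sketch of one way to check this; the digit-detection heuristic and the `<unused…>` placeholder check are assumptions based on the decode example above, not a description of how the fork was actually built:

```python
from transformers import AutoTokenizer

new_tok = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
old_tok = AutoTokenizer.from_pretrained('gpt2')

# Same vocab size, so original token ids remain valid.
assert len(new_tok) == len(old_tok)

# Heuristic: treat an original token as "multi-digit" if it is all digits
# (ignoring the 'Ġ' marker GPT-2 uses for a leading space) and has length >= 2.
for token, idx in old_tok.get_vocab().items():
    stripped = token.lstrip('Ġ')
    if stripped.isdigit() and len(stripped) >= 2:
        # Assumed placeholder format, per the decode example above.
        assert new_tok.convert_ids_to_tokens(idx).startswith('<unused')
```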

- This is for my investigations into the arithmetic capabilities of large language models. There is no model here, only a tokenizer.
- [PaLM](https://arxiv.org/abs/2204.02311) does this. I think it's very reasonable.
- Many models (most notably [GPT-3](https://arxiv.org/abs/2005.14165)) don't do this because they use the GPT-2 tokenizer.