---
license: mit
---
## GPT-2 Tokenizer with unmerged digits
A fork of the GPT-2 tokenizer, which **removes multi-digit tokens**:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokenizer('123.45').input_ids       # [16, 17, 18, 13, 19, 20]
gpt2_tokenizer('123.45').input_ids  # [10163, 13, 2231]
```
Backward-compatible:
```python
tokenizer.decode([10163, 46387]) # '<unused123> pigeon'
gpt2_tokenizer.decode([10163, 46387]) # '123 pigeon'
```
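The same effect can be approximated on top of any BPE tokenizer by pre-splitting digit runs before tokenization, so that no multi-digit merge can ever apply. A minimal pure-Python sketch of that pre-step (the helper name `presplit_digits` is hypothetical, not part of this repo):

```python
import re

def presplit_digits(text: str) -> list[str]:
    # Break the text into single digits and maximal non-digit runs.
    # Feeding these chunks to a BPE tokenizer separately guarantees
    # that multi-digit tokens can never be formed.
    return re.findall(r'\d|\D+', text)

presplit_digits('123.45')  # ['1', '2', '3', '.', '4', '5']
```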
- This is for my investigations into the arithmetic capabilities of large language models. There is no model here, only a tokenizer.
- [PaLM](https://arxiv.org/abs/2204.02311) does this. I think it's very reasonable.
- Many models (most notably, [GPT-3](https://arxiv.org/abs/2005.14165)) don't do this, because they inherit the GPT-2 tokenizer.