---
license: mit
---

## GPT-2 Tokenizer with unmerged digits

A fork of the GPT-2 tokenizer, which **removes multi-digit tokens**:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')  # stock GPT-2 tokenizer, for comparison

tokenizer('123.45')['input_ids']       # [16, 17, 18, 13, 19, 20]
gpt2_tokenizer('123.45')['input_ids']  # [10163, 13, 2231]
```
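
As a quick sanity check, here is a minimal sketch that scans the vocabulary for surviving multi-digit entries (it assumes the fork drops them from the vocabulary itself; `Ġ` is GPT-2's byte-level marker for a leading space):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')

# Any vocab entry that is still a multi-digit number once the byte-level
# space marker 'Ġ' is stripped.
leftover = [
    tok for tok in tokenizer.get_vocab()
    if tok.lstrip('Ġ').isdigit() and len(tok.lstrip('Ġ')) > 1
]
print(leftover)  # expected to be empty if all multi-digit tokens were removed
```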

Backward-compatible: token IDs that the fork leaves untouched decode exactly as they do under the original GPT-2 tokenizer, while removed multi-digit tokens decode to placeholders such as `<unused123>`:

```python
tokenizer.decode([10163, 46387])      # '<unused123> pigeon'
gpt2_tokenizer.decode([10163, 46387]) # '123 pigeon'
```
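
A fuller check, as a sketch, assuming the fork keeps GPT-2's vocabulary size so that only the replaced multi-digit entries decode differently:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Same vocabulary size, same IDs.
assert len(tokenizer) == len(gpt2_tokenizer)

# IDs whose single-token decoding differs; these should be exactly the
# removed multi-digit tokens, now placeholders.
changed = [
    i for i in range(len(gpt2_tokenizer))
    if tokenizer.decode([i]) != gpt2_tokenizer.decode([i])
]
print(len(changed))
```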

- This is for my investigations into the arithmetic capabilities of large language models. There is no model here, only a tokenizer.
- [PaLM](https://arxiv.org/abs/2204.02311) does this. I think it's very reasonable.
- Many models (notably [GPT-3](https://arxiv.org/abs/2005.14165)) don't do this, because they use the GPT-2 tokenizer.