---
license: mit
---

A GPT2 tokenizer for English and German with a vocabulary size of 88,301.

This tokenizer was created by merging the [original GPT2](https://huggingface.co/gpt2) tokenizer (English) with a [German tokenizer](https://huggingface.co/malteos/gpt2-xl-wechsel-german).

## Steps to reproduce

```python
from transformers import AutoTokenizer

# Load the original English GPT2 tokenizer and the German tokenizer
a_tokenizer = AutoTokenizer.from_pretrained('gpt2')
b_tokenizer = AutoTokenizer.from_pretrained('malteos/gpt2-xl-wechsel-german')

a_vocab = set(a_tokenizer.vocab.keys())  # len(a_vocab) = 50257
b_vocab = set(b_tokenizer.vocab.keys())  # len(b_vocab) = 50257

# Tokens in the German vocabulary that are missing from the English one
missing_tokens_in_a = b_vocab - a_vocab  # len = 38044

# Add the missing German tokens to the English tokenizer
a_tokenizer.add_tokens(list(missing_tokens_in_a))

# Save the merged tokenizer
a_tokenizer.save_pretrained('opengptx-en-de')  # len(a_tokenizer) = 88301
```
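
The merged tokenizer can then be loaded like any other Hugging Face tokenizer. A minimal sketch, assuming the local `opengptx-en-de` directory produced above (or the corresponding Hub repo id); the example sentences are illustrative only:

```python
from transformers import AutoTokenizer

# Load the merged English/German tokenizer saved in the previous step
tokenizer = AutoTokenizer.from_pretrained('opengptx-en-de')

# Both English and German text are covered by the merged vocabulary
print(tokenizer.tokenize('Hello, how are you?'))
print(tokenizer.tokenize('Hallo, wie geht es dir?'))
```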