---
license: mit
---

A GPT2 tokenizer for English and German with a vocabulary size of 88,301.

This tokenizer is created by merging the [original GPT2](https://huggingface.co/gpt2) tokenizer (English) with a [German tokenizer](https://huggingface.co/malteos/gpt2-xl-wechsel-german).

## Steps to reproduce

```python
from transformers import AutoTokenizer

a_tokenizer = AutoTokenizer.from_pretrained('gpt2')
b_tokenizer = AutoTokenizer.from_pretrained('malteos/gpt2-xl-wechsel-german')

a_vocab = set(a_tokenizer.vocab.keys())  # len(a_vocab)=50257
b_vocab = set(b_tokenizer.vocab.keys())  # len(b_vocab)=50257

missing_tokens_in_a = b_vocab - a_vocab  # tokens in the German vocab that GPT2 lacks; len = 38044

a_tokenizer.add_tokens(list(missing_tokens_in_a))

a_tokenizer.save_pretrained('opengptx-en-de')  # len(a_tokenizer) = 88301
```
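
To use the merged tokenizer with an existing GPT2 checkpoint, the model's embedding matrix has to be resized to the new vocabulary size. A minimal sketch, assuming the tokenizer was saved locally under `opengptx-en-de` as above:

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('opengptx-en-de')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Grow the embedding matrix to cover the 38,044 added German tokens
model.resize_token_embeddings(len(tokenizer))  # new size: 88301

# German text now segments using the added vocabulary instead of
# being split into many short English BPE pieces
print(tokenizer.tokenize('Das ist ein Beispielsatz.'))
```

Note that the embedding rows for the added tokens are freshly initialized, so the model needs further training before it produces sensible German output.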