A GPT-2 tokenizer for English and German with a vocabulary size of 88,301.
This tokenizer was created by merging the original GPT-2 tokenizer (English) with a German tokenizer: every German token missing from the English vocabulary was added to it.
Steps to reproduce
from transformers import AutoTokenizer

# Load the English (GPT-2) and German tokenizers
a_tokenizer = AutoTokenizer.from_pretrained('gpt2')
b_tokenizer = AutoTokenizer.from_pretrained('malteos/gpt2-xl-wechsel-german')

a_vocab = set(a_tokenizer.get_vocab().keys())  # len(a_vocab) = 50257
b_vocab = set(b_tokenizer.get_vocab().keys())  # len(b_vocab) = 50257

# German tokens that the English vocabulary lacks
missing_tokens_in_a = b_vocab - a_vocab  # len = 38044
a_tokenizer.add_tokens(list(missing_tokens_in_a))
a_tokenizer.save_pretrained('opengptx-en-de')  # len(a_tokenizer) = 88301
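The size bookkeeping in the steps above can be checked without downloading anything. The following is a minimal sketch on toy data: `merge_vocabs` and the two small word lists are hypothetical stand-ins for illustration, not part of the original recipe. It shows that the merged size equals the size of the first vocabulary plus the number of tokens found only in the second.

```python
def merge_vocabs(a_vocab, b_vocab):
    """Return the tokens of b_vocab missing from a_vocab and the merged size.

    Hypothetical helper: it mirrors the set difference used in the steps
    above, but works on plain lists of strings instead of tokenizers.
    """
    a_set = set(a_vocab)
    # Sorted only so that the toy output is deterministic.
    missing = sorted(set(b_vocab) - a_set)
    return missing, len(a_set) + len(missing)


# Toy vocabularies (made up for illustration, not real tokenizer contents).
english = ["the", "Ġhouse", "Ġis", "Ġgreen"]
german = ["the", "Ġhaus", "Ġist", "Ġgreen", "Ġgrün"]

missing, merged_size = merge_vocabs(english, german)
print(missing)
print(merged_size)

# The same arithmetic with the sizes reported in the comments above.
assert 50257 + 38044 == 88301
```

By the same arithmetic, the reported sizes imply that the two source vocabularies share 50257 - 38044 = 12213 tokens.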