---
license: mit
---

A GPT2 tokenizer for English and German with a vocabulary size of 88,301.

This tokenizer was created by merging the [original GPT2](https://huggingface.co/gpt2) tokenizer (English) with a [German tokenizer](https://huggingface.co/malteos/gpt2-xl-wechsel-german).

## Steps to reproduce

```python
from transformers import AutoTokenizer

# Load the original English GPT2 tokenizer and the German tokenizer
a_tokenizer = AutoTokenizer.from_pretrained('gpt2')
b_tokenizer = AutoTokenizer.from_pretrained('malteos/gpt2-xl-wechsel-german')

a_vocab = set(a_tokenizer.vocab.keys())  # len(a_vocab) = 50257
b_vocab = set(b_tokenizer.vocab.keys())  # len(b_vocab) = 50257

# Tokens in the German vocabulary that are missing from the English one
missing_tokens_in_a = b_vocab - a_vocab  # len = 38044

# Add the missing German tokens to the English tokenizer
a_tokenizer.add_tokens(list(missing_tokens_in_a))

# Save the merged tokenizer
a_tokenizer.save_pretrained('opengptx-en-de')  # len(a_tokenizer) = 88301
```
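
The merged tokenizer can then be loaded like any other Hugging Face tokenizer. A minimal sketch, assuming the local `opengptx-en-de` directory produced above (or the corresponding Hub repo id); the example sentences are illustrative only:

```python
from transformers import AutoTokenizer

# Load the merged English/German tokenizer saved in the previous step
tokenizer = AutoTokenizer.from_pretrained('opengptx-en-de')

# Both English and German text are covered by the merged vocabulary
print(tokenizer.tokenize('Hello, how are you?'))
print(tokenizer.tokenize('Hallo, wie geht es dir?'))
```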