The HerBERT tokenizer is a character-level byte-pair encoding (BPE) tokenizer with a vocabulary of 50k tokens. It was trained on Wolne Lektury and a publicly available subset of the National Corpus of Polish using the fastBPE library. The tokenizer uses the XLMTokenizer implementation from transformers.
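
As a rough illustration of how character-level byte-pair encoding builds subword tokens, the sketch below learns a handful of merges from a toy word list and applies them to a word. It is a simplified, pure-Python approximation and not the fastBPE implementation; the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, and the tiny corpus are assumptions made for this example only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Start from single characters, with a hypothetical end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the most frequent pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus, for illustration only (not the real training data).
corpus = ["lepszy", "lepszą", "lepsze", "rząd"]
merges = learn_bpe(corpus, num_merges=10)
print(apply_bpe("lepszy", merges))
```

A real tokenizer learns tens of thousands of merges from a large corpus, so frequent words end up as single tokens while rare words are split into smaller pieces.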

Tokenizer usage

The HerBERT tokenizer should be used together with the HerBERT model:

from transformers import XLMTokenizer, RobertaModel

# Load the HerBERT tokenizer and the matching HerBERT model.
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")

# Encode a sentence into a tensor of token IDs and run it through the model.
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
outputs = model(encoded_input)


License: CC BY-SA 4.0


If you use this tokenizer, please cite the following paper:

    title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
    author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},

The paper has been accepted at ACL 2020. We will update the BibTeX entry as soon as the proceedings appear.


The tokenizer was created by the Allegro Machine Learning Research team.

You can contact us at: klejbenchmark@allegro.pl

