Model: allegro/herbert-klej-cased-tokenizer-v1



HerBERT tokenizer

HerBERT tokenizer is a character-level byte-pair encoding (BPE) tokenizer with a vocabulary size of 50k tokens. It was trained on Wolne Lektury and a publicly available subset of the National Corpus of Polish using the fastBPE library. The tokenizer uses the XLMTokenizer implementation from transformers.
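
For readers unfamiliar with byte-pair encoding, the idea can be illustrated with a toy sketch in plain Python. This is not the fastBPE implementation and it does not reproduce the HerBERT vocabulary; the function names (`learn_bpe`, `merge_pair`, `segment`), the tie-breaking rule, and the use of a separate `</w>` end-of-word symbol are assumptions made only for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Start from single characters plus an end-of-word marker (an assumption of this sketch).
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically to stay deterministic.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword symbols by applying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Tiny usage example on a made-up corpus.
toy_merges = learn_bpe(["ma", "ma", "ma", "mi"], num_merges=2)
print(toy_merges)
print(segment("ma", toy_merges), segment("mi", toy_merges))
```

With enough merges, frequent words collapse into a single symbol while rare words remain split into smaller pieces, which is how a fixed-size vocabulary can still cover arbitrary text.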

Tokenizer usage

The HerBERT tokenizer should be used together with the HerBERT model:

from transformers import XLMTokenizer, RobertaModel

# Load the tokenizer and the matching HerBERT model weights.
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")

# Encode a sentence into a tensor of token IDs and run it through the model.
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
outputs = model(encoded_input)


License: CC BY-SA 4.0


If you use this tokenizer, please cite the following paper:

    title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
    author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},

The paper has been accepted at ACL 2020. We will update the BibTeX entry as soon as the proceedings appear.


The tokenizer was created by the Allegro Machine Learning Research team.

You can contact us at: