Model: allegro/herbert-klej-cased-tokenizer-v1

HerBERT tokenizer

The HerBERT tokenizer is a character-level byte-pair encoding (BPE) tokenizer with a vocabulary size of 50k tokens. It was trained on Wolne Lektury and a publicly available subset of the National Corpus of Polish using the fastBPE library. The tokenizer relies on the XLMTokenizer implementation from transformers.
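
As a quick illustration of the subword behaviour described above, the snippet below is a minimal sketch that loads the tokenizer on its own and prints the BPE pieces for a sample sentence (the sentence is arbitrary; note that XLMTokenizer needs the sacremoses package installed):

from transformers import XLMTokenizer

tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")

# Split a sample sentence into BPE subword pieces (any Polish text will do).
print(tokenizer.tokenize("Kto ma lepszą sztukę, ma lepszy rząd – to jasne."))

# The vocabulary holds roughly 50k tokens.
print(len(tokenizer))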

Tokenizer usage

The HerBERT tokenizer should be used together with the HerBERT model:

from transformers import XLMTokenizer, RobertaModel

# Load the tokenizer and the matching HerBERT model checkpoint.
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")

# Encode a sample sentence and run it through the model.
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
outputs = model(encoded_input)
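
The forward pass returns the model's hidden states. As a minimal sketch of reading the contextual token embeddings, indexing the output works both with the tuples returned by older transformers releases and with the model-output objects returned by newer ones:

last_hidden_state = outputs[0]  # shape: (batch_size, sequence_length, hidden_size)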

License

CC BY-SA 4.0

Citation

If you use this tokenizer, please cite the following paper:

@misc{rybak2020klej,
    title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
    author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
    year={2020},
    eprint={2005.00630},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

The paper was accepted at ACL 2020; as soon as the proceedings appear, we will update the BibTeX entry.

Authors

The tokenizer was created by the Allegro Machine Learning Research team.

You can contact us at: klejbenchmark@allegro.pl