metadata

language: pl
tags:
  - herbert
license: cc-by-sa-4.0

HerBERT

HerBERT is a BERT-based language model trained on Polish corpora using the masked language modelling (MLM) and sentence structural objective (SSO) training objectives, with dynamic masking of whole words. Model training and experiments were conducted with version 2.9 of the transformers library.
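To make "dynamic masking of whole words" concrete, here is a minimal toy sketch. It is not the training code used for HerBERT: the subword segmentation, the `<mask>` placeholder string, the masking probability, and the helper name `mask_whole_words` are all assumptions chosen for illustration, and the usual BERT refinement of sometimes keeping or randomly replacing a selected token is left out. The point is only that all pieces of a selected word are masked together, and that the selection is redrawn every time the sentence is seen.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, assumed for this sketch


def mask_whole_words(words, mask_prob=0.15, rng=None):
    """Mask entire words at random.

    `words` is a list of words, each given as a list of subword pieces.
    Returns a flat list of pieces in which every piece of a selected
    word is replaced by MASK.
    """
    rng = rng or random
    masked = []
    for pieces in words:
        if rng.random() < mask_prob:
            # Whole-word masking: hide every piece of the word together.
            masked.extend([MASK] * len(pieces))
        else:
            masked.extend(pieces)
    return masked


# Hypothetical subword segmentation, for illustration only.
sentence = [["A"], ["po", "tem"], ["szed", "ł"], ["środ", "kiem"], ["drogi"]]

# Dynamic masking: a fresh mask is drawn on every pass over the data,
# instead of being fixed once during preprocessing.
rng = random.Random(0)
for epoch in range(3):
    print(epoch, mask_whole_words(sentence, mask_prob=0.3, rng=rng))
```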

Tokenizer

The training dataset was tokenized into subwords using CharBPETokenizer, a character-level byte-pair encoding, with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library. We encourage you to use the fast version of the tokenizer, namely HerbertTokenizerFast.
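For readers unfamiliar with character-level byte-pair encoding, the following toy sketch shows the basic idea: start from single characters and repeatedly merge the most frequent adjacent pair. It is a simplified illustration, not the tokenizers library implementation and not the procedure used to build the HerBERT vocabulary; the function name `learn_bpe`, the tiny corpus, and the `</w>` end-of-word marker attached to the last character are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from a whitespace-split corpus."""
    # Each word starts as a tuple of characters; the last character
    # carries an end-of-word marker (an assumption of this sketch).
    vocab = Counter(
        tuple(word[:-1]) + (word[-1] + "</w>",) for word in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Tiny made-up corpus, for illustration only.
merges, vocab = learn_bpe("droga drogi drodze dom domu", num_merges=5)
print(merges)
print(dict(vocab))
```

A real vocabulary of 50k tokens results from running many more merges over a large corpus; the loop structure stays the same.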

HerBERT usage

Example code:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-large-cased")
model = AutoModel.from_pretrained("allegro/herbert-large-cased")

# Encode one sentence pair and run it through the model.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
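
One common way to turn the model output into a single vector per input is to average the token vectors while ignoring padding. The sketch below is one possible approach, not something prescribed by the authors; the helper name `mean_pool` is hypothetical, and the random tensors merely stand in for real model output so that the snippet runs without downloading the model.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence, skipping padded positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over features.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for fully padded rows.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Stand-in tensors shaped like encoder output: (batch, seq_len, hidden).
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 8])
```

To apply this to the example above, keep the result of `tokenizer.batch_encode_plus(...)` in a variable (say `encoded`), pass `**encoded` to the model, and call `mean_pool(output[0], encoded["attention_mask"])`, where `output[0]` is the last hidden state.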

License

CC BY-SA 4.0

Authors

The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

You can contact us at: klejbenchmark@allegro.pl