---
language: pl
tags:
- herbert
license: cc-by-sa-4.0
---

# HerBERT

**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based language model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) version 2.9.

## Tokenizer

The training dataset was tokenized into subwords using ``CharBPETokenizer``, a character-level byte-pair encoder with a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library. We kindly encourage you to use the **Fast** version of the tokenizer, namely ``HerbertTokenizerFast`` (a minimal loading sketch is given at the end of this card).

## HerBERT usage

Example code:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-large-cased")
model = AutoModel.from_pretrained("allegro/herbert-large-cased")

# Encode a pair of sentences and run them through the model.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```

## License

CC BY-SA 4.0

## Authors

The model was trained by the **Machine Learning Research Team at Allegro** and the **Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences**.

You can contact us at: klejbenchmark@allegro.pl
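
## Loading the fast tokenizer

As referenced in the Tokenizer section, the sketch below shows one way to load ``HerbertTokenizerFast`` directly and inspect the subword segmentation it produces. This is a minimal sketch, assuming the ``allegro/herbert-large-cased`` checkpoint (the one used in the usage example above) ships its tokenizer files in the standard Hugging Face Hub layout; the example sentence is illustrative only.

```python
from transformers import HerbertTokenizerFast

# Load the fast (Rust-backed) tokenizer for the same checkpoint
# used in the usage example above.
tokenizer = HerbertTokenizerFast.from_pretrained("allegro/herbert-large-cased")

# Inspect the character-level BPE subword segmentation
# of an illustrative Polish sentence.
print(tokenizer.tokenize("Ala ma kota."))
```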