
plT5 Base

plT5 models are T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising objective.
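To make the denoising objective concrete, here is a minimal, illustrative sketch of T5-style span corruption: selected spans of the input are replaced by sentinel tokens, and the target reconstructs the dropped spans. The `span_corrupt` helper and the fixed spans are hypothetical simplifications (real pretraining samples spans randomly over subword tokens); only the sentinel format `<extra_id_N>` follows the T5 convention.

```python
def span_corrupt(tokens, spans):
    """Illustrative T5-style span corruption: each (start, end) span is
    replaced in the input by a sentinel token; the target lists each
    sentinel followed by the tokens it replaced."""
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[prev:start])   # keep tokens before the span
        inp.append(sentinel)             # stand-in for the dropped span
        tgt.append(sentinel)             # target announces the sentinel...
        tgt.extend(tokens[start:end])    # ...then spells out the dropped tokens
        prev = end
    inp.extend(tokens[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # final sentinel terminates the target
    return inp, tgt

tokens = "plT5 is a T5 model trained on Polish corpora".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (6, 7)])
print(" ".join(inp))  # plT5 is <extra_id_0> model trained <extra_id_1> Polish corpora
print(" ".join(tgt))  # <extra_id_0> a T5 <extra_id_1> on <extra_id_2>
```

The model reads the corrupted input and is trained to generate the target sequence, which is what the "denoising objective" refers to above.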


plT5 was trained on six different corpora available for the Polish language:

| Corpus | Tokens | Documents |
|---|---|---|
| CCNet Middle | 3243M | 7.9M |
| CCNet Head | 2641M | 7.0M |
| National Corpus of Polish | 1357M | 3.9M |
| Open Subtitles | 1056M | 1.1M |
| Wikipedia | 260M | 1.4M |
| Wolne Lektury | 41M | 5.5k |


The training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary size of 50k tokens.
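A unigram tokenizer segments text into the most probable sequence of subword pieces under a unigram language model. The sketch below illustrates that search with a tiny, made-up vocabulary and probabilities; the actual plT5 sentencepiece model has ~50k pieces learned from the training corpora, and the `unigram_segment` helper is hypothetical.

```python
import math

# Toy unigram vocabulary with made-up piece probabilities.
vocab = {"pol": 0.1, "ski": 0.1, "pols": 0.02, "ki": 0.05,
         "p": 0.01, "o": 0.01, "l": 0.01, "s": 0.01, "k": 0.01, "i": 0.01}

def unigram_segment(text, vocab):
    """Viterbi search for the most probable segmentation of `text`
    under a unigram model over subword pieces."""
    n = len(text)
    # best[i] = (log-prob of best segmentation of text[:i], split point)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end to recover the pieces.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

print(unigram_segment("polski", vocab))  # ['pol', 'ski']
```

Here "pol" + "ski" beats "pols" + "ki" because its total log-probability is higher; a trained sentencepiece model resolves such ties the same way, using probabilities estimated from the corpus.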


Example code:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")
model = AutoModel.from_pretrained("allegro/plt5-base")
```


License: CC BY 4.0


If you use this model, please cite the following paper:

```
@article{chrabrowa2022evaluation,
  title={Evaluation of Transfer Learning for Polish with a Text-to-Text Model},
  author={Chrabrowa, Aleksandra and Dragan, {\L}ukasz and Grzegorczyk, Karol and Kajtoch, Dariusz and Koszowski, Miko{\l}aj and Mroczkowski, Robert and Rybak, Piotr},
  journal={arXiv preprint arXiv:2205.08808},
  year={2022}
}
```


The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

You can contact us at: klejbenchmark@allegro.pl
