plT5 models are T5-based language models trained on Polish corpora. The models were optimized for the original T5 span-corruption denoising objective.
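The denoising objective can be illustrated with a toy, word-level sketch (the actual training corrupts subword spans, but the input/target format with T5's `<extra_id_n>` sentinel tokens is the same). The span positions below are made up for illustration.

```python
# Toy illustration of T5-style span corruption (denoising).
# Each corrupted span in the input is replaced by a sentinel token;
# the target lists the sentinels followed by the removed spans.

def span_corrupt(words, spans):
    """Replace each (start, end) word span with a sentinel; return (input, target)."""
    inp, tgt = [], []
    pos = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(words[pos:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(words[start:end])
        pos = end
    inp.extend(words[pos:])
    tgt.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(inp), " ".join(tgt)

words = "Ala ma kota i psa".split()
inp, tgt = span_corrupt(words, [(1, 2), (3, 4)])
print(inp)  # Ala <extra_id_0> kota <extra_id_1> psa
print(tgt)  # <extra_id_0> ma <extra_id_1> i <extra_id_2>
```

The model reads the corrupted input and learns to generate the target, i.e. to reconstruct the masked spans.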
plT5 was trained on six different corpora available for the Polish language:
| Corpus | Tokens | Documents |
| :----- | -----: | --------: |
| National Corpus of Polish | 1357M | 3.9M |
The training dataset was tokenized into subwords using a SentencePiece unigram model with a vocabulary size of 50k tokens.
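A unigram model segments text into the subword sequence with the highest product of piece probabilities, which can be found with dynamic programming. The sketch below shows the idea on a hand-made toy vocabulary (the pieces and probabilities are invented for illustration, not taken from the plT5 tokenizer):

```python
import math

# Conceptual sketch of unigram-LM segmentation (the idea behind
# SentencePiece's unigram model): pick the split of the text into
# vocabulary pieces that maximizes the sum of log-probabilities.
# The vocabulary below is a made-up toy example.
vocab = {"war": 0.05, "szawa": 0.03, "wa": 0.02, "r": 0.01,
         "sza": 0.02, "w": 0.01, "a": 0.04, "s": 0.01, "z": 0.01}

def segment(text):
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (best score, backpointer)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - 8), i):  # pieces up to 8 chars
            piece = text[j:i]
            if piece in vocab and best[j][0] > -math.inf:
                score = best[j][0] + math.log(vocab[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack from the end to recover the best segmentation.
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]

print(segment("warszawa"))  # ['war', 'szawa']
```

The real tokenizer learns the piece inventory and probabilities from the training corpus; only the segmentation step is sketched here.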
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")
model = AutoModel.from_pretrained("allegro/plt5-base")
```
License: CC BY 4.0
If you use this model, please cite the following paper:
You can contact us at: email@example.com