# Slovak RoBERTa Masked Language Model

### 83M parameters in the small model. Medium and Large models coming soon!

Pretrained RoBERTa tokenizer vocab and merges included.

---

## Training params

- **Dataset**: 8 GB Slovak monolingual dataset, including ParaCrawl (monolingual), OSCAR, and several gigabytes of data I collected and cleaned myself.
- **Preprocessing**: Tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
- **Evaluation results** (reproducible with the fill-mask sketch in the Usage section below):
  - Mnoho ľudí tu MASK
    - žije.
    - žijú.
    - je.
    - trpí.
  - Ako sa MASK
    - máte
    - máš
    - má
    - hovorí
  - Plážová sezóna pod Zoborom patrí medzi MASK obdobia.
    - ročné
    - najkrajšie
    - najobľúbenejšie
    - najnáročnejšie
- **Limitations**: The current model is fairly small, although it performs well. It is meant to be fine-tuned on downstream tasks, e.g. part-of-speech tagging, question answering, or anything in GLUE or SuperGLUE.
- **Credit**: If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in that work.
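
## Usage

A minimal sketch of how masked predictions like those in the evaluation section can be reproduced with the Hugging Face `transformers` fill-mask pipeline. The model identifier below is a placeholder, not the actual Hub repo name; substitute the real one.

```python
from transformers import pipeline

# Placeholder Hub identifier -- replace with the actual repo name for this model.
MODEL_ID = "your-username/slovak-roberta-small"

fill_mask = pipeline("fill-mask", model=MODEL_ID, tokenizer=MODEL_ID)

# RoBERTa-style models use "<mask>" as the mask token.
for prediction in fill_mask("Mnoho ľudí tu <mask>"):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```

The included vocab and merges files can also be loaded directly with the `tokenizers` library to inspect the uncased ByteLevelBPE preprocessing. The file names below assume the standard `vocab.json`/`merges.txt` layout.

```python
from tokenizers import ByteLevelBPETokenizer

# Assumes the repo's vocab.json and merges.txt are in the working directory.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt", lowercase=True)

encoded = tokenizer.encode("Plážová sezóna pod Zoborom patrí medzi najkrajšie obdobia.")
print(encoded.tokens)
```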