
# Slovak RoBERTa Masked Language Model

### 83M parameters in the small model

Medium and Large models coming soon!

The pretrained RoBERTa tokenizer vocab and merges files are included.
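For context, these are ordinary files: `vocab.json` maps each token string to an integer id, and `merges.txt` lists the learned BPE merge rules, one pair per line in priority order. A toy illustration of the format (the contents below are invented for illustration; the real files are far larger):

```python
import json
import tempfile
from pathlib import Path

# Invented stand-ins for the real tokenizer files.
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 4,
         "ah": 5, "oj": 6, "ahoj": 7}
merges = ["a h", "o j", "ah oj"]  # each line: the two symbols to merge

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "vocab.json").write_text(json.dumps(vocab), encoding="utf-8")
    (Path(d) / "merges.txt").write_text("\n".join(merges), encoding="utf-8")

    # Reading them back is just JSON / line parsing.
    loaded = json.loads((Path(d) / "vocab.json").read_text(encoding="utf-8"))
    first_merge = (Path(d) / "merges.txt").read_text(encoding="utf-8").splitlines()[0]

print(loaded["<mask>"])   # 4
print(first_merge)        # a h
```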


## Training params

  • Dataset: an 8 GB Slovak monolingual dataset including ParaCrawl (monolingual), OSCAR, and several gigabytes of my own scraped and cleaned data.

  • Preprocessing: tokenized with a ByteLevelBPETokenizer pretrained on the same dataset. Uncased, with `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
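The tokenizer itself comes from the Hugging Face `tokenizers` library; as a rough sketch of the core byte-level BPE idea (single bytes as the base alphabet, learned merges applied in priority order), here is a minimal pure-Python version with a hand-picked merge list:

```python
def apply_bpe(word: bytes, merges: list) -> list:
    """Greedily apply BPE merges to a word, starting from single bytes.

    A simplification of the real tokenizers library: merges is a list of
    (left, right) byte-string pairs in learned priority order.
    """
    symbols = [bytes([b]) for b in word]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge in place, recheck same index
            else:
                i += 1
    return symbols

# Hypothetical merges a trained tokenizer might have learned.
merges = [(b"a", b"h"), (b"o", b"j"), (b"ah", b"oj")]
print(apply_bpe("ahoj".encode("utf-8"), merges))  # [b'ahoj']
```

Because the base alphabet is bytes, any input string is representable and `<unk>` is only needed for genuinely unseen whole tokens.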

  • Evaluation results:

    • Mnoho ľudí tu `<mask>` ("Many people here `<mask>`")
      • žije. ("lives")
      • žijú. ("live")
      • je. ("is")
      • trpí. ("suffers")
    • Ako sa `<mask>` ("How `<mask>`")
      • máte ("are you", formal)
      • máš ("are you", informal)
      • hovorí ("does one say")
    • Plážová sezóna pod Zoborom patrí medzi `<mask>` obdobia. ("The beach season below Zobor is among the `<mask>` periods.")
      • ročné ("annual")
      • najkrajšie ("most beautiful")
      • najobľúbenejšie ("most popular")
      • najnáročnejšie ("most demanding")
  • Limitations: the current model is fairly small, although it works well. It is meant to be fine-tuned on downstream tasks, e.g. part-of-speech tagging, question answering, or anything in GLUE or SuperGLUE.

  • Credit: if you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in that work.
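The ranked completions in the evaluation section above are what a fill-mask pipeline produces: take the model's logits at the `<mask>` position, softmax them over the vocabulary, and sort. A minimal sketch with invented logits over a toy vocabulary (the numbers are made up, not actual model outputs):

```python
import math

def top_k_fills(logits: dict, k: int = 3) -> list:
    """Softmax a token->logit map and return the k most probable tokens."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}  # shift by max for stability
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Invented logits for the <mask> in "Mnoho ľudí tu <mask>".
logits = {"žije.": 5.1, "žijú.": 4.8, "je.": 3.2, "trpí.": 2.9, "auto": -1.0}
for tok, p in top_k_fills(logits):
    print(f"{tok}\t{p:.3f}")
```

In practice `transformers.pipeline("fill-mask", ...)` does exactly this ranking for you once the model is published on the Hub.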

Model size: 85.8M params (Safetensors; tensor types I64, F32)