# Slovak RoBERTa Masked Language Model

### 83M parameters in the small model

Medium and Large models coming soon!

Pretrained RoBERTa tokenizer vocab and merges are included.
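The included vocab and merges come from a byte-level BPE tokenizer. A minimal sketch of training one with the same special tokens, using the Hugging Face `tokenizers` library — the corpus and hyperparameters below are tiny stand-ins, not the actual training setup:

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpus; the real tokenizer was trained on ~8 GB of Slovak text.
corpus = [
    "Mnoho ľudí tu žije.",
    "Ako sa máš?",
    "Plážová sezóna pod Zoborom patrí medzi najkrajšie obdobia.",
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("\n".join(corpus))
    corpus_path = f.name

# Uncased, with the <s>, <pad>, </s>, <unk>, <mask> special tokens.
tok = ByteLevelBPETokenizer(lowercase=True)
tok.train(
    files=[corpus_path],
    vocab_size=500,  # stand-in; a real vocab would be much larger
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

print(tok.encode("ako sa máš?").tokens)
os.remove(corpus_path)
```

Training produces the `vocab.json` and `merges.txt` pair that RoBERTa-style models load at pretraining and inference time.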

---

## Training params
- **Dataset**:
  8 GB Slovak monolingual dataset, including ParaCrawl (monolingual portion), OSCAR, and several gigabytes of my own scraped and cleaned data.
- **Preprocessing**:
  Tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
- **Evaluation results**:
  - Mnoho ľudí tu `<mask>` ("Many people here `<mask>`")
    - žije. ("live")
    - žijú. ("live")
    - je. ("is")
    - trpí. ("suffer")
  - Ako sa `<mask>` ("How are `<mask>`" / "How does one `<mask>`")
    - máte ("you", formal)
    - máš ("you", informal)
    - hovorí ("say")
  - Plážová sezóna pod Zoborom patrí medzi `<mask>` obdobia. ("The beach season below Zobor ranks among the `<mask>` periods.")
    - ročné ("yearly")
    - najkrajšie ("most beautiful")
    - najobľúbenejšie ("most popular")
    - najnáročnejšie ("most demanding")
    
- **Limitations**:
  The current model is fairly small, though it performs well for its size. It is intended to be fine-tuned on downstream tasks, e.g. part-of-speech tagging, question answering, or any task in GLUE or SuperGLUE.
  
- **Credit**:
  If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in that work.
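The evaluation examples above come from mask filling. A usage sketch with the 🤗 Transformers `fill-mask` pipeline — the model id below is a placeholder, not this model's actual repository name, so substitute the real one:

```python
from transformers import pipeline

# NOTE: "your-username/slovak-roberta-small" is a placeholder model id,
# not the real repository name for this model.
fill = pipeline("fill-mask", model="your-username/slovak-roberta-small")

# The pipeline returns the top candidates for the <mask> position,
# each with a confidence score and the predicted token string.
for pred in fill("Mnoho ľudí tu <mask>."):
    print(f'{pred["token_str"]!r}: {pred["score"]:.3f}')
```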