IMJONEZZ committed on
Commit e9a1d7b
1 Parent(s): 1642b86

Update README.md

Files changed (1)
  1. README.md +33 -3
README.md CHANGED
@@ -1,5 +1,35 @@
- Slovak RoBERTA Masked Language Model

- 83Mil Parameters in small model

- Medium and Large models coming soon
+ # Slovak RoBERTa Masked Language Model

+ ### 83M parameters in the small model

+ Medium and Large models coming soon!
+
+ ---
+
+ ## Training params:
+ - **Dataset**:
+ 8 GB Slovak monolingual dataset, including ParaCrawl (monolingual), OSCAR, and several gigabytes of data I collected and cleaned myself.
+ - **Preprocessing**:
+ Tokenized with a ByteLevelBPETokenizer trained on the same dataset. Uncased, with `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
+ - **Evaluation results**:
+ `Mnoho ľudí tu<mask>`
+ žije.
+ žijú.
+ je.
+ trpí.
+ `Ako sa<mask>`
+ máte
+ máš
+
+ hovorí
+ `Plážová sezóna pod Zoborom patrí medzi<mask> obdobia.`
+ ročné
+ najkrajšie
+ najobľúbenejšie
+ najnáročnejšie
+
+ - **Limitations**:
+ The current model is fairly small, though it works well. It is meant to be fine-tuned on downstream tasks such as part-of-speech tagging, question answering, or anything in GLUE or SuperGLUE.
+
+ - **Credit**:
+ If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in that work.
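
The **Preprocessing** entry in the README describes an uncased byte-level BPE tokenizer with five special tokens. As a rough illustration only, a tokenizer of that shape could be trained with the Hugging Face `tokenizers` library as sketched here. The three-sentence corpus, `vocab_size=300`, and `min_frequency=1` are placeholders chosen so the sketch runs quickly; they are assumptions, not the author's actual settings, and `lowercase=True` is only one plausible reading of "uncased".

```python
from tokenizers import ByteLevelBPETokenizer

# Special tokens in the order the README lists them.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

# Tiny stand-in corpus (assumption): the real model used an 8 GB dataset.
corpus = [
    "Mnoho ľudí tu žije.",
    "Ako sa máte?",
    "Plážová sezóna pod Zoborom patrí medzi najkrajšie obdobia.",
]

# lowercase=True is one way to approximate the "uncased" setting (assumption).
tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train_from_iterator(
    corpus,
    vocab_size=300,      # hypothetical value, far too small for real use
    min_frequency=1,     # hypothetical value, needed for such a tiny corpus
    special_tokens=SPECIAL_TOKENS,
    show_progress=False,
)

# Encode one of the README's example sentences and decode it back.
encoding = tokenizer.encode("Mnoho ľudí tu žije.")
decoded = tokenizer.decode(encoding.ids)

print(encoding.tokens)
print(decoded)
```

Because the byte-level alphabet covers every byte, no input maps to `<unk>`, and decoding returns the lowercased input text. For real training, the corpus iterator would stream the full dataset and `vocab_size` would be several orders of magnitude larger.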