IMJONEZZ committed on
Commit
141147e
1 Parent(s): 34ea2f0

Update README.md

Files changed (1)
  1. README.md +17 -17
README.md CHANGED
@@ -12,24 +12,24 @@ RoBERTA pretrained tokenizer vocab and merges included.
  - **Dataset**:
  8GB Slovak Monolingual dataset including ParaCrawl (monolingual), OSCAR, and several gigs of my own findings and cleaning.
  - **Preprocessing**:
- Tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with <s>, <pad>, </s>, <unk>, and <mask> special tokens.
+ Tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with s, pad, /s, unk, and mask special tokens.
  - **Evaluation results**:
- Mnoho ľudí tu<mask>
- žije.
- žijú.
- je.
- trpí.
- Ako sa<mask>
- máte
- máš
-
- hovorí
- Plážová sezóna pod Zoborom patrí medzi<mask> obdobia.
- ročné
- najkrajšie
- najobľúbenejšie
- najnáročnejšie
-
+ - Mnoho ľudí tu<mask>
+ * žije.
+ * žijú.
+ * je.
+ * trpí.
+ - Ako sa<mask>
+ * máte
+ * máš
+ *
+ * hovorí
+ - Plážová sezóna pod Zoborom patrí medzi<mask> obdobia.
+ * ročné
+ * najkrajšie
+ * najobľúbenejšie
+ * najnáročnejšie
+
  - **Limitations**:
  The current model is fairly small, although it works very well. This model is meant to be finetuned on downstream tasks e.g. Part-of-Speech tagging, Question Answering, anything in GLUE or SUPERGLUE.
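
The evaluation results in this README are fill-mask completions: each prompt contains one `<mask>` and the entries beneath it are candidate tokens. As a rough illustration of how to read them (not the author's evaluation script), the sketch below substitutes each listed candidate into its prompt. The helper name `fill_candidates` is hypothetical, and adding a space before each candidate is an assumption, based on byte-level BPE tokens usually carrying their preceding space, which would explain why the prompts have no space before `<mask>`. The empty entry under `Ako sa<mask>` is left out.

```python
from typing import Dict, List

MASK = "<mask>"


def fill_candidates(prompt: str, candidates: List[str]) -> List[str]:
    """Return one completed sentence per candidate token (hypothetical helper)."""
    if prompt.count(MASK) != 1:
        raise ValueError("prompt must contain exactly one <mask>")
    # Assumption: the README lists candidates without their leading space,
    # so a single space is re-inserted in front of each one.
    return [prompt.replace(MASK, " " + candidate) for candidate in candidates]


# Prompts and candidates copied from the README's evaluation results.
examples: Dict[str, List[str]] = {
    "Mnoho ľudí tu<mask>": ["žije.", "žijú.", "je.", "trpí."],
    "Ako sa<mask>": ["máte", "máš", "hovorí"],
    "Plážová sezóna pod Zoborom patrí medzi<mask> obdobia.": [
        "ročné",
        "najkrajšie",
        "najobľúbenejšie",
        "najnáročnejšie",
    ],
}

for prompt, candidates in examples.items():
    for sentence in fill_candidates(prompt, candidates):
        print(sentence)
```

This only reconstructs the completed sentences for reading; it does not load the model or score anything.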