Update README.md
README.md
CHANGED
@@ -12,24 +12,24 @@ RoBERTA pretrained tokenizer vocab and merges included.
 - **Dataset**:
 8GB Slovak Monolingual dataset including ParaCrawl (monolingual), OSCAR, and several gigs of my own findings and cleaning.
 - **Preprocessing**:
-Tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with
+Tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with s, pad, /s, unk, and mask special tokens.
 - **Evaluation results**:
-Mnoho ľudí tu<mask>
-žije.
-žijú.
-je.
-trpí.
-Ako sa<mask>
-máte
-máš
-má
-hovorí
-Plážová sezóna pod Zoborom patrí medzi<mask> obdobia.
-ročné
-najkrajšie
-najobľúbenejšie
-najnáročnejšie
-
+- Mnoho ľudí tu<mask>
+* žije.
+* žijú.
+* je.
+* trpí.
+- Ako sa<mask>
+* máte
+* máš
+* má
+* hovorí
+- Plážová sezóna pod Zoborom patrí medzi<mask> obdobia.
+* ročné
+* najkrajšie
+* najobľúbenejšie
+* najnáročnejšie
+
 - **Limitations**:
 The current model is fairly small, although it works very well. This model is meant to be finetuned on downstream tasks e.g. Part-of-Speech tagging, Question Answering, anything in GLUE or SUPERGLUE.
 
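
The preprocessing line in this change describes an uncased `ByteLevelBPETokenizer` with five special tokens. As a rough illustration only, the sketch below trains such a tokenizer with the Hugging Face `tokenizers` library. The angle-bracket token spellings (`<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>`) are an assumption based on the usual RoBERTa convention, since the README text lists them without brackets. The three-sentence corpus, the `train_uncased_tokenizer` helper, and the vocabulary size are hypothetical stand-ins, not the author's actual setup.

```python
from tokenizers import ByteLevelBPETokenizer

# Assumed RoBERTa-style spellings; the README lists the tokens without brackets.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

# Hypothetical stand-in corpus; the README describes roughly 8GB of Slovak text.
corpus = [
    "Mnoho ľudí tu žije.",
    "Ako sa máte",
    "Plážová sezóna pod Zoborom patrí medzi najkrajšie obdobia.",
]


def train_uncased_tokenizer(lines, vocab_size=400):
    """Train a lowercasing byte-level BPE tokenizer on an iterable of strings."""
    # lowercase=True adds a lowercasing normalizer, which makes the vocab uncased.
    tokenizer = ByteLevelBPETokenizer(lowercase=True)
    tokenizer.train_from_iterator(
        lines,
        vocab_size=vocab_size,  # illustrative value, not the model's real vocab size
        min_frequency=1,
        show_progress=False,
        special_tokens=SPECIAL_TOKENS,
    )
    return tokenizer


tokenizer = train_uncased_tokenizer(corpus)
encoding = tokenizer.encode("Mnoho ľudí tu žije.")
roundtrip = tokenizer.decode(encoding.ids)
print(encoding.tokens)
print(roundtrip)
```

Because byte-level BPE is lossless apart from the lowercasing step, decoding the ids should give back the lowercased input. The real tokenizer would be loaded from the vocab and merges files shipped with the model rather than retrained.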