Mihai-Dan MAŞALA committed
Commit: 3b2670e
Parent: 23a9a61

Update README

Files changed (1)
  1. README.md +40 -40
README.md CHANGED
@@ -13,11 +13,11 @@ language:
 Pretrained model on Romanian language using a masked language modeling (MLM) and next sentence prediction (NSP) objective.
 It was introduced in this [paper](https://www.blank.org/). Three BERT models were released: **RoBERT-small**, RoBERT-base and RoBERT-large, all versions uncased.

- | Model | Weights | L | H | A | MLM accuracy | NSP accuracy |
- |--------------|:---------:|:------:|:------:|:------:|:------------:|:------------:|
- | *RoBERT-small* | *19M* | *12* | *256* | *8* | *0.5363* | *0.9687* |
- | RoBERT-base | 114M | 12 | 768 | 12 | 0.6511 | 0.9802 |
- | RoBERT-large | 341M | 24 | 1024 | 24 | 0.6929 | 0.9843 |
+ | Model | Weights | L | H | A | MLM accuracy | NSP accuracy |
+ |----------------|:---------:|:------:|:------:|:------:|:--------------:|:--------------:|
+ | *RoBERT-small* | *19M* | *12* | *256* | *8* | *0.5363* | *0.9687* |
+ | RoBERT-base | 114M | 12 | 768 | 12 | 0.6511 | 0.9802 |
+ | RoBERT-large | 341M | 24 | 1024 | 24 | 0.6929 | 0.9843 |

 
 
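Reader's aside, not part of the committed README: the hunk above introduces the three uncased RoBERT checkpoints and their MLM/NSP pretraining accuracies. A minimal fill-mask sketch with the Hugging Face `transformers` library is shown below, echoing the `outputs = model(**inputs)` usage the README already contains; the hub ID `readerbench/RoBERT-base` is an assumption, so substitute whichever repository actually hosts the checkpoint.

```python
# Minimal sketch (not part of this commit): query the RoBERT MLM head directly.
# The hub ID below is an assumption -- replace it with the actual repository.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "readerbench/RoBERT-base"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Mask one token in an (uncased) Romanian sentence and let the MLM head fill it.
text = f"orasul {tokenizer.mask_token} este capitala romaniei."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # top-5 candidate fillers
```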
@@ -53,12 +53,12 @@ outputs = model(**inputs)

 The model is trained on the following compilation of corpora. Note that we present the statistics after the cleaning process.

- Corpus | Words | Sentences | Size (GB)
- -------|---------|----------|----------|
- Oscar | 1.78B | 87M | 10.8
- RoTex | 240M | 14M | 1.5
- RoWiki | 50M | 2M | 0.3
- Total | 2.07B | 103M | 12.6
+ | Corpus | Words | Sentences | Size (GB)|
+ |-----------|-----------|-----------|----------|
+ | Oscar | 1.78B | 87M | 10.8 |
+ | RoTex | 240M | 14M | 1.5 |
+ | RoWiki | 50M | 2M | 0.3 |
+ | **Total** | **2.07B** | **103M** | **12.6** |

 
 ## Eval results
@@ -67,45 +67,45 @@ Total | 2.07B | 103M | 12.6

 We report Macro-averaged F1 score (in %)

- Model | Dev | Test
- -------|---------|----------
- multilingual-BERT | 68.96 | 69.57
- XLM-R-base | 71.26 | 71.71
- BERT-base-ro | 70.49 | 71.02
- RoBERT-small | 66.32 | 66.37
- RoBERT-base | 70.89 | 71.61
- RoBERT-large | 72.48 | 72.11
+ | Model | Dev | Test |
+ | -----------------|----------|----------|
+ | multilingual-BERT| 68.96 | 69.57 |
+ | XLM-R-base | 71.26 | 71.71 |
+ | BERT-base-ro | 70.49 | 71.02 |
+ | *RoBERT-small* | *66.32* | *66.37* |
+ | RoBERT-base | 70.89 | 71.61 |
+ | RoBERT-large | **72.48**| **72.11**|

 ### Moldavian vs. Romanian Dialect and Cross-dialect Topic identification

 We report results on [VarDial 2019](https://sites.google.com/view/vardial2019/campaign) Moldavian vs. Romanian Cross-dialect Topic identification Challenge, as Macro-averaged F1 score (in %).

- Model | Dialect Classification | MD to RO | RO to MD
- -------|---------|----------|----------|
- 2-CNN + SVM | 93.40 | 65.09 | 75.21
- Char+Word SVM | 96.20 | 69.08 | 81.93
- BiGRU | 93.30 | 70.10 | 80.30
- multilingual-BERT | 95.34 | 68.76 | 78.24
- XLM-R-base | 96.28 | 69.93 | 8228
- BERT-base-ro | 96.20 | 69.93 | 78.79
- RoBERT-small | 95.67 | 69.01 | 80.40
- RoBERT-base | 97.39 | 68.30 | 81.09
- RoBERT-large | 97.78 | 69.91 | 83.65
+ | Model | Dialect Classification | MD to RO | RO to MD|
+ |-------------------|------------------------|----------|----------|
+ | 2-CNN + SVM | 93.40 | 65.09 | 75.21 |
+ | Char+Word SVM | 96.20 | 69.08 | 81.93 |
+ | BiGRU | 93.30 | **70.10**| 80.30 |
+ | multilingual-BERT | 95.34 | 68.76 | 78.24 |
+ | XLM-R-base | 96.28 | 69.93 | 82.28 |
+ | BERT-base-ro | 96.20 | 69.93 | 78.79 |
+ | *RoBERT-small* | *95.67* | *69.01* | *80.40* |
+ | RoBERT-base | 97.39 | 68.30 | 81.09 |
+ | RoBERT-large | **97.78** | 69.91 | **83.65**|

 ### Diacritics Restoration

 Challenge can be found [here](https://diacritics-challenge.speed.pub.ro/). We report results on the official test set, as accuracies in %.

- Model | word level | char level
- -------|---------|----------
- BiLSTM | 99.42 | -
- CharCNN | 98.40 | 99.65
- CharCNN + multilingual-BERT | 99.72 | 99.94
- CharCNN + XLM-R-base | 99.76 | 99.95
- CharCNN + BERT-base-ro | 99.79 | 99.95
- CharCNN + RoBERT-small | 99.73 | 99.94
- CharCNN + RoBERT-base | 99.78 | 99.95
- CharCNN + RoBERT-large | 99.76 | 99.95
+ | Model | word level | char level |
+ |-----------------------------|------------|------------|
+ | BiLSTM | 99.42 | - |
+ | CharCNN | 98.40 | 99.65 |
+ | CharCNN + multilingual-BERT | 99.72 | 99.94 |
+ | CharCNN + XLM-R-base | 99.76 | **99.95** |
+ | CharCNN + BERT-base-ro | **99.79** | **99.95** |
+ | *CharCNN + RoBERT-small* | *99.73* | *99.94* |
+ | CharCNN + RoBERT-base | 99.78 | **99.95** |
+ | CharCNN + RoBERT-large | 99.76 | **99.95** |

 
 ### BibTeX entry and citation info
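Reader's note, not part of the commit: the first two result tables in the hunk above report macro-averaged F1 (in %), which averages the per-class F1 scores so that small classes count as much as large ones. A toy illustration with `scikit-learn`, using made-up labels rather than anything from the paper:

```python
# Illustrative only (not from the paper): macro F1 vs. plain accuracy
# on a deliberately imbalanced toy label set.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["RO", "RO", "RO", "RO", "MD", "MD"]  # made-up gold labels
y_pred = ["RO", "RO", "RO", "RO", "RO", "MD"]  # made-up predictions

# Accuracy rewards getting the majority class right; macro F1 averages the
# per-class F1 scores, so the minority class ("MD") weighs just as much.
print(f"accuracy: {accuracy_score(y_true, y_pred) * 100:.2f} %")
print(f"macro F1: {f1_score(y_true, y_pred, average='macro') * 100:.2f} %")
```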
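Similarly, the diacritics-restoration table in the same hunk reports both word-level and character-level accuracy. A toy sketch of the distinction, using a made-up sentence pair rather than challenge data:

```python
# Illustrative only: word-level vs. character-level accuracy for diacritics
# restoration. The sentence pair is made up; it is not challenge data.
gold = "și mașina merge înainte"
pred = "si mașina merge înainte"  # one word left without its diacritic

# Same length by construction, so position-wise comparison is safe here.
word_acc = sum(g == p for g, p in zip(gold.split(), pred.split())) / len(gold.split())
char_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(f"word level: {word_acc:.2%}")  # 3 of 4 words exactly right -> 75.00%
print(f"char level: {char_acc:.2%}")  # 22 of 23 characters match -> ~95.65%
```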