Mihai-Dan MAŞALA (25095) committed on
Commit 364ef40
1 Parent(s): 38b96c4

Updated README

Files changed (1)
  1. README.md +87 -6
README.md CHANGED
@@ -8,25 +8,106 @@ language:
  # RoBERT-small


- ## BERT small model for Romanian


  #### How to use

- TBC


  ## Training data

- TBC

- ## Training procedure

- TBC

  ## Eval results

- TBC

  ### BibTeX entry and citation info

  # RoBERT-small


+ ## Pretrained BERT model for Romanian
+
+ A BERT model pretrained on Romanian text with masked language modeling (MLM) and next sentence prediction (NSP) objectives.
+ It was introduced in this [paper](https://www.blank.org/). Three BERT models were released: RoBERT-small, RoBERT-base and RoBERT-large, all uncased.
+
+ Model | Weights | Layers (L) | Hidden size (H) | Attention heads (A) | MLM accuracy | NSP accuracy
+ -------|---------|----------|----------|----------|----------|----------
+ RoBERT-small | 19M | 12 | 256 | 8 | 0.5363 | 0.9687
+ RoBERT-base | 114M | 12 | 768 | 12 | 0.6511 | 0.9802
+ RoBERT-large | 341M | 24 | 1024 | 24 | 0.6929 | 0.9843
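As a rough sanity check on the Weights column, the parameter count of a BERT encoder can be estimated from the number of layers (L) and the hidden size (H). The sketch below is not the authors' code. It assumes the standard BERT layout (feed-forward width of 4 × H, 512 positions, 2 segment types) and a vocabulary of 38,000 tokens; the vocabulary size is an assumption that this README does not state. The number of attention heads does not change the count, so it is not a parameter.

```python
def estimate_bert_parameters(num_layers, hidden_size, vocab_size=38_000,
                             max_positions=512, type_vocab_size=2):
    """Rough parameter count for a BERT encoder (embeddings + layers + pooler).

    vocab_size=38_000 is an ASSUMPTION, not a value taken from the README.
    """
    intermediate_size = 4 * hidden_size  # standard BERT feed-forward width

    # Token, position and segment embeddings, plus one LayerNorm (scale + bias)
    embeddings = ((vocab_size + max_positions + type_vocab_size) * hidden_size
                  + 2 * hidden_size)

    # Query, key, value and output projections (weights + biases), plus LayerNorm
    attention = 4 * (hidden_size * hidden_size + hidden_size) + 2 * hidden_size

    # Two feed-forward projections (weights + biases), plus LayerNorm
    feed_forward = (2 * hidden_size * intermediate_size
                    + intermediate_size + hidden_size
                    + 2 * hidden_size)

    # Pooler: one dense layer applied to the first token's vector
    pooler = hidden_size * hidden_size + hidden_size

    return embeddings + num_layers * (attention + feed_forward) + pooler


# L and H values copied from the table above
estimates = {
    "RoBERT-small": estimate_bert_parameters(12, 256),
    "RoBERT-base": estimate_bert_parameters(12, 768),
    "RoBERT-large": estimate_bert_parameters(24, 1024),
}

for name, count in estimates.items():
    print(f"{name}: ~{count / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land within a few percent of the reported sizes; any remaining gap would come from the assumed vocabulary size and from heads that are not counted here.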
+
+
+ All models are available:
+
+ * [RoBERT-small](https://huggingface.co/readerbench/RoBERT-small)
+ * [RoBERT-base](https://huggingface.co/readerbench/RoBERT-base)
+ * [RoBERT-large](https://huggingface.co/readerbench/RoBERT-large)
+


  #### How to use

+ ```python
+ # TensorFlow
+ from transformers import AutoTokenizer, TFAutoModel
+ tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
+ model = TFAutoModel.from_pretrained("readerbench/RoBERT-small")
+ # Tokenize one Romanian sentence and return TensorFlow tensors
+ inputs = tokenizer("exemplu de propoziție", return_tensors="tf")
+ outputs = model(inputs)
+
+ # PyTorch
+ from transformers import AutoModel, AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
+ model = AutoModel.from_pretrained("readerbench/RoBERT-small")
+ # Tokenize one Romanian sentence and return PyTorch tensors
+ inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
+ outputs = model(**inputs)
+ ```
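The snippets above return one hidden vector per token. A common way to turn these into a single sentence vector is masked mean pooling: average the token vectors while skipping padding positions. The README does not say how the authors pooled the outputs, so this is only an illustrative sketch. It uses plain Python lists as stand-ins for the model's last hidden state and the tokenizer's attention mask, and the function name `masked_mean_pool` is hypothetical.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1.

    token_vectors: list of per-token vectors (stand-in for the last hidden state)
    attention_mask: list of 0/1 flags, one per token (0 marks padding)
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy stand-in: 4 token positions, hidden size 3; the last position is padding
hidden = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_vector = masked_mean_pool(hidden, mask)
print(sentence_vector)
```

With real model outputs the same computation would normally be done with tensor operations in the chosen framework rather than Python lists.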


  ## Training data

+ The model was trained on the following compilation of corpora. The statistics below were computed after the cleaning process.

+ Corpus | Words | Sentences | Size (GB)
+ -------|---------|----------|----------
+ Oscar | 1.78B | 87M | 10.8
+ RoTex | 240M | 14M | 1.5
+ RoWiki | 50M | 2M | 0.3
+ Total | 2.07B | 103M | 12.6



  ## Eval results

+ ### Sentiment analysis
+
+ We report the macro-averaged F1 score (in %).
+
+ Model | Dev | Test
+ -------|---------|----------
+ multilingual-BERT | 68.96 | 69.57
+ XLM-R-base | 71.26 | 71.71
+ [BERT-base-ro](https://huggingface.co/dumitrescustefan/bert-base-romanian-uncased-v1) | 70.49 | 71.02
+ RoBERT-small | 66.32 | 66.37
+ RoBERT-base | 70.89 | 71.61
+ RoBERT-large | 72.48 | 72.11
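For readers unfamiliar with the metric, macro-averaged F1 computes an F1 score per class and then takes the unweighted mean, so rare classes weigh as much as frequent ones. The sketch below is a minimal pure-Python version, not the evaluation script behind the table. The label names `"pos"` and `"neg"` are hypothetical, since the README does not list the task's label set.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 in percent: unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(scores)


# Hypothetical labels and predictions, for illustration only
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]

score = macro_f1(gold, pred)
print(f"Macro F1: {score:.2f}%")
```

In this toy case the per-class F1 scores are 4/5 for `"neg"` and 2/3 for `"pos"`, and the macro average is their plain mean.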
+
+ ### Moldavian vs. Romanian Dialect and Cross-dialect Topic identification
+
+ We report results on the [VarDial 2019](https://sites.google.com/view/vardial2019/campaign) Moldavian vs. Romanian Cross-dialect Topic identification Challenge, as macro-averaged F1 scores (in %).
+
+ Model | Dialect Classification | MD to RO | RO to MD
+ -------|---------|----------|----------
+ 2-CNN + SVM | 93.40 | 65.09 | 75.21
+ Char+Word SVM | 96.20 | 69.08 | 81.93
+ BiGRU | 93.30 | 70.10 | 80.30
+ multilingual-BERT | 95.34 | 68.76 | 78.24
+ XLM-R-base | 96.28 | 69.93 | 82.28
+ [BERT-base-ro](https://huggingface.co/dumitrescustefan/bert-base-romanian-uncased-v1) | 96.20 | 69.93 | 78.79
+ RoBERT-small | 95.67 | 69.01 | 80.40
+ RoBERT-base | 97.39 | 68.30 | 81.09
+ RoBERT-large | 97.78 | 69.91 | 83.65
+
+ ### Diacritics Restoration
+
+ The challenge can be found [here](https://diacritics-challenge.speed.pub.ro/).
+
+ We report results on the official test set, as accuracies in %.
+
+ Model | Word level | Char level
+ -------|---------|----------
+ BiLSTM | 99.42 | -
+ CharCNN | 98.40 | 99.65
+ CharCNN + multilingual-BERT | 99.72 | 99.94
+ CharCNN + XLM-R-base | 99.76 | 99.95
+ CharCNN + [BERT-base-ro](https://huggingface.co/dumitrescustefan/bert-base-romanian-uncased-v1) | 99.79 | 99.95
+ CharCNN + RoBERT-small | 99.73 | 99.94
+ CharCNN + RoBERT-base | 99.78 | 99.95
+ CharCNN + RoBERT-large | 99.76 | 99.95
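To make the two columns concrete, the sketch below computes word-level and character-level accuracy for one restored sentence. It assumes the simplest possible definitions: a word counts as correct only if it matches the reference exactly, and every character position (spaces included) counts toward the character score. The official challenge scorer may differ, for example by counting only letters that can carry diacritics, so treat this as an illustration rather than the official metric. The function name `restoration_accuracy` is hypothetical.

```python
import unicodedata


def restoration_accuracy(gold, predicted):
    """Return (word_accuracy, char_accuracy) in percent for one text pair."""
    # Normalize so that precomposed and combining diacritics compare equal
    gold = unicodedata.normalize("NFC", gold)
    predicted = unicodedata.normalize("NFC", predicted)
    if not gold:
        raise ValueError("gold text is empty")
    if len(gold) != len(predicted):
        raise ValueError("texts must have the same number of characters")

    gold_words, pred_words = gold.split(), predicted.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("texts must have the same number of words")

    word_hits = sum(g == p for g, p in zip(gold_words, pred_words))
    char_hits = sum(g == p for g, p in zip(gold, predicted))
    return (100.0 * word_hits / len(gold_words),
            100.0 * char_hits / len(gold))


# One missed diacritic: "fata" should have been restored to "fată"
word_acc, char_acc = restoration_accuracy("fată în pădure", "fata în pădure")
print(f"word level: {word_acc:.2f}%, char level: {char_acc:.2f}%")
```

A single wrong character costs a whole word at the word level but only one position at the character level, which is why character-level accuracies in the table sit closer to 100 than the word-level ones.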
+

  ### BibTeX entry and citation info