Commit 2126782 by dumitrescustefan (parent: 9eb44d5): Update README.md

README.md CHANGED
@@ -47,12 +47,40 @@ The baseline is the [Multilingual BERT](https://github.com/google-research/bert/
 The model is trained on the following corpora (stats in the table below are after cleaning):
 
 | Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
-
+|-----------|:--------:|:--------:|:--------:|:--------:|
 | OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
 | OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
 | Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
 | **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |
 
+
+### Citation
+
+If you use this model in a research paper, I'd kindly ask you to cite the following paper:
+
+```
+Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
+```
+
+or, in bibtex:
+
+```
+@inproceedings{dumitrescu-etal-2020-birth,
+    title = "The birth of {R}omanian {BERT}",
+    author = "Dumitrescu, Stefan and
+      Avram, Andrei-Marius and
+      Pyysalo, Sampo",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
+    month = nov,
+    year = "2020",
+    address = "Online",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2020.findings-emnlp.387",
+    doi = "10.18653/v1/2020.findings-emnlp.387",
+    pages = "4324--4328",
+}
+```
+
 #### Acknowledgements
 
 - We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
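As a quick sanity check, the **Total** row of the corpus table can be reproduced from the three per-corpus rows. A minimal sketch (the figures below are copied verbatim from the table; only the summation is new):

```python
# Per-corpus stats from the README table (after cleaning).
corpora = {
    "OPUS":      {"lines_m": 55.05, "words_m": 635.04,  "chars_b": 4.045,  "size_gb": 3.8},
    "OSCAR":     {"lines_m": 33.56, "words_m": 1725.82, "chars_b": 11.411, "size_gb": 11.0},
    "Wikipedia": {"lines_m": 1.54,  "words_m": 60.47,   "chars_b": 0.411,  "size_gb": 0.4},
}

# Sum each column across corpora; round to the table's precision.
totals = {
    key: round(sum(stats[key] for stats in corpora.values()), 3)
    for key in ("lines_m", "words_m", "chars_b", "size_gb")
}
print(totals)
```

The computed totals (90.15 M lines, 2421.33 M words, 15.867 B chars, 15.2 GB) match the table's **Total** row.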