Update README.md
README.md
CHANGED
@@ -28,23 +28,26 @@ Future release will also include:

## Pre-training details

* We trained BERT using the official code provided in Google BERT's GitHub repository (https://github.com/google-research/bert)\*. We then used [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) conversion script to convert the TF checkpoint and vocabulary into the desired format, so that the model can be loaded in two lines of code by both PyTorch and TF2 users (a conversion sketch is given below).
* We released a model similar to the English `bert-base-uncased` model (12-layer, 768-hidden, 12-heads, 110M parameters).
* We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 with an initial learning rate 1e-4.
* We were able to use a single Google Cloud TPU v3-8 provided for free from [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc), while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!

\* You can still have access to the original TensorFlow checkpoints from this [Google Drive folder](https://drive.google.com/drive/folders/1ZjlaE4nvdtgqXiVBTVHCF5I9Ff8ZmztE?usp=sharing).
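
The conversion step mentioned above can also be reproduced with the Transformers library itself. The snippet below is only a minimal sketch: the `greek_bert_tf/` paths are placeholders for wherever you unpack the original TF checkpoint from the Google Drive folder, and it uses `from_pretrained(..., from_tf=True)` as a stand-in for the standalone conversion script.

```python
# Minimal sketch: convert the original TF 1.x checkpoint into the Transformers format.
# All paths are placeholders; point them at the files from the Google Drive folder above.
# Loading a TF checkpoint this way requires TensorFlow to be installed as well.
from transformers import BertConfig, BertForPreTraining, BertTokenizer

config = BertConfig.from_json_file("greek_bert_tf/bert_config.json")  # original BERT config
model = BertForPreTraining.from_pretrained(
    "greek_bert_tf/bert_model.ckpt.index",  # TF 1.x checkpoint index file
    from_tf=True,
    config=config,
)
model.save_pretrained("bert-base-greek-uncased-v1")  # writes pytorch_model.bin + config.json

tokenizer = BertTokenizer("greek_bert_tf/vocab.txt", do_lower_case=True)  # original WordPiece vocabulary
tokenizer.save_pretrained("bert-base-greek-uncased-v1")                   # writes vocab.txt + tokenizer config
```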

## Requirements

We published `bert-base-greek-uncased-v1` as part of [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) repository, so you need to install the `transformers` library through pip, along with PyTorch or TensorFlow 2.

```
pip install transformers
pip install (torch|tensorflow)
```
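
With those packages in place, the released model can be loaded directly from the Hugging Face model hub. The two-liner below is a sketch; it assumes the model is published under the `nlpaueb` organization, so adjust the identifier if it is hosted elsewhere.

```python
# Minimal sketch of loading the released model; the "nlpaueb/" namespace is an assumption.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
```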

## Pre-process text (Deaccent - Lower)

**NOTICE:** Preprocessing is now natively supported by the default tokenizer. No need to include the following code.

In order to use `bert-base-greek-uncased-v1`, you have to pre-process texts to lowercase letters and remove all Greek diacritics.

```python

@@ -114,13 +117,61 @@ print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))

## Evaluation on downstream tasks

For detailed results read the article:

GREEK-BERT: The Greeks visiting Sesame Street. John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis and Ion Androutsopoulos. In the Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020). Held Online. 2020. (https://arxiv.org/abs/2008.12014)

### Named Entity Recognition with Greek NER dataset

| Model name                            | Micro F1        |
| ------------------------------------- | --------------- |
| BILSTM-CNN-CRF (Ma and Hovy, 2016)    | 76.4 ± 2.07     |
| M-BERT-UNCASED (Devlin et al., 2019)  | 81.5 ± 1.77     |
| M-BERT-CASED (Devlin et al., 2019)    | 82.1 ± 1.35     |
| XLM-R (Conneau et al., 2020)          | 84.8 ± 1.50     |
| GREEK-BERT (ours)                     | **85.7 ± 1.00** |

### Natural Language Inference with XNLI

| Model name                            | Accuracy        |
| ------------------------------------- | --------------- |
| DAM (Parikh et al., 2016)             | 68.5 ± 1.71     |
| M-BERT-UNCASED (Devlin et al., 2019)  | 73.9 ± 0.64     |
| M-BERT-CASED (Devlin et al., 2019)    | 73.5 ± 0.49     |
| XLM-R (Conneau et al., 2020)          | 77.3 ± 0.41     |
| GREEK-BERT (ours)                     | **78.6 ± 0.62** |

## Author

The model has been officially released with the article "GREEK-BERT: The Greeks visiting Sesame Street. John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis and Ion Androutsopoulos. In the Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020). Held Online. 2020" (https://arxiv.org/abs/2008.12014).

If you use the model, please cite the following:

```
@inproceedings{greek-bert,
  author = {Koutsikakis, John and Chalkidis, Ilias and Malakasiotis, Prodromos and Androutsopoulos, Ion},
  title = {GREEK-BERT: The Greeks Visiting Sesame Street},
  year = {2020},
  isbn = {9781450388788},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3411408.3411440},
  booktitle = {11th Hellenic Conference on Artificial Intelligence},
  pages = {110–117},
  numpages = {8},
  location = {Athens, Greece},
  series = {SETN 2020}
}
```

Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)

| GitHub: [@ilias.chalkidis](https://github.com/iliaschalkidis) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |

## About Us