julien-c HF staff committed on
Commit
922c0c7
1 Parent(s): 23e9baf

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md

Files changed (1)
  1. README.md +134 -0
README.md ADDED
---
language: el
thumbnail: https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png
---

# GreekBERT

A Greek version of the BERT pre-trained language model.

<img src="https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png" width="600"/>

## Pre-training corpora

The pre-training corpora of `bert-base-greek-uncased-v1` include:

* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων),
* The Greek part of the [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/), and
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).

Future releases will also include:

* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr),
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).

## Pre-training details

* We trained BERT using the official code provided in [Google BERT's GitHub repository](https://github.com/google-research/bert). We then used [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) conversion script to convert the TF checkpoint and vocabulary into the desired format, so that the model can be loaded in two lines of code by both PyTorch and TF2 users.
* We released a model similar to the English `bert-base-uncased` model (12-layer, 768-hidden, 12-heads, 110M parameters); see the quick check after this list.
* We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
* We were able to use a single Google Cloud TPU v3-8, provided for free by the [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) program, while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!

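As a quick sanity check of the setup quoted above, the released checkpoint can be inspected with Transformers (a minimal sketch; the exact parameter count depends on the Greek vocabulary size):

```python
from transformers import AutoConfig, AutoModel

# The released configuration should match a BERT-base setup.
config = AutoConfig.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# expected: 12 768 12

# Roughly 110M parameters for a BERT-base sized model.
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
print(sum(p.numel() for p in model.parameters()))
```
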
## Requirements

We published `bert-base-greek-uncased-v1` as part of [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) repository, so you need to install the transformers library through pip, along with PyTorch or TensorFlow 2.

```
pip install transformers
pip install (torch|tensorflow)
```

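Once installed, a quick import check confirms that the environment is ready (a minimal sketch, assuming the PyTorch option above):

```python
import torch
import transformers

# Print the installed versions to confirm the setup.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
```
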
## Pre-process text (Deaccent - Lower)

In order to use `bert-base-greek-uncased-v1`, you have to pre-process texts by lowercasing letters and removing all Greek diacritics.

```python
import unicodedata

def strip_accents_and_lowercase(s):
    # Decompose characters (NFD), drop combining marks (Mn), then lowercase.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string)  # αυτη ειναι η ελληνικη εκδοση του bert.
```

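The pre-processed text can then be fed to the model's tokenizer (a minimal sketch; the comment describes the expected behaviour rather than output taken from the original card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

# Tokenize the deaccented, lowercased sentence produced above.
print(tokenizer.tokenize("αυτη ειναι η ελληνικη εκδοση του bert."))
# Properly pre-processed words should mostly map to single vocabulary
# tokens rather than being split into many sub-word pieces.
```
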
## Load Pretrained Model

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
```

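The loaded `tokenizer` and `model` can then be used to extract contextual embeddings (a minimal sketch, assuming the PyTorch backend and a recent Transformers version that supports calling the tokenizer directly):

```python
import torch

# Encode a pre-processed sentence and run it through the model.
inputs = tokenizer("αυτη ειναι η ελληνικη εκδοση του bert.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The first output holds one 768-dimensional vector per input token.
print(outputs[0].shape)  # torch.Size([1, sequence_length, 768])
```
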
## Use Pretrained Model as a Language Model

```python
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

# Load model and tokenizer
tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
lm_model_greek = AutoModelWithLMHead.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'O ποιητής έγραψε ένα [MASK] .'
# EN: 'The poet wrote a [MASK].'
input_ids = tokenizer_greek.encode(text_1)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
# the most plausible prediction for [MASK] is "song"

# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_greek.encode(text_2)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"

# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_greek.encode(text_3)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"
```

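The same masked-word predictions can also be obtained with the Transformers `fill-mask` pipeline (a minimal sketch; the exact scores and top predictions may vary with the library version):

```python
from transformers import pipeline

# The fill-mask pipeline wraps the tokenizer and the masked-LM head.
fill_mask = pipeline("fill-mask", model="nlpaueb/bert-base-greek-uncased-v1")

for prediction in fill_mask('O ποιητής έγραψε ένα [MASK].'):
    print(prediction["sequence"], prediction["score"])
```
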
## Evaluation on downstream tasks

TBA

## Author

Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)

| GitHub: [@ilias.chalkidis](https://github.com/seolhokim) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |

## About Us

[AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include:

* question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
* natural language generation from databases and ontologies, especially Semantic Web ontologies,
* text classification, including filtering spam and abusive content,
* information extraction and opinion mining, including legal text analytics and sentiment analysis,
* natural language processing tools for Greek, for example parsers and named-entity recognizers,
* machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.