---
language:
- el
pipeline_tag: fill-mask
widget:
- text: Σήμερα εν μια [MASK] μέρα.
---
# CypriotBERT

A Cypriot Greek version of the BERT pre-trained language model.
## Pre-training corpora
The pre-training corpus for `bert-base-cypriot-uncased-v1` consists of 133 documents sourced from Cypriot TV scripts and writings by Cypriot authors.
## Pre-training details
- We trained BERT using a training framework we established ourselves.
- We released a model similar to the English `bert-base-uncased` model, but with 6 layers instead of 12 (6-layer, 768-hidden, 12-heads).
- We kept the default hyperparameter values rather than replicating the original BERT set-up of 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4, since our priority was establishing a framework that will let us train our models even better in the future.
- We were able to use a Tesla V100-SXM2-32GB. Huge thanks to both MantisNLP and The Cyprus Institute for supporting me!
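For reference, the architecture described above can be summarised as a plain configuration sketch. This is illustrative only (the field names mirror Hugging Face `BertConfig` conventions, and the derived per-head dimension is not stated in the card):

```python
# Illustrative configuration for the described architecture
# (6-layer, 768-hidden, 12-heads); field names follow BertConfig conventions.
config = {
    "num_hidden_layers": 6,    # half of bert-base-uncased's 12 layers
    "hidden_size": 768,
    "num_attention_heads": 12,
}

# The hidden size must divide evenly across the attention heads,
# giving the per-head dimension:
head_dim = config["hidden_size"] // config["num_attention_heads"]
print(head_dim)  # 64
```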
## Requirements
We published `bert-base-cypriot-uncased-v1` as part of Hugging Face's Transformers repository, so you need to install the `transformers` library through pip, along with PyTorch.
```sh
pip install transformers[torch]
```
## Pre-process text (Deaccent - Lower)
**NOTICE:** Preprocessing is now natively supported by the default tokenizer, so there is no need to include the following code.

In order to use `bert-base-cypriot-uncased-v1`, you have to pre-process texts by lowercasing them and removing all diacritics.
```python
import unicodedata

def strip_accents_and_lowercase(s):
    # NFD-decompose, drop combining marks (category Mn), then lowercase
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Κυπριακή έκδοση του BERT."  # EN: "This is the Cypriot version of BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)
print(unaccented_string)  # αυτη ειναι η κυπριακη εκδοση του bert.
```
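One caveat worth noting: lowercasing also affects special tokens, so `[MASK]` must be inserted *after* preprocessing, or it becomes the unrecognized token `[mask]`. A minimal illustration, reusing the function above on the widget sentence:

```python
import unicodedata

def strip_accents_and_lowercase(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

# Preprocess the raw text first, then splice in the [MASK] token,
# so the special token survives intact.
sentence = (strip_accents_and_lowercase("Σήμερα εν μια")
            + " [MASK] "
            + strip_accents_and_lowercase("μέρα."))
print(sentence)  # σημερα εν μια [MASK] μερα.

# Running [MASK] itself through the function would corrupt it:
print(strip_accents_and_lowercase("[MASK]"))  # [mask]
```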
## Load Pretrained Model
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("petros/bert-base-cypriot-uncased-v1")
model = AutoModel.from_pretrained("petros/bert-base-cypriot-uncased-v1")
```
## Use Pretrained Model as a Language Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer_cypriot = AutoTokenizer.from_pretrained('petros/bert-base-cypriot-uncased-v1')
lm_model_cypriot = AutoModelForMaskedLM.from_pretrained('petros/bert-base-cypriot-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'Θώρει τη [MASK].'
# EN: 'Sees the [MASK].'
input_ids = tokenizer_cypriot.encode(text_1)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids))
outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]
# Locate the [MASK] position from the ids instead of hard-coding it
mask_index = input_ids.index(tokenizer_cypriot.mask_token_id)
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, mask_index].max(0)[1].item()))

# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_cypriot.encode(text_2)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"

# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and frequently does [MASK].'
input_ids = tokenizer_cypriot.encode(text_3)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"
```
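Examples 2 and 3 above hard-code the `[MASK]` positions (3 and 8), which silently breaks if the text or tokenization changes. A small helper (hypothetical, not part of the model card) can locate every mask position from the encoded ids instead; here it is exercised on a plain list of ids standing in for tokenizer output, with 103 as an assumed mask-token id:

```python
def mask_indices(input_ids, mask_token_id):
    """Return the position of every mask token in an encoded sequence."""
    return [i for i, token_id in enumerate(input_ids)
            if token_id == mask_token_id]

# Stand-in ids mimicking EXAMPLE 3's two masked positions (103 = mask id).
ids = [101, 2054, 2003, 103, 2143, 1998, 2515, 2788, 103, 1012, 102]
print(mask_indices(ids, 103))  # [3, 8]
```

With a real tokenizer, the same call would be `mask_indices(input_ids, tokenizer_cypriot.mask_token_id)`.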
## About Me

Petros Andreou

- GitHub: @pedroandreou