---
language:
  - el
pipeline_tag: fill-mask
widget:
  - text: Σήμερα εν μια [MASK] μέρα.
---

# CypriotBERT

A Cypriot version of the BERT pre-trained language model.

## Pre-training corpora

The `bert-base-cypriot-uncased-v1` pre-training corpora consist of 133 documents sourced from Cypriot TV scripts and writings by Cypriot authors.

## Pre-training details

- We trained BERT using our own established framework.
- We released a model similar to the English `bert-base-uncased` model, but with 6 layers instead of 12 (6-layer, 768-hidden, 12-heads).
- We chose to keep the default parameter values rather than reproducing the original training set-up (1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4), as we put more emphasis on establishing a framework that will help us train our models even better in the future.
- We were able to use a Tesla V100-SXM2-32GB. Huge thanks to both MantisNLP and The Cyprus Institute for supporting me!
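For the architecture at a glance, here is a sketch of the hyper-parameters implied by the list above. Only the layer, hidden-size, and head counts come from this card; the remaining values are assumptions based on `bert-base-uncased` defaults.

```python
# Architecture stated in the card: 6 layers, 768 hidden size, 12 heads.
# The remaining values are *assumed* to match bert-base-uncased defaults.
cypriot_bert_config = {
    "num_hidden_layers": 6,          # half of bert-base-uncased's 12
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,       # assumed default (4 * hidden size)
    "max_position_embeddings": 512,  # assumed default
}

# Each attention head then works on 768 / 12 = 64 dimensions.
head_dim = cypriot_bert_config["hidden_size"] // cypriot_bert_config["num_attention_heads"]
print(head_dim)  # 64
```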

## Requirements

We published `bert-base-cypriot-uncased-v1` as part of Hugging Face's Transformers repository, so you need to install the `transformers` library through pip, along with PyTorch:

```bash
pip install transformers[torch]
```

## Pre-process text (Deaccent - Lower)

**NOTICE:** Preprocessing is now natively supported by the default tokenizer. There is no need to include the following code.

In order to use `bert-base-cypriot-uncased-v1`, you have to pre-process texts by lowercasing them and removing all Cypriot diacritics.


```python
import unicodedata

def strip_accents_and_lowercase(s):
    # NFD-decompose, drop combining marks (category Mn), then lowercase
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string)  # αυτη ειναι η ελληνικη εκδοση του bert.
```
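The same helper applies to Cypriot text such as the widget sentence at the top of this card. One caveat (our observation, not from the original card): `.lower()` would also turn a literal `[MASK]` placeholder into `[mask]`, so splice the mask token in after pre-processing.

```python
import unicodedata

def strip_accents_and_lowercase(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

# Pre-process the raw Cypriot sentence first...
clean = strip_accents_and_lowercase("Σήμερα εν μια καλή μέρα.")
print(clean)  # σημερα εν μια καλη μερα.

# ...then splice in the [MASK] token, so it keeps its exact casing.
masked = clean.replace("καλη", "[MASK]")
print(masked)  # σημερα εν μια [MASK] μερα.
```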

## Load Pretrained Model

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("petros/bert-base-cypriot-uncased-v1")
model = AutoModel.from_pretrained("petros/bert-base-cypriot-uncased-v1")
```

## Use Pretrained Model as a Language Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer_cypriot = AutoTokenizer.from_pretrained('petros/bert-base-cypriot-uncased-v1')
lm_model_cypriot = AutoModelForMaskedLM.from_pretrained('petros/bert-base-cypriot-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'Θώρει τη [MASK].'
# EN: 'Sees the [MASK].'
input_ids = tokenizer_cypriot.encode(text_1)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids))
outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]
# Locate the [MASK] position instead of hard-coding its index
mask_index = input_ids.index(tokenizer_cypriot.mask_token_id)
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, mask_index].argmax().item()))

# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_cypriot.encode(text_2)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids))
outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]
mask_index = input_ids.index(tokenizer_cypriot.mask_token_id)
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, mask_index].argmax().item()))

# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_cypriot.encode(text_3)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids))
outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]
# Predict every [MASK] in the sentence
for i, token_id in enumerate(input_ids):
    if token_id == tokenizer_cypriot.mask_token_id:
        print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, i].argmax().item()))
```
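Under the hood, picking the most plausible token for a masked position is just an argmax over that position's vocabulary logits. A minimal pure-Python sketch with hypothetical toy numbers (no model download needed; the real logits come from the model outputs above):

```python
# Toy logits for one sentence: 4 positions x vocab of 5 token ids.
logits = [
    [0.1, 0.2, 0.3, 0.1, 0.0],
    [0.0, 0.9, 0.1, 0.0, 0.0],
    [0.2, 0.1, 0.6, 0.1, 0.0],  # suppose position 2 is the [MASK]
    [0.5, 0.1, 0.1, 0.2, 0.1],
]

def best_token_id(position_scores):
    # Index of the highest score = predicted vocabulary id
    return max(range(len(position_scores)), key=position_scores.__getitem__)

print(best_token_id(logits[2]))  # → 2
```

The predicted id is then mapped back to a token string with the tokenizer's `convert_ids_to_tokens`.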

## About Me

Petros Andreou

Github: @pedroandreou