
CypriotBERT

A Cypriot version of the BERT pre-trained language model.

Pre-training corpora

The bert-base-cypriot-uncased-v1 pre-training corpus consists of 133 documents sourced from Cypriot TV scripts and writings by Cypriot authors (7 MB of data in total).

Pre-training details

  • We trained BERT using our own established framework
  • We released a model similar to the English bert-base-uncased model but with 6 layers instead of 12 (6-layer, 768-hidden, 12-heads)
  • Total trainable params: 67M
  • We chose to keep the default parameter values (rather than reproducing the original BERT set-up of 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4), as we placed more emphasis on establishing a framework that will help us train our models even better in the future.
  • We trained the model on a single Tesla V100-SXM2-32GB GPU for 4 hours. Huge thanks to both MantisNLP and The Cyprus Institute for supporting me!
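The 67M figure above can be sanity-checked with a back-of-the-envelope calculation from the architecture (6 layers, 768 hidden units, 3072-dim feed-forward). The vocabulary size of 30522 below is an assumption carried over from English bert-base-uncased; the actual Cypriot vocabulary may differ slightly:

```python
# Rough parameter count for a 6-layer, 768-hidden BERT encoder.
hidden, layers, intermediate, vocab, max_pos = 768, 6, 3072, 30522, 512

# Token, position, and token-type embeddings, plus the embedding LayerNorm
embeddings = (vocab + max_pos + 2) * hidden + 2 * hidden

per_layer = (
    4 * (hidden * hidden + hidden)           # Q, K, V and attention output projections
    + 2 * (2 * hidden)                       # two LayerNorms
    + hidden * intermediate + intermediate   # feed-forward up-projection
    + intermediate * hidden + hidden         # feed-forward down-projection
)
pooler = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")  # ~67.0M
```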

Requirements

We published bert-base-cypriot-uncased-v1 as part of Hugging Face's Transformers repository, so you need to install the transformers library along with PyTorch via pip:

pip install transformers[torch]
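If you are unsure whether the installation succeeded, a quick stdlib-only check (a minimal sketch, not part of the original card) is:

```python
import importlib.util

# Report which of the two required packages, if any, are missing.
required = ["transformers", "torch"]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("transformers and torch are both installed.")
```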

Pre-process text (Deaccent - Lower)

NOTICE: Preprocessing is now handled natively by the default tokenizer, so the code below is no longer required; it is kept for reference.

If you pre-process texts manually for bert-base-cypriot-uncased-v1, you have to lowercase them and remove all Cypriot diacritics.

import unicodedata

def strip_accents_and_lowercase(s):
    # Decompose characters (NFD), drop combining marks (category 'Mn'), then lowercase.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Τούτη εν η Κυπριακή έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string)  # τουτη εν η κυπριακη εκδοση του bert.

Load Pretrained Model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("petros/bert-base-cypriot-uncased-v1")
model = AutoModel.from_pretrained("petros/bert-base-cypriot-uncased-v1")

Use Pretrained Model as a Language Model

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer_cypriot = AutoTokenizer.from_pretrained('petros/bert-base-cypriot-uncased-v1')
lm_model_cypriot = AutoModelForMaskedLM.from_pretrained('petros/bert-base-cypriot-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'Τι [MASK] ρε'
input_ids = tokenizer_cypriot.encode(text_1)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids)) # ['[CLS]', 'τι', '[MASK]', 'ρε', '[SEP]']

outputs = lm_model_cypriot(torch.tensor([input_ids])).logits
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, 2].max(0)[1].item())) # ειδους

# ================ EXAMPLE 2 ================
text_2 = 'Eίσαι μια [MASK].'
input_ids = tokenizer_cypriot.encode(text_2)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids)) #['[CLS]', 'eισ', '##αι', 'μια', '[MASK]', '.', '[SEP]']

outputs = lm_model_cypriot(torch.tensor([input_ids])).logits
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, 4].max(0)[1].item())) # χαρα
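The examples above keep only the single highest-scoring token at the mask position. A common extension is to inspect the top-k candidates with `torch.topk`. The sketch below uses random stand-in logits so it runs without downloading the model; the vocabulary size of 30522 is an assumption borrowed from the English base model:

```python
import torch

def top_k_mask_predictions(logits, mask_index, k=5):
    """Return the k highest-scoring vocabulary ids at the masked position.

    logits: tensor of shape (batch, seq_len, vocab_size), e.g. the
    .logits of a masked-LM forward pass.
    """
    scores = logits[0, mask_index]   # (vocab_size,)
    top = torch.topk(scores, k)      # highest k values and their indices
    return top.indices.tolist()

# Stand-in logits: batch of 1, sequence of 5 tokens, vocabulary of 30522
dummy_logits = torch.randn(1, 5, 30522)
candidate_ids = top_k_mask_predictions(dummy_logits, mask_index=2, k=5)
print(candidate_ids)  # five vocabulary ids; decode with tokenizer.convert_ids_to_tokens
```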

About Me

Petros Andreou

GitHub: @pedroandreou
