CypriotBERT
A Cypriot version of the BERT pre-trained language model.
Pre-training corpora
The bert-base-cypriot-uncased-v1 pre-training corpus consists of 133 documents sourced from Cypriot TV scripts and writings by Cypriot authors (7 MB, or about 0.007 GB, of data in total).
Pre-training details
- We trained BERT using our own established framework
- We released a model similar to the English bert-base-uncased model but with 6 layers instead of 12 (6-layer, 768-hidden, 12-heads)
- Total trainable params: 67M
- We chose to follow the default parameter values (rather than the training set-up of 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4), as we placed more emphasis on establishing a framework that will help us train better models in the future.
- We were able to use a Tesla V100-SXM2-32GB and train our model for 4 hours. Huge thanks to both MantisNLP and The Cyprus Institute for supporting me!
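As a rough sanity check on the figures above, the parameter count of a 6-layer, 768-hidden BERT encoder can be worked out by hand. The sketch below is only an illustration: the vocabulary size (30522, the value used by the English bert-base-uncased), the feed-forward size (3072), the 512 positions and the 2 token types are all assumptions, since none of them are stated here. The helper name `bert_param_count` is hypothetical.

```python
def bert_param_count(vocab_size, hidden=768, layers=6, intermediate=3072,
                     max_positions=512, type_vocab=2):
    """Count trainable parameters of a BERT encoder with a pooler (no LM head)."""
    layer_norm = 2 * hidden  # scale + bias
    # Word, position and token-type embeddings, followed by one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), then LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# ASSUMPTION: vocabulary size borrowed from the English bert-base-uncased
ASSUMED_VOCAB_SIZE = 30522

total = bert_param_count(ASSUMED_VOCAB_SIZE)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")  # 66,955,008 parameters (~67M)
```

Under these assumptions the count lands close to the reported 67M. The number of attention heads does not affect the total, because the heads split the same 768-wide projections.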
Requirements
We published bert-base-cypriot-uncased-v1 as part of Hugging Face's Transformers repository. To use it, install the transformers library through pip along with PyTorch:
pip install transformers[torch]
Pre-process text (Deaccent - Lower)
NOTICE: Preprocessing is now natively supported by the default tokenizer. No need to include the following code.
bert-base-cypriot-uncased-v1 expects lowercased text with all diacritics removed. Doing this by hand looks like the following:
import unicodedata
def strip_accents_and_lowercase(s):
    # Decompose characters (NFD), drop combining marks (category 'Mn'), then lowercase
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()
accented_string = "Τούτη εν η Κυπριακή έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)
print(unaccented_string) # τουτη εν η κυπριακη εκδοση του bert.
Load Pretrained Model
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("petros/bert-base-cypriot-uncased-v1")
model = AutoModel.from_pretrained("petros/bert-base-cypriot-uncased-v1")
Use Pretrained Model as a Language Model
import torch
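from transformers import AutoTokenizer, AutoModelForMaskedLM
# Load model and tokenizer
tokenizer_cypriot = AutoTokenizer.from_pretrained('petros/bert-base-cypriot-uncased-v1')
# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for BERT-style models
lm_model_cypriot = AutoModelForMaskedLM.from_pretrained('petros/bert-base-cypriot-uncased-v1')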
# ================ EXAMPLE 1 ================
text_1 = 'Τι [MASK] ρε'
input_ids = tokenizer_cypriot.encode(text_1)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids)) # ['[CLS]', 'τι', '[MASK]', 'ρε', '[SEP]']
outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, 2].max(0)[1].item())) # ειδους
# ================ EXAMPLE 2 ================
text_2 = 'Eίσαι μια [MASK].'
input_ids = tokenizer_cypriot.encode(text_2)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids)) #['[CLS]', 'eισ', '##αι', 'μια', '[MASK]', '.', '[SEP]']
outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, 4].max(0)[1].item())) # χαρα
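The two examples above hard-code the position of [MASK] (index 2 and index 4). A minimal sketch of finding that position from the token list instead is shown below. It uses plain Python lists copied from the printed outputs above, so it runs without downloading the model; the helper name `mask_positions` is hypothetical.

```python
def mask_positions(tokens, mask_token="[MASK]"):
    """Return the indices of every mask token in a list of tokens."""
    return [i for i, tok in enumerate(tokens) if tok == mask_token]


# Token lists copied from the printed outputs of the two examples
tokens_1 = ['[CLS]', 'τι', '[MASK]', 'ρε', '[SEP]']
tokens_2 = ['[CLS]', 'eισ', '##αι', 'μια', '[MASK]', '.', '[SEP]']

print(mask_positions(tokens_1))  # [2]
print(mask_positions(tokens_2))  # [4]
```

With the real tokenizer, the same index can be obtained from the encoded ids, for example with input_ids.index(tokenizer_cypriot.mask_token_id), and then used in place of the literal 2 or 4 when indexing the model outputs.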
About Me
Petros Andreou
Github: @pedroandreou