---
language:
- el
widget:
- text: "μεσα δικτυωσης"
---

# PaloBERT

A Greek pre-trained language model based on [RoBERTa](https://arxiv.org/abs/1907.11692).

## Pre-training data

The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included.

The corpus has been provided by [Palo LTD](http://www.paloservices.com/).

## Requirements

```
pip install transformers
pip install torch
```

## Pre-processing details

To use 'palobert-base-greek-social-media', the input text needs to be pre-processed as follows:

* remove all Greek diacritics
* convert to lowercase
* remove all punctuation

```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # Convert to lowercase
    text = text.lower()
    # Decompose accented letters (NFD), then drop the combining acute accent
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    # Replace every character that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text
```

## Load Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media")
model = AutoModelForMaskedLM.from_pretrained("pchatz/palobert-base-greek-social-media")
```

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill(f'μεσα {fill.tokenizer.mask_token} δικτυωσης')

[{'score': 0.8760559558868408, 'token': 12853, 'token_str': ' κοινωνικης', 'sequence': 'μεσα κοινωνικης δικτυωσης'},
 {'score': 0.020922638475894928, 'token': 1104, 'token_str': ' μεσα', 'sequence': 'μεσα μεσα δικτυωσης'},
 {'score': 0.017568595707416534, 'token': 337, 'token_str': ' της', 'sequence': 'μεσα της δικτυωσης'},
 {'score': 0.006678201723843813, 'token': 1258, 'token_str': 'τικης', 'sequence': 'μεσατικης δικτυωσης'},
 {'score': 0.004737381357699633, 'token': 16245, 'token_str': 'τερης', 'sequence': 'μεσατερης δικτυωσης'}]
```

## Evaluation on MLM and Sentiment Analysis tasks

For detailed results, refer to the thesis ['Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών'](http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623) ("Sentiment analysis of Greek text using Transformer networks") (version - p2).

## Authors

Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos

## Citation info

http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623
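
## Pre-processing example

The pre-processing steps listed under "Pre-processing details" can be tried on raw text without downloading the model. The sketch below repeats the `preprocess` function from this card and adds a hypothetical helper, `build_masked_input`, which is not part of the model or of `transformers`. It assumes the mask token is `<mask>`, the usual RoBERTa default; in practice, read it from `tokenizer.mask_token`. The helper pre-processes the text on each side of the mask separately, because the punctuation step would otherwise strip the angle brackets from the mask token itself.

```python
import re
import unicodedata


def preprocess(text, default_replace=""):
    # Same steps as the function in this model card:
    # lowercase, drop the combining acute accent, replace punctuation.
    text = text.lower()
    text = unicodedata.normalize("NFD", text).translate(
        {ord("\N{COMBINING ACUTE ACCENT}"): None}
    )
    text = re.sub(r"[^\w\s]", default_replace, text)
    return text


def build_masked_input(left, right, mask_token="<mask>"):
    # Hypothetical helper: "<mask>" is an assumed default, prefer tokenizer.mask_token.
    # Each side is cleaned on its own so the mask token stays intact.
    return f"{preprocess(left)} {mask_token} {preprocess(right)}"


raw = "Μέσα Κοινωνικής Δικτύωσης!"
print(preprocess(raw))                          # μεσα κοινωνικης δικτυωσης
print(build_masked_input("Μέσα", "Δικτύωσης"))  # μεσα <mask> δικτυωσης
print(preprocess("<mask>"))                     # mask (the brackets are removed)
```

The string returned by `build_masked_input` can then be passed to the `fill-mask` pipeline shown in the "Load Model" section.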