---
language:
- el
---
PaloBERT
A Greek pre-trained language model based on RoBERTa.
Pre-training data
The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included.
The corpus has been provided by Palo LTD.
Requirements
pip install transformers
pip install torch
Pre-processing details
In order to use 'palobert-base-greek-social-media', the text needs to be pre-processed as follows:
- remove all Greek diacritics
- convert to lowercase
- remove all punctuation
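The three steps above can be sketched with Python's standard library; the function name `preprocess` and the exact regex rules are illustrative (the card itself does not ship a reference implementation):

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Apply the pre-processing the model card describes (a minimal sketch)."""
    # 1. Remove Greek diacritics: decompose (NFD), then drop combining marks
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # 2. Convert to lowercase
    text = text.lower()
    # 3. Remove all punctuation (keep letters, digits and whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse leftover runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Καλημέρα, Ελλάδα!"))  # καλημερα ελλαδα
```

Decomposing to NFD first makes diacritic removal a simple filter on combining marks, independent of which accented Greek letters appear in the input.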
Evaluation on MLM and Sentiment Analysis tasks
For detailed results, refer to the thesis 'Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών' ('Sentiment analysis of Greek text using Transformer Networks'), version p2.
Authors
Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos
Citation info
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623