---
language:
  - el
---

# PaloBERT

A Greek pre-trained language model based on RoBERTa.

## Pre-training data

The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook, and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included.

The corpus was provided by Palo LTD.

## Requirements

```
pip install transformers
pip install torch
```

## Pre-processing details

In order to use `palobert-base-greek-social-media`, the text needs to be pre-processed as follows:

- remove all Greek diacritics
- convert to lowercase
- remove all punctuation
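A minimal sketch of these pre-processing steps in Python standard library code (the authors' exact pipeline is not specified in this card; this version strips diacritics via Unicode NFD decomposition and removes only ASCII punctuation, which you may need to extend for your data):

```python
import string
import unicodedata

def preprocess(text: str) -> str:
    """Strip Greek diacritics, lowercase, and remove punctuation."""
    # Decompose accented characters so diacritics become separate
    # combining marks, then drop the marks (Unicode category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    no_diacritics = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    lowered = no_diacritics.lower()
    # Remove ASCII punctuation; extend this set if your corpus
    # contains other punctuation characters.
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(preprocess("Καλημέρα, κόσμε!"))  # καλημερα κοσμε
```

The cleaned string can then be passed to the tokenizer as usual.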

## Evaluation on MLM and Sentiment Analysis tasks

For detailed results, refer to the thesis 'Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών' ('Text sentiment analysis in Greek using Transformer Networks', version p2).

## Authors

Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos

## Citation info

http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623