---
language:
- el
---

# PaloBERT

A Greek pre-trained language model based on [RoBERTa](https://arxiv.org/abs/1907.11692).

## Pre-training data

The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included.

The corpus was provided by [Palo LTD](http://www.paloservices.com/).

## Requirements

```
pip install transformers
pip install torch
```

## Pre-processing details

To use `palobert-base-greek-social-media`, input text must be pre-processed as follows:

* remove all Greek diacritics
* convert to lowercase
* remove all punctuation

```python
import re
import unicodedata


def preprocess(text, default_replace=""):
    # Convert to lowercase.
    text = text.lower()
    # Decompose characters (NFD) and drop the combining acute accent (Greek tonos).
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    # Replace everything that is neither a word character nor whitespace.
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text
```
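
As a quick check of the three steps, the sketch below applies the same `preprocess` function (repeated here so the snippet runs on its own) to two sample strings. The sample inputs are illustrative assumptions and are not taken from the training corpus.

```python
import re
import unicodedata


def preprocess(text, default_replace=""):
    # Same steps as the model card's function, repeated so this snippet is self-contained.
    text = text.lower()
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text


# Illustrative inputs (assumed examples, not drawn from the corpus).
samples = ["Καλημέρα, κόσμε!", "Μέσα Κοινωνικής Δικτύωσης..."]
cleaned = [preprocess(s) for s in samples]
print(cleaned)
# ['καλημερα κοσμε', 'μεσα κοινωνικης δικτυωσης']
```

With the default `default_replace=""`, punctuation is deleted outright; passing a space instead would keep word boundaries where punctuation is the only separator.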

## Load Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media")
model = AutoModelForMaskedLM.from_pretrained("pchatz/palobert-base-greek-social-media")
```

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

# Reuses the `model` and `tokenizer` objects loaded in the previous snippet.
fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill(f'μεσα {fill.tokenizer.mask_token} δικτυωσης')

# Output:
[{'score': 0.8760559558868408,
  'token': 12853,
  'token_str': ' κοινωνικης',
  'sequence': 'μεσα κοινωνικης δικτυωσης'},
 {'score': 0.020922638475894928,
  'token': 1104,
  'token_str': ' μεσα',
  'sequence': 'μεσα μεσα δικτυωσης'},
 {'score': 0.017568595707416534,
  'token': 337,
  'token_str': ' της',
  'sequence': 'μεσα της δικτυωσης'},
 {'score': 0.006678201723843813,
  'token': 1258,
  'token_str': 'τικης',
  'sequence': 'μεσατικης δικτυωσης'},
 {'score': 0.004737381357699633,
  'token': 16245,
  'token_str': 'τερης',
  'sequence': 'μεσατερης δικτυωσης'}]
```
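
Because the pre-processing step strips punctuation, running it over a string that already contains the mask token would likely damage the token itself, assuming a RoBERTa-style `<mask>` whose angle brackets count as punctuation. One possible way around this, sketched below, is to clean the text on each side of the gap separately and insert the mask token afterwards. The helper name `build_masked_input` and its parameters are hypothetical, and the default `mask_token` value is an assumption; in practice pass `fill.tokenizer.mask_token`.

```python
import re
import unicodedata


def preprocess(text, default_replace=""):
    # Same pre-processing as described in the model card, repeated for self-containment.
    text = text.lower()
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text


def build_masked_input(left, right, mask_token="<mask>"):
    # Hypothetical helper: clean both contexts first, then insert the mask token,
    # so the token's own punctuation is never seen by preprocess().
    parts = [preprocess(left).strip(), mask_token, preprocess(right).strip()]
    # Skip empty contexts so the result has no leading or trailing spaces.
    return " ".join(p for p in parts if p)


print(build_masked_input("Μέσα", "Δικτύωσης!"))
# μεσα <mask> δικτυωσης

# Pre-processing the full string instead strips the angle brackets:
print(preprocess("Μέσα <mask> Δικτύωσης!"))
# μεσα mask δικτυωσης
```

The resulting string can then be passed to the pipeline, for example `fill(build_masked_input("Μέσα", "Δικτύωσης", fill.tokenizer.mask_token))`.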

## Evaluation on MLM and Sentiment Analysis tasks

For detailed results, refer to the thesis ['Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών'](http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623) (in English: "Text sentiment analysis in Greek using Transformer networks") (version - p2).

## Authors

Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos

## Citation info

http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623