---
language:
- el
---
# PaloBERT
A Greek pre-trained language model based on [RoBERTa](https://arxiv.org/abs/1907.11692).
## Pre-training data
The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included.
The corpus was provided by [Palo LTD](http://www.paloservices.com/).
## Requirements
```
pip install transformers
pip install torch
```
## Pre-processing details
In order to use `palobert-base-greek-social-media`, the input text needs to be pre-processed as follows:
* remove all Greek diacritics
* convert to lowercase
* remove all punctuation
```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # convert to lowercase
    text = text.lower()
    # strip Greek diacritics: decompose with NFD and drop combining acute accents
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    # remove punctuation
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text
```
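For example, applying `preprocess` to a raw phrase (a hypothetical input, not taken from the training corpus) produces the lowercased, accent-free form the model expects:
```python
text = "Μέσα κοινωνικής δικτύωσης!"
print(preprocess(text))  # μεσα κοινωνικης δικτυωσης
```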
## Load Model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media")
model = AutoModelForMaskedLM.from_pretrained("pchatz/palobert-base-greek-social-media")
```
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import pipeline
fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill(f'μεσα {fill.tokenizer.mask_token} δικτυωσης')
[{'score': 0.8760559558868408,
'token': 12853,
'token_str': ' κοινωνικης',
'sequence': 'μεσα κοινωνικης δικτυωσης'},
{'score': 0.020922638475894928,
'token': 1104,
'token_str': ' μεσα',
'sequence': 'μεσα μεσα δικτυωσης'},
{'score': 0.017568595707416534,
'token': 337,
'token_str': ' της',
'sequence': 'μεσα της δικτυωσης'},
{'score': 0.006678201723843813,
'token': 1258,
'token_str': 'τικης',
'sequence': 'μεσατικης δικτυωσης'},
{'score': 0.004737381357699633,
'token': 16245,
'token_str': 'τερης',
'sequence': 'μεσατερης δικτυωσης'}]
```
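Since the model was trained on pre-processed text, raw input should be passed through `preprocess` before masking. Below is a minimal sketch combining the two steps, reusing the `preprocess` function and the `fill` pipeline defined above; the example phrase and the choice of masked word are illustrative assumptions:
```python
raw = "Μέσα κοινωνικής δικτύωσης!"
clean = preprocess(raw)  # "μεσα κοινωνικης δικτυωσης"

# mask the middle word *after* pre-processing, since preprocess() would strip
# the punctuation inside the mask token itself
masked = clean.replace("κοινωνικης", fill.tokenizer.mask_token)
print(fill(masked)[0]["token_str"])  # expected to recover ' κοινωνικης'
```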
## Evaluation on MLM and Sentiment Analysis tasks
For detailed results, refer to the thesis ['Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών' (Sentiment analysis of Greek text using Transformer networks)](http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623) (version p2).
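For sentiment analysis, a pre-trained model like this is typically fine-tuned with a classification head on top of the encoder. A minimal sketch of a single fine-tuning step is shown below; the label set, example texts, and hyperparameters are illustrative assumptions and do not reproduce the thesis setup:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "pchatz/palobert-base-greek-social-media"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# the classification head is randomly initialized on top of the pre-trained encoder;
# 3 labels (negative / neutral / positive) is an assumption for illustration
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# toy pre-processed examples (hypothetical, not from the evaluation data)
texts = ["πολυ καλη εξυπηρετηση", "απαραδεκτη εμπειρια"]
labels = torch.tensor([2, 0])  # positive, negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

outputs = model(**batch, labels=labels)  # forward pass with cross-entropy loss
outputs.loss.backward()
optimizer.step()
```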
## Author
Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos
## Citation info
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623