---
language:
- el
---

# PaloBERT

A Greek pre-trained language model based on [RoBERTa](https://arxiv.org/abs/1907.11692).

## Pre-training data

The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook, and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included.

The corpus was provided by [Palo LTD](http://www.paloservices.com/).


## Requirements

```sh
pip install transformers
pip install torch
```

## Pre-processing details

In order to use `palobert-base-greek-social-media`, the text needs to be pre-processed as follows:

* remove all Greek diacritics
* convert to lowercase
* remove all punctuation

```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # convert to lowercase
    text = text.lower()
    # decompose accented characters (NFD) and strip the acute accent (tonos)
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    # remove all punctuation, keeping word characters and whitespace
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text
```
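As a quick sanity check, applying `preprocess` to an accented, punctuated phrase yields the normalized form the model expects (the example phrase below is our own, not from the training corpus):

```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # lowercase, strip the acute accent (tonos), and remove punctuation
    text = text.lower()
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    return re.sub(r'[^\w\s]', default_replace, text)

print(preprocess('Μέσα κοινωνικής δικτύωσης!'))  # μεσα κοινωνικης δικτυωσης
```

Note that stripping the combining acute accent after NFD decomposition covers standard monotonic Greek orthography.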

## Load Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media")

model = AutoModelForMaskedLM.from_pretrained("pchatz/palobert-base-greek-social-media")
```
You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill(f'μεσα {fill.tokenizer.mask_token} δικτυωσης')

# output:
[{'score': 0.8760559558868408,
  'token': 12853,
  'token_str': ' κοινωνικης',
  'sequence': 'μεσα κοινωνικης δικτυωσης'},
 {'score': 0.020922638475894928,
  'token': 1104,
  'token_str': ' μεσα',
  'sequence': 'μεσα μεσα δικτυωσης'},
 {'score': 0.017568595707416534,
  'token': 337,
  'token_str': ' της',
  'sequence': 'μεσα της δικτυωσης'},
 {'score': 0.006678201723843813,
  'token': 1258,
  'token_str': 'τικης',
  'sequence': 'μεσατικης δικτυωσης'},
 {'score': 0.004737381357699633,
  'token': 16245,
  'token_str': 'τερης',
  'sequence': 'μεσατερης δικτυωσης'}]
```

## Evaluation on MLM and Sentiment Analysis tasks

For detailed results, refer to the thesis ['Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών'](http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623) ("Text Sentiment Analysis in Greek Using Transformer Networks"; version p2).

## Author

Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos

## Citation info

http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623