pchatz/palobert-base-greek-social-media-v2

	---
	language:
	- el
	---

	# PaloBERT

	A Greek pre-trained language model based on [RoBERTa](https://arxiv.org/abs/1907.11692).

	## Pre-training data

	The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included.

	The corpus was provided by [Palo LTD](http://www.paloservices.com/).


	## Requirements

	```
	pip install transformers
	pip install torch
	```

	## Pre-processing details

	To use 'palobert-base-greek-social-media', the input text must be pre-processed as follows:

	* remove all Greek diacritics
	* convert to lowercase
	* remove all punctuation

	```python
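	import re
	import unicodedata

	def preprocess(text, default_replace=""):
	    # Convert to lowercase
	    text = text.lower()
	    # Decompose accented letters and drop the combining acute accent (the Greek tonos)
	    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
	    # Replace every character that is neither a word character nor whitespace
	    text = re.sub(r'[^\w\s]', default_replace, text)
	    return text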
	```
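
As a quick check of the steps above, the sketch below restates the same function and applies it to a few made-up strings (the sample inputs are illustrative and not taken from the training corpus). It also suggests that the punctuation regex removes any combining marks left over after NFD decomposition, such as the dialytika, because Python's `re` module does not treat them as word characters.

```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # Same three steps as above: lowercase, strip the acute accent, drop punctuation
    text = text.lower()
    text = unicodedata.normalize("NFD", text).translate(
        {ord("\N{COMBINING ACUTE ACCENT}"): None}
    )
    return re.sub(r"[^\w\s]", default_replace, text)

# Hypothetical sample inputs, chosen only for illustration
samples = [
    "Μέσα Κοινωνικής Δικτύωσης!",
    "@user #καλημέρα!!!",
    "Μαΐου",
]
cleaned = [preprocess(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Mentions and hashtags lose their `@` and `#` prefixes but keep the word itself, which may or may not be what a downstream task needs.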

	## Load Model

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media")

	model = AutoModelForMaskedLM.from_pretrained("pchatz/palobert-base-greek-social-media")
	```
	You can use this model directly with a pipeline for masked language modeling:

	```python
	from transformers import pipeline

	fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
	fill(f'μεσα {fill.tokenizer.mask_token} δικτυωσης')

	[{'score': 0.8760559558868408,
	'token': 12853,
	'token_str': ' κοινωνικης',
	'sequence': 'μεσα κοινωνικης δικτυωσης'},
	{'score': 0.020922638475894928,
	'token': 1104,
	'token_str': ' μεσα',
	'sequence': 'μεσα μεσα δικτυωσης'},
	{'score': 0.017568595707416534,
	'token': 337,
	'token_str': ' της',
	'sequence': 'μεσα της δικτυωσης'},
	{'score': 0.006678201723843813,
	'token': 1258,
	'token_str': 'τικης',
	'sequence': 'μεσατικης δικτυωσης'},
	{'score': 0.004737381357699633,
	'token': 16245,
	'token_str': 'τερης',
	'sequence': 'μεσατερης δικτυωσης'}]
	```
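
Because `preprocess` removes punctuation, it would also strip the angle brackets from a mask token such as `<mask>`. One way around this, sketched below, is to clean the text on either side of the gap first and insert the mask token afterwards. The helper name `build_masked_input` is hypothetical, and `<mask>` is assumed here only as the usual RoBERTa default; with the real pipeline, pass `fill.tokenizer.mask_token` instead.

```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # Lowercase, strip the acute accent, drop punctuation (as in the pre-processing section)
    text = text.lower()
    text = unicodedata.normalize("NFD", text).translate(
        {ord("\N{COMBINING ACUTE ACCENT}"): None}
    )
    return re.sub(r"[^\w\s]", default_replace, text)

def build_masked_input(left, right, mask_token):
    # Hypothetical helper: clean each side separately so the mask token stays intact
    parts = [preprocess(left).strip(), mask_token, preprocess(right).strip()]
    # Skip empty sides so the result has no leading or trailing space
    return " ".join(p for p in parts if p)

# "<mask>" is an assumption; use fill.tokenizer.mask_token with the real pipeline
text = build_masked_input("Μέσα", "Δικτύωσης!", "<mask>")
print(text)
```

The resulting string can then be passed to the pipeline, for example `fill(build_masked_input("Μέσα", "Δικτύωσης!", fill.tokenizer.mask_token))`.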

	## Evaluation on MLM and Sentiment Analysis tasks

	For detailed results, refer to the thesis ['Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών'](http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623) ("Text Sentiment Analysis in Greek Using Transformer Networks") (version - p2).

	## Authors

	Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos

	## Citation info

	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623