---
language: el
tags:
- roberta
- twitter
- Greek
widget:
- text: "<mask>: μεγαλη υποχωρηση του ιικου φορτιου σε αττικη και θεσσαλονικη"
---
# Greek RoBERTa Uncased (v1)

Pretrained model on the Greek language using a masked language modeling (MLM) objective, built with [Hugging Face's](https://huggingface.co/) [Transformers](https://github.com/huggingface/transformers) library. This model is uncased and accent-free: it does not distinguish between upper and lower case, and all Greek diacritics are stripped.
### Training data

This model was pretrained on almost 18M unique tweets, all in Greek, collected between 2008 and 2021 from almost 450K distinct users.
### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50256. For the tokenizer, we split strings containing any numbers (ex. EU2019 ==> EU 2019). The tweet normalization logic is illustrated in the example below.
```python
import html
import re
import unicodedata

from transformers import pipeline


def strip_accents(s):
    # decompose to NFD and drop combining marks (accents/diacritics)
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def normalize_tweet(tweet, do_lower=True, do_strip_accents=True,
                    do_split_word_numbers=False, user_fill='', url_fill=''):
    # your tweet pre-processing logic goes here; the regexes below are an
    # illustrative example, not the exact pipeline used for pretraining

    # unescape HTML entities and collapse extra spaces
    tweet = html.unescape(tweet)
    tweet = re.sub(r'\s+', ' ', tweet)

    # replace any @user with blank (user_fill)
    tweet = re.sub(r'@\w+', user_fill, tweet)
    # replace any link with blank (url_fill)
    tweet = re.sub(r'https?://\S+|www\.\S+', url_fill, tweet)
    # explode hashtags to strings by dropping the '#' (ex. #EU2019 ==> EU2019)
    tweet = re.sub(r'#(\w+)', r'\1', tweet)

    # further steps used in practice, omitted here for brevity:
    # replace non-standard punctuation, remove all emojis,
    # standardize punctuation, remove unicode symbols

    if do_split_word_numbers:
        # split strings containing any numbers (ex. EU2019 ==> EU 2019)
        tweet = re.sub(r'(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', tweet)

    if do_lower:
        tweet = tweet.lower()
    if do_strip_accents:
        tweet = strip_accents(tweet)

    return tweet.strip()


nlp = pipeline('fill-mask', model='cvcio/roberta-el-uncased-twitter-v1')

print(
    nlp(
        normalize_tweet(
            '<mask>: Μεγάλη υποχώρηση του ιικού φορτίου σε Αττική και Θεσσαλονίκη'
        )
    )
)
```
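As a quick sanity check of the number-splitting behaviour described above, you can inspect the tokenizer directly. A minimal sketch follows; the exact subword pieces depend on the learned BPE merges, so no specific output is guaranteed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cvcio/roberta-el-uncased-twitter-v1')

# after normalization, a hashtag like #EU2019 arrives as the lowercased,
# accent-free string 'eu 2019'; inspect how the byte-level BPE segments it
print(tokenizer.tokenize('eu 2019'))
```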
### Pretraining

The model was pretrained on a T4 GPU for 1.2M steps with a batch size of 96 and a sequence length of 96. The optimizer used was Adam with a learning rate of 1e-5, gradient accumulation over 8 steps, learning rate warmup for 50000 steps, and linear decay of the learning rate afterwards.
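For reference, a minimal sketch of the optimization schedule described above, using PyTorch and the Transformers helpers. The model construction and the MLM collator settings (e.g. the 15% masking probability) are assumptions for illustration, not the authors' exact training script:

```python
import torch
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    get_linear_schedule_with_warmup,
)

tokenizer = AutoTokenizer.from_pretrained('cvcio/roberta-el-uncased-twitter-v1')
model = RobertaForMaskedLM(RobertaConfig(vocab_size=50256))  # assumed config

# batch size 96 with gradient accumulation over 8 steps,
# sequences truncated/padded to length 96
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=50_000,       # linear warmup for 50K steps
    num_training_steps=1_200_000,  # 1.2M steps total, linear decay afterwards
)

# standard MLM masking; 0.15 is the common default, assumed here
collator = DataCollatorForLanguageModeling(
    tokenizer, mlm=True, mlm_probability=0.15
)
```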
### Authors

Dimitris Papaevagelou - [@andefined](https://github.com/andefined)
### About Us

[Civic Information Office](https://cvcio.org/) is a non-profit organization based in Athens, Greece, focused on creating technology and research products for the public interest.