---
language: el
tags:
- roberta
- twitter
- Greek
widget:
- text: "<mask>: μεγαλη υποχωρηση του ιικου φορτιου σε αττικη και θεσσαλονικη"
---

# Greek RoBERTa Uncased (v1)

A model pretrained on Greek-language text with a masked language modeling (MLM) objective, built with [Hugging Face's](https://huggingface.co/) [Transformers](https://github.com/huggingface/transformers) library. The model is not case-sensitive and does not use Greek diacritics (uncased, no-accents).
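
The model can be loaded directly with the Transformers Auto classes. The snippet below is a minimal sketch that fills the masked token by hand; the `fill-mask` pipeline shown under Preprocessing does the same thing with less code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('cvcio/roberta-el-uncased-twitter-v1')
model = AutoModelForMaskedLM.from_pretrained('cvcio/roberta-el-uncased-twitter-v1')

# input should already be lowercased and accent-stripped (see Preprocessing)
text = '<mask>: μεγαλη υποχωρηση του ιικου φορτιου σε αττικη και θεσσαλονικη'
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# top-5 token predictions for the masked position
mask_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```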

### Training data

The model was pretrained on nearly 18M unique Greek-language tweets, collected between 2008 and 2021 from roughly 450K distinct users.

### Preprocessing

The texts are tokenized with a byte-level version of Byte-Pair Encoding (BPE) and a vocabulary size of 50256. Before tokenization we split strings that contain numbers (e.g. EU2019 ==> EU 2019); a possible implementation of this split is sketched after the example. The tweet normalization logic is outlined in the example below.

```python
import unicodedata
from transformers import pipeline

def normalize_tweet(tweet, do_lower=True, do_strip_accents=True, do_split_word_numbers=False, user_fill='', url_fill=''):
    # your tweet pre-processing logic goes here
    # example...

    # remove extra spaces, escape HTML, replace non-standard punctuation
    # replace any @user with user_fill (blank by default)
    # replace any link with url_fill (blank by default)
    # explode hashtags to strings (ex. #EU2019 ==> EU 2019)
    # remove all emojis

    # if do_split_word_numbers:
    #     split strings containing any numbers

    # standardize punctuation
    # remove unicode symbols

    if do_lower:
        tweet = tweet.lower()
    if do_strip_accents:
        tweet = strip_accents(tweet)

    return tweet.strip()

def strip_accents(s):
    # drop combining marks (Greek accents) after NFD normalization
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

nlp = pipeline('fill-mask', model='cvcio/roberta-el-uncased-twitter-v1')

print(
    nlp(
        normalize_tweet(
            # "Big drop in viral load in Attica and Thessaloniki"
            '<mask>: Μεγάλη υποχώρηση του ιικού φορτίου σε Αττική και Θεσσαλονίκη'
        )
    )
)
```
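
The exact rule used to split number-containing strings is not published with the model; the helper below is only a sketch of one plausible implementation (the name `split_word_numbers` and the regex are assumptions), splitting at letter/digit boundaries.

```python
import re

def split_word_numbers(text):
    # insert a space at every letter<->digit boundary, e.g. "EU2019" -> "EU 2019"
    text = re.sub(r'(?<=[^\W\d_])(?=\d)', ' ', text)
    text = re.sub(r'(?<=\d)(?=[^\W\d_])', ' ', text)
    return text

print(split_word_numbers('EU2019'))       # EU 2019
print(split_word_numbers('αττικη2021'))   # αττικη 2021
```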

### Pretraining

The model was pretrained on a T4 GPU for 1.2M steps with a batch size of 96 and a sequence length of 96. We used the Adam optimizer with a learning rate of 1e-5, gradient accumulation over 8 steps, learning rate warmup for 50,000 steps, and linear decay of the learning rate afterwards.
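
For reference, these hyperparameters map roughly onto the Transformers `Trainer` API as sketched below. This is not the actual training script: the card does not say whether the batch size of 96 is per device or effective, and the 15% masking probability is the usual RoBERTa default, assumed here.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained('cvcio/roberta-el-uncased-twitter-v1')

# dynamic masking for the MLM objective (masking rate assumed, not stated on the card)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir='roberta-el-uncased-twitter-v1',
    max_steps=1_200_000,              # 1.2M steps
    per_device_train_batch_size=96,   # batch size 96 (per-device vs. effective not specified)
    gradient_accumulation_steps=8,
    learning_rate=1e-5,               # Adam optimizer
    warmup_steps=50_000,
    lr_scheduler_type='linear',       # linear decay after warmup
)
# sequences are truncated/padded to 96 tokens when tokenizing the dataset
```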

### Authors

Dimitris Papaevagelou - [@andefined](https://github.com/andefined)

### About Us

[Civic Information Office](https://cvcio.org/) is a non-profit organization based in Athens, Greece, focused on creating technology and research products for the public interest.