andefined committed
Commit 060762c
1 parent: c0c1feb

model card
Files changed (5)
  1. README.md +30 -15
  2. all_results.json +0 -14
  3. train_results.json +0 -14
  4. trainer_state.json +0 -0
  5. training_args.bin +0 -3
README.md CHANGED
@@ -1,14 +1,24 @@
---
language: el
+ tags:
+ - roberta
+ - twitter
+ - Greek
widget:
- text: "<mask>: μεγαλη υποχωρηση του ιικου φορτιου σε αττικη και θεσσαλονικη"
---

# Greek RoBERTa Uncased (v1)

- This is a RoBERTa model (uncased, no-accents), trained on ~18M Greek tweets from ~450K distinct users.
+ Pretrained model on the Greek language with a masked language modeling (MLM) objective, using [Hugging Face's](https://huggingface.co/) [Transformers](https://github.com/huggingface/transformers) library. The model is case-insensitive and strips all Greek diacritics (uncased, no-accents).

- ### Usage
+ ### Training data
+
+ This model was pretrained on almost 18M unique tweets, all Greek, collected between 2008 and 2021 from almost 450K distinct users.
+
+ ### Preprocessing
+
+ The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50256. For the tokenizer, we split strings containing any numbers (ex. EU2019 ==> EU 2019). The tweet normalization logic is described in the example below.

```python
import unicodedata
@@ -18,19 +28,17 @@ def normalize_tweet(tweet, do_lower = True, do_strip_accents = True, do_split_wo
    # your tweet pre-processing logic goes here
    # example...

-     # tweet = standardize_text(tweet)
-     # tweet = split_quote_directive(tweet)
-     # tweet = remove_directives(tweet)
-     # tweet = replace_users(tweet, user_fill)
-     # tweet = replace_links(tweet, url_fill)
-     # tweet = explode_hashtags(tweet)
-     # tweet = remove_emojis(tweet)
+     # remove extra spaces, escape HTML, replace non-standard punctuation
+     # replace any @user with blank
+     # replace any link with blank
+     # explode hashtags to strings (ex. #EU2019 ==> EU 2019)
+     # remove all emojis

    # if do_split_word_numbers:
-     # tweet = split_word_numbers(tweet)
+     # split strings containing any numbers

-     # tweet = standardize_punctuation(tweet)
-     # tweet = remove_unicode_symbols(tweet)
+     # standardize punctuation
+     # remove unicode symbols

    if do_lower:
        tweet = tweet.lower()
@@ -48,13 +56,20 @@ nlp = pipeline('fill-mask', model = 'cvcio/roberta-el-uncased-twitter-v1')
print(
    nlp(
        normalize_tweet(
-             '<mask>: μεγαλη υποχωρηση του ιικου φορτιου σε αττικη και θεσσαλονικη'
+             '<mask>: Μεγάλη υποχώρηση του ιικού φορτίου σε Αττική και Θεσσαλονίκη'
        )
    )
)
```

+ ### Pretraining
+
+ The model was pretrained on a T4 GPU for 1.2M steps with a batch size of 96 and a sequence length of 96. The optimizer used is Adam with a learning rate of 1e-5, gradient accumulation over 8 steps, learning rate warmup for 50,000 steps, and linear decay of the learning rate afterwards.
+
- ## Authors
+ ### Authors

- Dimitris Papaevagelou - Github: [@andefined](https://github.com/andefined), Twitter: [@andefined](https://twitter.com/andefined)
+ Dimitris Papaevagelou - [@andefined](https://github.com/andefined)
+
+ ### About Us
+
+ [Civic Information Office](https://cvcio.org/) is a non-profit organization based in Athens, Greece, focusing on creating technology and research products for the public interest.
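The new card keeps the preprocessing steps as comments inside `normalize_tweet`. For readers who want something runnable, here is a minimal sketch that implements those comments; the regexes, the helper `split_word_numbers`, and the default for `do_split_word_numbers` are assumptions inferred from the card's comments and widget text, not the authors' exact code:

```python
import html
import re
import unicodedata


def split_word_numbers(text):
    # Assumed helper: insert a space at every letter/digit boundary
    # (ex. EU2019 ==> EU 2019).
    return re.sub(r"(?<=\D)(?=\d)|(?<=\d)(?=\D)", " ", text)


def normalize_tweet(tweet, do_lower=True, do_strip_accents=True, do_split_word_numbers=False):
    # remove extra spaces, escape HTML
    tweet = html.unescape(tweet)
    tweet = re.sub(r"\s+", " ", tweet).strip()

    # replace any @user and any link with blank
    tweet = re.sub(r"@\w+", "", tweet)
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)

    # explode hashtags to strings (ex. #EU2019 ==> EU2019);
    # emoji removal and punctuation standardization are omitted here
    tweet = re.sub(r"#(\w+)", r"\1", tweet)

    if do_split_word_numbers:
        tweet = split_word_numbers(tweet)

    if do_lower:
        tweet = tweet.lower()

    if do_strip_accents:
        # NFD-decompose, then drop combining marks (Greek diacritics)
        tweet = "".join(
            ch for ch in unicodedata.normalize("NFD", tweet)
            if unicodedata.category(ch) != "Mn"
        )

    return re.sub(r"\s+", " ", tweet).strip()


print(normalize_tweet('<mask>: Μεγάλη υποχώρηση του ιικού φορτίου σε Αττική και Θεσσαλονίκη'))
# '<mask>: μεγαλη υποχωρηση του ιικου φορτιου σε αττικη και θεσσαλονικη'
```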
all_results.json DELETED
@@ -1,14 +0,0 @@
- {
-   "epoch": 30.0,
-   "init_mem_cpu_alloc_delta": 2254888960,
-   "init_mem_cpu_peaked_delta": 380518400,
-   "init_mem_gpu_alloc_delta": 500087808,
-   "init_mem_gpu_peaked_delta": 0,
-   "train_mem_cpu_alloc_delta": 754831360,
-   "train_mem_cpu_peaked_delta": 377417728,
-   "train_mem_gpu_alloc_delta": 1529783296,
-   "train_mem_gpu_peaked_delta": 11636327936,
-   "train_runtime": 565670.9604,
-   "train_samples": 2708175,
-   "train_samples_per_second": 0.374
- }
train_results.json DELETED
@@ -1,14 +0,0 @@
- {
-   "epoch": 30.0,
-   "init_mem_cpu_alloc_delta": 2254888960,
-   "init_mem_cpu_peaked_delta": 380518400,
-   "init_mem_gpu_alloc_delta": 500087808,
-   "init_mem_gpu_peaked_delta": 0,
-   "train_mem_cpu_alloc_delta": 754831360,
-   "train_mem_cpu_peaked_delta": 377417728,
-   "train_mem_gpu_alloc_delta": 1529783296,
-   "train_mem_gpu_peaked_delta": 11636327936,
-   "train_runtime": 565670.9604,
-   "train_samples": 2708175,
-   "train_samples_per_second": 0.374
- }
trainer_state.json DELETED
The diff for this file is too large to render. See raw diff
training_args.bin DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:ea47778f2402c5e66831a9486624c160b0723d302c1cb53bf931b3c6f0ccfd21
3
- size 2415
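The deleted `training_args.bin` was the pickled `TrainingArguments` for the run. As a point of reference, a sketch of arguments consistent with the figures quoted in the new card's Pretraining section might look like the following; every mapping here (notably whether the batch size of 96 is per device or effective, and the exact optimizer settings) is an assumption reconstructed from the prose, not the contents of the deleted file:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-el-uncased-twitter-v1",
    max_steps=1_200_000,              # "1.2M steps"
    per_device_train_batch_size=96,   # "batch size of 96" (per-device is assumed)
    gradient_accumulation_steps=8,    # "gradient accumulation steps of 8"
    learning_rate=1e-5,               # Adam-style optimizer with lr 1e-5
    warmup_steps=50_000,              # "warmup for 50000 steps"
    lr_scheduler_type="linear",       # linear decay after warmup
)
# The sequence length of 96 is applied when tokenizing the corpus,
# not through TrainingArguments.
```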