Pedrada committed on
Commit
43f3d69
•
1 Parent(s): d851876

Add model card

  1. README.md +85 -0
README.md ADDED
@@ -0,0 +1,85 @@
+ # Twitter-scratch-roBERTa-base
+
+ This is a RoBERTa-base model trained from scratch on ~58M tweets, as described and evaluated in the [_TweetEval_ benchmark (Findings of EMNLP 2020)](https://arxiv.org/pdf/2010.12421.pdf).
+ To evaluate this and other LMs on Twitter-specific data, please refer to the [official TweetEval repository](https://github.com/cardiffnlp/tweeteval).
+
+ ## Preprocess Text
+ Replace usernames and links with placeholders: "@user" and "http".
+ ```python
+ def preprocess(text):
+     new_text = []
+     for t in text.split(" "):
+         t = '@user' if t.startswith('@') and len(t) > 1 else t  # mask mentions
+         t = 'http' if t.startswith('http') else t  # mask links
+         new_text.append(t)
+     return " ".join(new_text)
+ ```
+
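As a quick check of the function above, the following sketch runs `preprocess` on two made-up inputs. The sample tweets and the URL are illustrative assumptions, not taken from the TweetEval data, and the function is repeated only so that the snippet runs on its own.

```python
def preprocess(text):
    # Same logic as in the model card: mask @mentions and links.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical tweet used only for illustration.
sample = "@alice check this out https://example.com/post :)"
cleaned = preprocess(sample)
print(cleaned)  # @user check this out http :)

# A lone "@" is left untouched because of the len(t) > 1 guard.
lone_at = preprocess("meet me @ noon")
print(lone_at)  # meet me @ noon
```

Because the text is split on single spaces only, a mention that directly follows a newline or punctuation (for example `(@bob`) does not start with `@` as a token and is therefore not replaced.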
+ ## Example Masked Language Model
+
+ ```python
+ from transformers import pipeline, AutoTokenizer
+ import numpy as np
+
+ MODEL = "cardiffnlp/twitter-scratch-roberta-base"
+ fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
+
+ def print_candidates(candidates):
+     for i in range(5):
+         token = tokenizer.decode(candidates[i]['token'])
+         score = np.round(candidates[i]['score'], 4)
+         print(f"{i+1}) {token} {score}")
+
+ texts = [
+     "I am so <mask> 😊",
+     "I am so <mask> 😒"
+ ]
+ for text in texts:
+     t = preprocess(text)  # preprocess() is defined in the section above
+     print(f"{'-'*30}\n{t}")
+     candidates = fill_mask(t)
+     print_candidates(candidates)
+ ```
+
+ Output:
+
+ ```
+ ------------------------------
+ I am so <mask> 😊
+ 1) happy 0.530
+ 2) grateful 0.083
+ 3) excited 0.078
+ 4) thankful 0.053
+ 5) blessed 0.041
+ ------------------------------
+ I am so <mask> 😒
+ 1) sad 0.439
+ 2) sorry 0.088
+ 3) tired 0.045
+ 4) hurt 0.026
+ 5) upset 0.026
+ ```
+
+
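The listing above is built from the list of dictionaries that the `fill-mask` pipeline returns. In recent versions of `transformers`, each dictionary carries a `score` and a `token_str` field, among others. The sketch below formats such a list without loading the model or the tokenizer. `format_candidates` is a hypothetical helper, not part of the model card, and the sample entries are hand-copied from the first listing above rather than produced by a fresh model run.

```python
def format_candidates(candidates, k=5):
    """Return one 'rank) token score' line per candidate, best first."""
    # Sort by score so the helper does not depend on the input order.
    top = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    return [
        f"{rank}) {c['token_str'].strip()} {c['score']:.3f}"
        for rank, c in enumerate(top, start=1)
    ]

# Hand-written stand-in for pipeline output (values copied from the listing above).
sample = [
    {"token_str": " grateful", "score": 0.083},
    {"token_str": " happy", "score": 0.530},
    {"token_str": " excited", "score": 0.078},
]
lines = format_candidates(sample, k=2)
print("\n".join(lines))
```

Passing the candidates in as an argument keeps the helper usable on its own, for example on the result of `fill_mask(t)` for each preprocessed text.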
+ ### BibTeX entry and citation info
+
+ Please cite the [reference paper](https://aclanthology.org/2020.findings-emnlp.148/) if you use this model.
+
+ ```bibtex
+ @inproceedings{barbieri-etal-2020-tweeteval,
+     title = "{T}weet{E}val: Unified Benchmark and Comparative Evaluation for Tweet Classification",
+     author = "Barbieri, Francesco and
+       Camacho-Collados, Jose and
+       Espinosa Anke, Luis and
+       Neves, Leonardo",
+     booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
+     month = nov,
+     year = "2020",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2020.findings-emnlp.148",
+     doi = "10.18653/v1/2020.findings-emnlp.148",
+     pages = "1644--1650"
+ }
+ ```