---
language: en
tags:
- timelms
- twitter
license: mit
datasets:
- twitter-api
---

# Twitter 2022 154M (RoBERTa-large, 154M - full update)

This is a RoBERTa-large model trained on 154M tweets up to the end of December 2022 (trained from the original checkpoint, with no incremental updates).

These 154M tweets result from filtering 220M tweets obtained exclusively from the Twitter Academic API, covering every month between 2018-01 and 2022-12.
Filtering and preprocessing details are available in the [TimeLMs paper](https://arxiv.org/abs/2202.03829).

Below, we provide some usage examples using the standard Transformers interface. For another interface, better suited to comparing predictions and perplexity scores across models trained over different temporal intervals, check the [TimeLMs repository](https://github.com/cardiffnlp/timelms).

For other models trained up to different periods, check this [table](https://github.com/cardiffnlp/timelms#released-models).

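If you just want a rough, self-contained way to compute the perplexity scores mentioned above, the sketch below implements masked-LM pseudo-perplexity with standard Transformers only. This is our own illustrative helper, not the TimeLMs API; lower scores mean the model finds the text more predictable (for real tweets, first apply the preprocessing described in the next section).

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

def pseudo_perplexity(text, model_name):
    # Mask each token in turn, accumulate the negative log-likelihood the
    # model assigns to the true token, then exponentiate the average.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()
    input_ids = tokenizer(text, return_tensors='pt')['input_ids'][0]
    nll, count = 0.0, 0
    for i in range(1, len(input_ids) - 1):  # skip <s> and </s>
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nll -= log_probs[input_ids[i]].item()
        count += 1
    return torch.exp(torch.tensor(nll / count)).item()

# e.g., score the same tweet under models trained up to different periods:
# pseudo_perplexity("Looking forward to the game tonight!",
#                   "cardiffnlp/twitter-roberta-large-2022-154m")
```
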
## Preprocess Text
Replace usernames and links with the placeholders "@user" and "http".
If you're interested in retaining verified users, which were also kept during training, you may keep the usernames listed [here](https://github.com/cardiffnlp/timelms/tree/main/data).
```python
def preprocess(text):
    # Replace each username with '@user' and each link with 'http'
    preprocessed_text = []
    for t in text.split():
        if len(t) > 1:
            t = '@user' if t[0] == '@' and t.count('@') == 1 else t
            t = 'http' if t.startswith('http') else t
        preprocessed_text.append(t)
    return ' '.join(preprocessed_text)
```
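
A quick sanity check on a made-up tweet:

```python
print(preprocess("@username can't wait for tonight!! https://t.co/abc123"))
# @user can't wait for tonight!! http
```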

## Example Masked Language Model

```python
from transformers import pipeline, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-large-2022-154m"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def pprint(candidates, n):
    # Print the top-n fill-mask candidates with their scores
    for i in range(n):
        token = tokenizer.decode(candidates[i]['token'])
        score = candidates[i]['score']
        print("%d) %.5f %s" % (i+1, score, token))

texts = [
    "So glad I'm <mask> vaccinated.",
    "I keep forgetting to bring a <mask>.",
    "Looking forward to watching <mask> Game tonight!",
]
for text in texts:
    t = preprocess(text)
    print(f"{'-'*30}\n{t}")
    candidates = fill_mask(t)
    pprint(candidates, 5)
```

Output:

```
------------------------------
So glad I'm <mask> vaccinated.
1) 0.26251 not
2) 0.25460 a
3) 0.12611 in
4) 0.11036 the
5) 0.04210 getting
------------------------------
I keep forgetting to bring a <mask>.
1) 0.09274 charger
2) 0.04727 lighter
3) 0.04469 mask
4) 0.04395 drink
5) 0.03644 camera
------------------------------
Looking forward to watching <mask> Game tonight!
1) 0.57683 Squid
2) 0.17419 The
3) 0.04198 the
4) 0.00970 Spring
5) 0.00921 Big
```
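
Note that the placeholder in the input must match the model's mask token; for this RoBERTa-style tokenizer it is `<mask>`:

```python
print(tokenizer.mask_token)     # <mask>
print(tokenizer.mask_token_id)  # integer id of the mask token in the vocabulary
```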

## Example Tweet Embeddings
```python
from transformers import AutoTokenizer, AutoModel
import numpy as np
from scipy.spatial.distance import cosine
from collections import Counter

def get_embedding(text):  # naive approach for demonstration
    # Mean of the last hidden states across all tokens
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    features = model(**encoded_input)
    features = features[0].detach().cpu().numpy()
    return np.mean(features[0], axis=0)


MODEL = "cardiffnlp/twitter-roberta-large-2022-154m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

query = "The book was awesome"
tweets = ["I just ordered fried chicken 🐣",
          "The movie was great",
          "What time is the next game?",
          "Just finished reading 'Embeddings in NLP'"]

sims = Counter()
for tweet in tweets:
    sim = 1 - cosine(get_embedding(query), get_embedding(tweet))
    sims[tweet] = sim

print('Most similar to: ', query)
print(f"{'-'*30}")
for idx, (tweet, sim) in enumerate(sims.most_common()):
    print("%d) %.5f %s" % (idx+1, sim, tweet))
```
Output:

```
Most similar to: The book was awesome
------------------------------
1) 0.99403 The movie was great
2) 0.98006 Just finished reading 'Embeddings in NLP'
3) 0.97314 What time is the next game?
4) 0.92448 I just ordered fried chicken 🐣
```
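
The `get_embedding` helper above encodes one tweet at a time. For larger collections, a batched variant that mean-pools only over real (non-padding) tokens is a common refinement; below is a minimal sketch of that idea (our own addition, reusing the `tokenizer`, `model`, and `preprocess` defined above):

```python
import torch

def get_embeddings_batch(texts):
    # Pad the batch to a common length, then mean-pool the last hidden
    # states using the attention mask so padding tokens are ignored.
    texts = [preprocess(t) for t in texts]
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state  # (batch, seq_len, dim)
    mask = encoded['attention_mask'].unsqueeze(-1).float()
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).numpy()
```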

## Example Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-large-2022-154m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)

# PyTorch
model = AutoModel.from_pretrained(MODEL)
encoded_input = tokenizer(text, return_tensors='pt')
features = model(**encoded_input)
features = features[0].detach().cpu().numpy()
features_mean = np.mean(features[0], axis=0)
#features_max = np.max(features[0], axis=0)

# # TensorFlow
# model = TFAutoModel.from_pretrained(MODEL)
# encoded_input = tokenizer(text, return_tensors='tf')
# features = model(encoded_input)
# features = features[0].numpy()
# features_mean = np.mean(features[0], axis=0)
# #features_max = np.max(features[0], axis=0)
```
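
As a quick dimensionality check (assuming the RoBERTa-large checkpoint above, whose hidden size is 1024):

```python
print(features.shape)       # (1, sequence_length, 1024)
print(features_mean.shape)  # (1024,)
```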