---
language: en
tags:
- timelms
- twitter
license: mit
datasets:
- twitter-api
---

# Twitter June 2022 (RoBERTa-base, 154M)

This is a RoBERTa-base model trained on 153.86M tweets up to the end of June 2022 (a 15M-tweet increment over the previous checkpoint).
More details and performance scores are available in the [TimeLMs paper](https://arxiv.org/abs/2202.03829).

Below, we provide some usage examples using the standard Transformers interface. For another interface better suited to comparing predictions and perplexity scores across models trained on different temporal intervals, check the [TimeLMs repository](https://github.com/cardiffnlp/timelms).

For models trained on other periods, check this [table](https://github.com/cardiffnlp/timelms#released-models).

## Preprocess Text
Replace usernames and links with the placeholders "@user" and "http".
If you're interested in retaining the verified users that were also retained during training, you may keep the usernames listed [here](https://github.com/cardiffnlp/timelms/tree/main/data).
```python
def preprocess(text):
    preprocessed_text = []
    for t in text.split():
        if len(t) > 1:
            t = '@user' if t[0] == '@' and t.count('@') == 1 else t  # mask user mentions
            t = 'http' if t.startswith('http') else t  # mask links
        preprocessed_text.append(t)
    return ' '.join(preprocessed_text)
```
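
A quick sanity check of the helper above (the handle and link in this example are made up for illustration):

```python
print(preprocess("@dan loved the new episode! https://t.co/abc123"))
# -> "@user loved the new episode! http"
```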

## Example Masked Language Model

```python
from transformers import pipeline, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-jun2022-15M-incr"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def pprint(candidates, n):
    # print the top-n candidate tokens with their scores
    for i in range(n):
        token = tokenizer.decode(candidates[i]['token'])
        score = candidates[i]['score']
        print("%d) %.5f %s" % (i+1, score, token))

texts = [
    "So glad I'm <mask> vaccinated.",
    "I keep forgetting to bring a <mask>.",
    "Looking forward to watching <mask> Game tonight!",
]
for text in texts:
    t = preprocess(text)
    print(f"{'-'*30}\n{t}")
    candidates = fill_mask(t)
    pprint(candidates, 5)
```

Output:

```
------------------------------
So glad I'm <mask> vaccinated.
1) 0.48904 not
2) 0.19832 fully
3) 0.13791 getting
4) 0.02852 still
5) 0.01900 triple
------------------------------
I keep forgetting to bring a <mask>.
1) 0.05997 backpack
2) 0.05158 charger
3) 0.05071 book
4) 0.04741 lighter
5) 0.03621 bag
------------------------------
Looking forward to watching <mask> Game tonight!
1) 0.54114 the
2) 0.23145 The
3) 0.01682 this
4) 0.01435 Squid
5) 0.01300 End
```
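
The fill-mask pipeline also accepts the standard Transformers options `top_k` and `targets`, which are handy when you only want to score a few specific candidate words. A minimal sketch (the candidate words are our own; the leading spaces are needed to match RoBERTa's BPE tokens):

```python
t = preprocess("So glad I'm <mask> vaccinated.")
# restrict scoring to two candidate tokens
for c in fill_mask(t, targets=[" fully", " not"], top_k=2):
    print("%.5f %s" % (c['score'], c['token_str']))
```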

## Example Tweet Embeddings
```python
from transformers import AutoTokenizer, AutoModel
import numpy as np
from scipy.spatial.distance import cosine
from collections import Counter

def get_embedding(text):  # naive approach for demonstration
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    features = model(**encoded_input)
    features = features[0].detach().cpu().numpy()
    return np.mean(features[0], axis=0)  # average over all token embeddings


MODEL = "cardiffnlp/twitter-roberta-base-jun2022-15M-incr"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

query = "The book was awesome"
tweets = ["I just ordered fried chicken 🐣",
          "The movie was great",
          "What time is the next game?",
          "Just finished reading 'Embeddings in NLP'"]

sims = Counter()
for tweet in tweets:
    sim = 1 - cosine(get_embedding(query), get_embedding(tweet))
    sims[tweet] = sim

print('Most similar to: ', query)
print(f"{'-'*30}")
for idx, (tweet, sim) in enumerate(sims.most_common()):
    print("%d) %.5f %s" % (idx+1, sim, tweet))
```

Output:

```
Most similar to: The book was awesome
------------------------------
1) 0.98878 The movie was great
2) 0.96100 Just finished reading 'Embeddings in NLP'
3) 0.94927 I just ordered fried chicken 🐣
4) 0.94668 What time is the next game?
```
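
The `get_embedding` helper above runs one tweet at a time. If you embed many tweets in a single padded batch, padding positions would be averaged in; a hedged sketch (our addition, not part of the original card) that masks them out with the attention mask:

```python
import torch

def get_embeddings_batched(texts):
    # encode a batch with padding, then mean-pool over real (non-padded) tokens only
    enc = tokenizer([preprocess(t) for t in texts], return_tensors='pt', padding=True)
    with torch.no_grad():
        hidden = model(**enc)[0]                # (batch, seq_len, hidden_size)
    mask = enc['attention_mask'].unsqueeze(-1)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)         # zero out padding before summing
    counts = mask.sum(dim=1).clamp(min=1)       # real token counts per tweet
    return (summed / counts).numpy()
```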

## Example Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base-jun2022-15M-incr"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)

# PyTorch
model = AutoModel.from_pretrained(MODEL)
encoded_input = tokenizer(text, return_tensors='pt')
features = model(**encoded_input)
features = features[0].detach().cpu().numpy()
features_mean = np.mean(features[0], axis=0)
#features_max = np.max(features[0], axis=0)

# # TensorFlow
# model = TFAutoModel.from_pretrained(MODEL)
# encoded_input = tokenizer(text, return_tensors='tf')
# features = model(encoded_input)
# features = features[0].numpy()
# features_mean = np.mean(features[0], axis=0)
# #features_max = np.max(features[0], axis=0)
```
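
If you prefer not to handle tensors directly, the generic `feature-extraction` pipeline from Transformers returns the same hidden states as nested Python lists; a minimal sketch:

```python
from transformers import pipeline

extractor = pipeline("feature-extraction", model=MODEL, tokenizer=MODEL)
features = extractor(preprocess("Good night 😊"))  # [batch][token][hidden_size]
print(len(features[0]), len(features[0][0]))       # token count, hidden size (768)
```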