# Twitter-roBERTa-base

This is a roBERTa-base model trained on ~58M tweets, described and evaluated in the [_TweetEval_ benchmark (Findings of EMNLP 2020)](https://arxiv.org/pdf/2010.12421.pdf). To evaluate this and other LMs on Twitter-specific data, please refer to the [TweetEval official repository](https://github.com/cardiffnlp/tweeteval).
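
As a quick check that the checkpoint loads, the tokenizer and masked-LM model can be fetched with the standard `transformers` auto classes (a minimal snippet; the `pipeline` example below wraps these same objects):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "cardiffnlp/twitter-roberta-base"

# Downloads (or loads from the local cache) the tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
```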

## Example Masked Language Model 

```python
from transformers import pipeline, AutoTokenizer
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def print_candidates(candidates):
    # Print the top-5 predicted tokens with their rounded scores
    for i in range(5):
        token = tokenizer.decode(candidates[i]['token'])
        score = np.round(candidates[i]['score'], 4)
        print(f"{i+1}) {token} {score}")

texts = [
    "I am so <mask> 😊",
    "I am so <mask> 😢"
]
for text in texts:
    print(f"{'-'*30}\n{text}")
    candidates = fill_mask(text)
    print_candidates(candidates)
```

Output: 

```
------------------------------
I am so <mask> 😊
1)  happy 0.402
2)  excited 0.1441
3)  proud 0.143
4)  grateful 0.0669
5)  blessed 0.0334
------------------------------
I am so <mask> 😢
1)  sad 0.2641
2)  sorry 0.1605
3)  tired 0.138
4)  sick 0.0278
5)  hungry 0.0232
```

## Example Feature Extraction 
The base model can also be used as a tweet feature extractor. A minimal sketch is given below; note that mean pooling over the token embeddings is just one illustrative way to obtain a fixed-size vector, not an official recommendation.
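
```python
from transformers import AutoTokenizer, AutoModel
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)  # bare encoder, no LM head

text = "Good night 😊"
encoded_input = tokenizer(text, return_tensors="pt")
features = model(**encoded_input)

# Token-level embeddings: shape (1, sequence_length, 768)
token_embeddings = features[0].detach().numpy()

# Mean pooling over tokens yields one 768-dim vector per tweet
# (an illustrative pooling choice, not the only option)
tweet_embedding = np.mean(token_embeddings[0], axis=0)
print(tweet_embedding.shape)  # (768,)
```

The resulting vector can serve as input features for downstream classifiers, for example on the tasks in TweetEval.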