# tweet-topic-21-single

This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021 (see [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m)), and fine-tuned for single-label topic classification on a corpus of 6,997 [tweets](https://huggingface.co/datasets/cardiffnlp/tweet_topic_single).

The original RoBERTa-base model can be found [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m), and the original reference paper is [TweetEval](https://github.com/cardiffnlp/tweeteval). This model is suitable for English text.

- Reference Papers: [TimeLMs paper](https://arxiv.org/abs/2202.03829), [TweetTopic](https://arxiv.org/abs/2209.09824).
- Git Repo: [TimeLMs official repository](https://github.com/cardiffnlp/timelms).

<b>Labels</b>:

- 0 -> arts_&_culture;
- 1 -> business_&_entrepreneurs;
- 2 -> pop_culture;
- 3 -> daily_life;
- 4 -> sports_&_gaming;
- 5 -> science_&_technology

## Full classification example

```python
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax

MODEL = "cardiffnlp/tweet-topic-21-single"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "Tesla stock is on the rise!"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# Logits for the single input, converted to probabilities
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# TF
#model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = model.config.id2label
#text = "Tesla stock is on the rise!"
#encoded_input = tokenizer(text, return_tensors='tf')
#output = model(**encoded_input)
#scores = output[0][0]
#scores = softmax(scores)

# Print the labels from most to least likely
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = class_mapping[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```

Output:

```
1) business_&_entrepreneurs 0.8361
2) science_&_technology 0.0904
3) pop_culture 0.0288
4) daily_life 0.0178
5) arts_&_culture 0.0137
6) sports_&_gaming 0.0133
```
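
The post-processing step in the example above (softmax over the logits, then ranking by probability) can also be tried without downloading the model. The following is a minimal sketch using only NumPy. The helper names `softmax` and `top_k_topics` and the `example_logits` values are hypothetical, invented for illustration, and are not model output. The label ids are copied from the Labels list in this card. With the real model, `model.config.id2label` should be used instead of a hard-coded dictionary.

```python
import numpy as np

# Label ids copied from the "Labels" list in this card (assumption: they match
# model.config.id2label; prefer reading the mapping from the loaded model).
ID2LABEL = {
    0: "arts_&_culture",
    1: "business_&_entrepreneurs",
    2: "pop_culture",
    3: "daily_life",
    4: "sports_&_gaming",
    5: "science_&_technology",
}


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def top_k_topics(logits, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(logits)
    order = np.argsort(probs)[::-1][:k]
    return [(ID2LABEL[int(i)], float(probs[i])) for i in order]


# Hypothetical logits for one tweet (made up for illustration, not model output).
example_logits = [0.1, 4.2, 0.9, 0.4, 0.0, 2.0]
for label, prob in top_k_topics(example_logits, k=3):
    print(f"{label}: {prob:.4f}")
```

Since this is a single-label classifier, the first entry returned by `top_k_topics` is the predicted topic. The remaining entries only indicate how the rest of the probability mass is distributed.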