|
# tweet-topic-21-single |
|
|
|
This is a roBERTa-base model trained on ~124M tweets from January 2018 to December 2021, and fine-tuned for single-label topic classification on tweets. The base model can be found [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m), and the original reference is the [TweetEval](https://github.com/cardiffnlp/tweeteval) benchmark. This model is suitable for English.
|
|
|
- Reference Paper: [TimeLMs paper](https://arxiv.org/abs/2202.03829). |
|
- Git Repo: [TimeLMs official repository](https://github.com/cardiffnlp/timelms). |
|
|
|
<b>Labels</b>:

- 0 -> arts_&_culture
- 1 -> business_&_entrepreneurs
- 2 -> pop_culture
- 3 -> daily_life
- 4 -> sports_&_gaming
- 5 -> science_&_technology
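
For a quick top-label prediction without handling logits manually, the generic `transformers` pipeline API can also be used (a minimal sketch: the `text-classification` pipeline is standard `transformers` functionality rather than something specific to this card, and the printed output is illustrative):

```python
from transformers import pipeline

# The text-classification pipeline handles tokenization, inference, and
# softmax internally, returning the top label and its probability.
classifier = pipeline("text-classification", model="antypasd/tweet-topic-21-single")

print(classifier("Tesla stock is on the rise!"))
# Illustrative output: [{'label': 'business_&_entrepreneurs', 'score': 0.84}]
```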
|
|
|
|
|
## Full classification example |
|
|
|
```python
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax

MODEL = "antypasd/tweet-topic-21-single"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PT (PyTorch) model
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "Tesla stock is on the rise!"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# Convert the logits to probabilities
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# Print labels ranked from most to least probable
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = class_mapping[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```
|
|
|
Output: |
|
|
|
```
1) business_&_entrepreneurs 0.8361
2) science_&_technology 0.0904
3) pop_culture 0.0288
4) daily_life 0.0178
5) arts_&_culture 0.0137
6) sports_&_gaming 0.0133
```
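
To score several tweets in one forward pass, the objects from the example above (`tokenizer`, `model`, `class_mapping`) can be reused with a padded batch. This is a minimal sketch, assuming `torch` is installed (it is the backend already used above):

```python
import torch

texts = [
    "Tesla stock is on the rise!",
    "What a goal in the last minute of the match!",
]

# Pad the batch so all sequences share one length, then run a single forward pass.
encoded = tokenizer(texts, return_tensors='pt', padding=True)
with torch.no_grad():
    logits = model(**encoded).logits

# argmax over the class dimension gives the top label for each tweet.
for tweet, idx in zip(texts, logits.argmax(dim=-1).tolist()):
    print(f"{tweet} -> {class_mapping[idx]}")
```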