---
language:
- en
- es
- ja
- el
widget:
- text: It is great to see athletes promoting awareness for climate change.
datasets:
- cardiffnlp/tweet_topic_multi
- cardiffnlp/tweet_topic_multilingual
license: mit
metrics:
- f1
pipeline_tag: text-classification
---
# tweet-topic-large-multilingual
This model is based on the [cardiffnlp/twitter-xlm-roberta-large-2022](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-large-2022) language model and is fine-tuned for multi-label topic classification in English, Spanish, Japanese, and Greek.
The model was trained on the [TweetTopic](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multi) and [X-Topic](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multilingual) datasets (see the main [EMNLP 2024 reference paper](https://arxiv.org/abs/2410.03075)).
<b>Labels</b>:
| <span style="font-weight:normal">0: arts_&_culture</span> | <span style="font-weight:normal">5: fashion_&_style</span> | <span style="font-weight:normal">10: learning_&_educational</span> | <span style="font-weight:normal">15: science_&_technology</span> |
|-----------------------------|---------------------|----------------------------|--------------------------|
| 1: business_&_entrepreneurs | 6: film_tv_&_video | 11: music | 16: sports |
| 2: celebrity_&_pop_culture | 7: fitness_&_health | 12: news_&_social_concern | 17: travel_&_adventure |
| 3: diaries_&_daily_life | 8: food_&_dining | 13: other_hobbies | 18: youth_&_student_life |
| 4: family | 9: gaming | 14: relationships | |
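As a lighter-weight alternative to the full example below, the model can also be queried through the `pipeline` API. This is a minimal sketch rather than the reference usage: `top_k=None` and `function_to_apply="sigmoid"` are passed explicitly so every topic receives an independent probability, the 0.5 threshold simply mirrors the example below, and a reasonably recent `transformers` release is assumed.

```python
from transformers import pipeline

MODEL = "cardiffnlp/tweet-topic-large-multilingual"
pipe = pipeline("text-classification", model=MODEL)

text = "It is great to see athletes promoting awareness for climate change."
# top_k=None returns a score for every topic; sigmoid keeps the scores
# independent (multi-label) instead of normalising them with softmax.
preds = pipe(text, top_k=None, function_to_apply="sigmoid")
print([p["label"] for p in preds if p["score"] >= 0.5])
```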
## Full classification example
```python
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import expit

MODEL = "cardiffnlp/tweet-topic-large-multilingual"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "It is great to see athletes promoting awareness for climate change."
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)

scores = output[0][0].detach().numpy()
scores = expit(scores)  # sigmoid: independent probability per topic
predictions = (scores >= 0.5) * 1

# TF
#tf_model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = tf_model.config.id2label
#text = "It is great to see athletes promoting awareness for climate change."
#tokens = tokenizer(text, return_tensors='tf')
#output = tf_model(**tokens)
#scores = output[0][0]
#scores = expit(scores)
#predictions = (scores >= 0.5) * 1

# Map to classes
for i in range(len(predictions)):
    if predictions[i]:
        print(class_mapping[i])
```
Output:
```
news_&_social_concern
sports
```
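The 0.5 cut-off is a convention rather than a tuned threshold. To see how confident the model is in each topic, the `scores` array from the PyTorch snippet above can be ranked directly (a short sketch continuing from that code):

```python
# Continues from the PyTorch example above: print the five highest-scoring topics.
for idx in np.argsort(scores)[::-1][:5]:
    i = int(idx)
    print(f"{class_mapping[i]}: {scores[i]:.3f}")
```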
## Results on X-Topic
| | English | Spanish | Japanese | Greek |
|--------------|---------|---------|----------|-------|
| **Macro-F1** | 60.2 | 52.9 | 57.3 | 50.3 |
| **Micro-F1** | 66.3 | 67.0 | 61.4 | 73.0 |
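For reference, Macro/Micro-F1 can be recomputed with scikit-learn. The sketch below is an assumption-heavy outline, not the official evaluation script: the config name `"en"`, the `test` split, and the `text`/`gold_label_list` field names are guesses about the X-Topic layout and should be checked against the dataset card. It reuses `tokenizer`, `model`, `class_mapping`, and `expit` from the classification example above.

```python
# Hedged sketch of how Macro/Micro-F1 might be recomputed; not the official script.
# Reuses tokenizer, model, class_mapping and expit from the example above.
from datasets import load_dataset
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# ASSUMPTION: config name "en", split "test", fields "text" and "gold_label_list".
ds = load_dataset("cardiffnlp/tweet_topic_multilingual", "en", split="test")

label_names = [class_mapping[i] for i in range(len(class_mapping))]
mlb = MultiLabelBinarizer(classes=label_names)
y_true = mlb.fit_transform(ds["gold_label_list"])

y_pred = []
for text in ds["text"]:
    tokens = tokenizer(text, return_tensors="pt")
    probs = expit(model(**tokens)[0][0].detach().numpy())
    y_pred.append((probs >= 0.5).astype(int))

print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("Micro-F1:", f1_score(y_true, y_pred, average="micro"))
```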
## BibTeX entry and citation info
TBA