---
language:
- en
- es
- ja
- el
widget:
- text: It is great to see athletes promoting awareness for climate change.
datasets:
- cardiffnlp/tweet_topic_multi
- cardiffnlp/tweet_topic_multilingual
license: mit
metrics:
- f1
pipeline_tag: text-classification
---

# tweet-topic-large-multilingual

This model is based on the [cardiffnlp/twitter-xlm-roberta-large-2022](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-large-2022) language model and is fine-tuned for multi-label topic classification in English, Spanish, Japanese, and Greek.

The model is trained on the [TweetTopic](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multi) and [X-Topic](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multilingual) datasets (see the main [EMNLP 2024 reference paper](https://arxiv.org/abs/2410.03075)).



<b>Labels</b>: 


| <span style="font-weight:normal">0: arts_&_culture</span>           | <span style="font-weight:normal">5: fashion_&_style</span>   | <span style="font-weight:normal">10: learning_&_educational</span>  | <span style="font-weight:normal">15: science_&_technology</span>  |
|-----------------------------|---------------------|----------------------------|--------------------------|
| 1: business_&_entrepreneurs | 6: film_tv_&_video  | 11: music                  | 16: sports               |
| 2: celebrity_&_pop_culture  | 7: fitness_&_health | 12: news_&_social_concern  | 17: travel_&_adventure   |
| 3: diaries_&_daily_life     | 8: food_&_dining    | 13: other_hobbies          | 18: youth_&_student_life |
| 4: family                   | 9: gaming           | 14: relationships          |                          |


## Full classification example

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from scipy.special import expit  # element-wise sigmoid

MODEL = "cardiffnlp/tweet-topic-large-multilingual"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "It is great to see athletes promoting awareness for climate change."
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)

# Multi-label setup: apply a sigmoid to each logit independently and
# keep every label whose score reaches 0.5.
scores = output[0][0].detach().numpy()
scores = expit(scores)
predictions = (scores >= 0.5) * 1


# TF (uncomment to use TensorFlow instead of PyTorch)
#from transformers import TFAutoModelForSequenceClassification
#tf_model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = tf_model.config.id2label
#text = "It is great to see athletes promoting awareness for climate change."
#tokens = tokenizer(text, return_tensors='tf')
#output = tf_model(**tokens)
#scores = output[0][0].numpy()
#scores = expit(scores)
#predictions = (scores >= 0.5) * 1

# Map to classes
for i in range(len(predictions)):
    if predictions[i]:
        print(class_mapping[i])

```
Output: 

```
news_&_social_concern
sports
```
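
The thresholding step in the example can also be sketched on its own, without downloading the model. The snippet below is a minimal illustration, not part of the released code: `decode_multilabel` is a hypothetical helper, the label ids are copied from the Labels table on this card, and `example_logits` are made-up values chosen so that only labels 12 and 16 are positive. It returns each selected label together with its sigmoid score, sorted from most to least confident.

```python
import numpy as np

# Label ids as listed in the Labels table of this card.
ID2LABEL = {
    0: "arts_&_culture", 1: "business_&_entrepreneurs",
    2: "celebrity_&_pop_culture", 3: "diaries_&_daily_life",
    4: "family", 5: "fashion_&_style", 6: "film_tv_&_video",
    7: "fitness_&_health", 8: "food_&_dining", 9: "gaming",
    10: "learning_&_educational", 11: "music",
    12: "news_&_social_concern", 13: "other_hobbies",
    14: "relationships", 15: "science_&_technology", 16: "sports",
    17: "travel_&_adventure", 18: "youth_&_student_life",
}


def decode_multilabel(logits, id2label=ID2LABEL, threshold=0.5):
    """Hypothetical helper: return (label, score) pairs at or above the threshold."""
    logits = np.asarray(logits, dtype=float)
    # Element-wise sigmoid, equivalent to scipy.special.expit.
    probs = 1.0 / (1.0 + np.exp(-logits))
    picked = [(id2label[i], float(p)) for i, p in enumerate(probs) if p >= threshold]
    # Most confident label first.
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Made-up logits for a single tweet (NOT real model output).
example_logits = np.full(19, -3.0)
example_logits[12] = 1.2
example_logits[16] = 2.5

for label, score in decode_multilabel(example_logits):
    print(f"{label}: {score:.3f}")
```

Because each label is scored independently, raising `threshold` trades recall for precision; 0.5 simply mirrors the cut-off used in the example above.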

## Results on X-Topic

| Metric       | English | Spanish | Japanese | Greek |
|--------------|---------|---------|----------|-------|
| **Macro-F1** | 60.2    | 52.9    | 57.3     | 50.3  |
| **Micro-F1** | 66.3    | 67.0    | 61.4     | 73.0  |
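
For readers unfamiliar with the two metrics, the sketch below shows how macro- and micro-averaged F1 differ for multi-label predictions. It is an illustration on toy data, not the evaluation script behind the table: `f1_scores` is a hypothetical helper, the arrays are invented, and labels with no true or predicted positives are assumed to score 0. Macro-F1 averages the per-label F1 values, so rare labels weigh as much as frequent ones, while Micro-F1 pools all decisions and is dominated by frequent labels.

```python
import numpy as np


def f1_scores(y_true, y_pred):
    """Hypothetical helper: return (macro_f1, micro_f1) for binary indicator matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    # Per-label counts over all examples (rows = tweets, columns = labels).
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        # Assumption: a label with no positives at all scores 0.
        return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

    macro = float(f1(tp, fp, fn).mean())              # average of per-label F1
    micro = float(f1(tp.sum(), fp.sum(), fn.sum()))   # F1 of pooled counts
    return macro, micro


# Toy data: 4 tweets, 3 labels; label 0 is frequent and always predicted correctly.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]]

macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"Macro-F1: {macro_f1:.3f}  Micro-F1: {micro_f1:.3f}")
```

In this toy case the frequent label is perfect and the two rare labels are missed, so Micro-F1 comes out higher than Macro-F1, the same direction as the gap in the table above.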




## BibTeX entry and citation info

TBA