---
language:
- en
- es
- ja
- el
widget:
- text: It is great to see athletes promoting awareness for climate change.
datasets:
- cardiffnlp/tweet_topic_multi
- cardiffnlp/tweet_topic_multilingual
license: mit
metrics:
- f1
pipeline_tag: text-classification
---

# tweet-topic-large-multilingual

This model is based on the [cardiffnlp/twitter-xlm-roberta-large-2022](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-large-2022) language model and is fine-tuned for multi-label topic classification in English, Spanish, Japanese, and Greek. The model is trained on the [TweetTopic](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multi) and [X-Topic](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multilingual) datasets (see the main [EMNLP 2024 reference paper](https://arxiv.org/abs/2410.03075)).

Labels:

| 0: arts_&_culture           | 5: fashion_&_style  | 10: learning_&_educational | 15: science_&_technology |
|-----------------------------|---------------------|----------------------------|--------------------------|
| 1: business_&_entrepreneurs | 6: film_tv_&_video  | 11: music                  | 16: sports               |
| 2: celebrity_&_pop_culture  | 7: fitness_&_health | 12: news_&_social_concern  | 17: travel_&_adventure   |
| 3: diaries_&_daily_life     | 8: food_&_dining    | 13: other_hobbies          | 18: youth_&_student_life |
| 4: family                   | 9: gaming           | 14: relationships          |                          |

## Full classification example

```python
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import expit

MODEL = "cardiffnlp/tweet-topic-large-multilingual"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PyTorch
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "It is great to see athletes promoting awareness for climate change."
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)

# Apply a sigmoid to the raw logits, then threshold at 0.5 (multi-label)
scores = output[0][0].detach().numpy()
scores = expit(scores)
predictions = (scores >= 0.5) * 1

# TensorFlow (equivalent, commented out)
#tf_model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = tf_model.config.id2label
#text = "It is great to see athletes promoting awareness for climate change."
#tokens = tokenizer(text, return_tensors='tf')
#output = tf_model(**tokens)
#scores = output[0][0]
#scores = expit(scores)
#predictions = (scores >= 0.5) * 1

# Map predicted indices to class names
for i in range(len(predictions)):
    if predictions[i]:
        print(class_mapping[i])
```

Output:

```
news_&_social_concern
sports
```

## Results on X-Topic

|              | English | Spanish | Japanese | Greek |
|--------------|---------|---------|----------|-------|
| **Macro-F1** | 60.2    | 52.9    | 57.3     | 50.3  |
| **Micro-F1** | 66.3    | 67.0    | 61.4     | 73.0  |

## BibTeX entry and citation info

TBA
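
## Pipeline example (sketch)

As a lighter-weight alternative, the same multi-label prediction can be sketched with the `transformers` pipeline API. This is a sketch rather than part of the original card: the `top_k=None` and `function_to_apply="sigmoid"` arguments are assumptions made here to reproduce the independent per-label sigmoid scores and the 0.5 threshold used in the full example above.

```python
from transformers import pipeline

MODEL = "cardiffnlp/tweet-topic-large-multilingual"

# Sketch: wrap the same checkpoint in a text-classification pipeline
pipe = pipeline("text-classification", model=MODEL, tokenizer=MODEL)

text = "It is great to see athletes promoting awareness for climate change."

# top_k=None returns a score for every label; function_to_apply="sigmoid"
# keeps the label scores independent, matching the multi-label setup
# (assumed necessary if the checkpoint config does not already request sigmoid)
scores = pipe(text, top_k=None, function_to_apply="sigmoid")

# Keep labels that clear the same 0.5 threshold as the full example
print([s["label"] for s in scores if s["score"] >= 0.5])
```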