metadata

datasets:
  - cardiffnlp/tweet_topic_multi
metrics:
  - f1
  - accuracy
model-index:
  - name: cardiffnlp/roberta-large-tweet-topic-multi-2020
    results:
      - task:
          type: text-classification
          name: Text Classification
        dataset:
          name: cardiffnlp/tweet_topic_multi
          type: cardiffnlp/tweet_topic_multi
          args: cardiffnlp/tweet_topic_multi
          split: test_2021
        metrics:
          - name: F1
            type: f1
            value: 0.7323655694132079
          - name: F1 (macro)
            type: f1_macro
            value: 0.5794562917377284
          - name: Accuracy
            type: accuracy
            value: 0.4937462775461584
pipeline_tag: text-classification
widget:
  - text: >-
      I'm sure the {@Tampa Bay Lightning@} would’ve rather faced the Flyers but
      man does their experience versus the Blue Jackets this year and last help
      them a lot versus this Islanders team. Another meat grinder upcoming for
      the good guys
    example_title: Example 1
  - text: >-
      Love to take night time bike rides at the jersey shore. Seaside Heights
      boardwalk. Beautiful weather. Wishing everyone a safe Labor Day weekend in
      the US.
    example_title: Example 2

cardiffnlp/roberta-large-tweet-topic-multi-2020

This model is a fine-tuned version of roberta-large on the tweet_topic_multi dataset. It was fine-tuned on the train_2020 split and validated on the test_2021 split of tweet_topic. The fine-tuning script can be found here. It achieves the following results on the test_2021 set:

F1 (micro): 0.7323655694132079
F1 (macro): 0.5794562917377284
Accuracy: 0.4937462775461584
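
For readers unfamiliar with multi-label metrics, the following is a minimal sketch of how micro F1, macro F1, and accuracy can be computed from binary label matrices. It assumes that "Accuracy" means exact-match (subset) accuracy, which the card does not state explicitly. The function names and the toy data are hypothetical and are not taken from the fine-tuning script.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred):
    """Score 0/1 label matrices given as lists of equal-length rows."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))

    # Micro F1 pools the counts over all labels before computing F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro F1 averages the per-label F1 scores, so rare labels weigh equally.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Assumption: accuracy is exact match, i.e. every label in a row is correct.
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {"f1_micro": micro, "f1_macro": macro, "accuracy": exact}


# Toy data with three hypothetical labels and four examples.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
scores = multilabel_scores(y_true, y_pred)
print(scores)
```

Under this reading, a gap between micro and macro F1 like the one reported above would suggest weaker performance on less frequent topics, and the lower accuracy would reflect how strict the exact-match criterion is.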

Usage

import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def sigmoid(x):
  return 1 / (1 + math.exp(-x))
  
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/roberta-large-tweet-topic-multi-2020")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/roberta-large-tweet-topic-multi-2020", problem_type="multi_label_classification")
model.eval()
class_mapping = model.config.id2label

with torch.no_grad():
  text = "#NewVideo Cray Dollas- Water- Ft. Charlie Rose- (Official Music Video)- {{URL}} via {@YouTube@} #watchandlearn {{USERNAME}}"
  tokens = tokenizer(text, return_tensors='pt')
  output = model(**tokens)
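  # Multi-label decoding: apply a sigmoid to each logit independently and keep labels scoring above 0.5
  flags = [sigmoid(s) > 0.5 for s in output[0][0].detach().tolist()]
  topic = [class_mapping[n] for n, i in enumerate(flags) if i]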
print(topic)
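
The thresholding step in the snippet above can also be pulled out into a small helper that returns each selected topic together with its probability and lets the cutoff be adjusted. This is only a sketch: `decode_topics`, the toy logits, and the toy label names are hypothetical, and in real use the logits would come from `output[0][0].tolist()` and the mapping from `model.config.id2label`.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_topics(logits, id2label, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score exceeds threshold."""
    scored = [(id2label[i], sigmoid(s)) for i, s in enumerate(logits)]
    return [(label, prob) for label, prob in scored if prob > threshold]


# Hypothetical logits and label names, used only for illustration.
toy_id2label = {0: "music", 1: "sports", 2: "news"}
toy_logits = [2.0, -1.5, 0.3]

selected = decode_topics(toy_logits, toy_id2label)
print([label for label, _ in selected])
```

Because each label is scored independently, a tweet may receive several topics or none. Raising `threshold` trades recall for precision, and 0.5 is simply the value used in the usage snippet.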

Reference


@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis  and
    Ushio, Asahi  and
    Camacho-Collados, Jose  and
    Neves, Leonardo  and
    Silva, Vitor  and
    Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}