cardiffnlp/twitter-roberta-base-2019-90m-tweet-topic-multi-2020

This model is a fine-tuned version of cardiffnlp/twitter-roberta-base-2019-90m on the tweet_topic_multi dataset. It was fine-tuned on the train_2020 split and validated on the test_2021 split of tweet_topic. The fine-tuning script can be found here. The model achieves the following results on the test_2021 set:

  • F1 (micro): 0.7367104440275171
  • F1 (macro): 0.5656244617373364
  • Accuracy: 0.5134008338296605
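
The gap between the three numbers comes from how each metric aggregates over labels and tweets. The following is a minimal plain-Python sketch of that difference, not the evaluation code used for this model. It assumes the reported accuracy is exact-match (subset) accuracy, which this card does not state, and the function names `f1` and `multilabel_scores` are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred):
    """Score multi-label predictions.

    y_true, y_pred: lists of equal-length 0/1 lists, one row per tweet
    and one column per topic label.
    """
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1

    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Exact match (ASSUMPTION): a tweet counts only if every label is right.
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"micro_f1": micro, "macro_f1": macro, "accuracy": exact}
```

Under these definitions, rare topics that the model handles poorly pull macro F1 down more than micro F1, and a single wrong label on a tweet is enough to count that tweet as incorrect for accuracy.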

Usage

import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def sigmoid(x):
  return 1 / (1 + math.exp(-x))
  
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-2019-90m-tweet-topic-multi-2020")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-2019-90m-tweet-topic-multi-2020", problem_type="multi_label_classification")
model.eval()
class_mapping = model.config.id2label

with torch.no_grad():
  text = "#NewVideo Cray Dollas- Water- Ft. Charlie Rose- (Official Music Video)- {{URL}} via {@YouTube@} #watchandlearn {{USERNAME}}"
  tokens = tokenizer(text, return_tensors='pt')
  output = model(**tokens)
  flags = [sigmoid(s) > 0.5 for s in output[0][0].detach().tolist()]
  topic = [class_mapping[n] for n, i in enumerate(flags) if i]
print(topic)
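
To classify several tweets at once, the thresholding step above can be applied row by row to a batch of logits. The helper below is a hedged sketch rather than part of the original usage: `stable_sigmoid` and `logits_to_topics` are hypothetical names, the 0.5 threshold is carried over from the example above, and the logits are assumed to come from something like `model(**tokenizer(texts, return_tensors='pt', padding=True, truncation=True)).logits.tolist()`.

```python
import math


def stable_sigmoid(x):
    """Sigmoid that avoids math.exp overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logits_to_topics(logit_rows, id2label, threshold=0.5):
    """Turn a batch of logits into per-tweet (label, probability) lists.

    logit_rows: list of lists of floats, one row per tweet
                (ASSUMPTION: e.g. output.logits.tolist()).
    id2label:   mapping from class index to label name,
                such as model.config.id2label.
    Labels at or below the threshold are dropped; the rest are sorted by
    descending probability.
    """
    results = []
    for row in logit_rows:
        scored = [(id2label[i], stable_sigmoid(s)) for i, s in enumerate(row)]
        kept = [(label, p) for label, p in scored if p > threshold]
        results.append(sorted(kept, key=lambda lp: lp[1], reverse=True))
    return results
```

Because the model is multi-label, a tweet can receive several topics or none at all, so an empty list is a valid result.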

Reference


@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis  and
    Ushio, Asahi  and
    Camacho-Collados, Jose  and
    Neves, Leonardo  and
    Silva, Vitor  and
    Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}
