Disclaimer: This page is under maintenance. Please DO NOT refer to the information on this page to make any decision yet.

Vaccinating COVID tweets

A fine-tuned model for fact-classification task on English tweets about COVID-19/vaccine.

Intended uses & limitations

You can classify if the input tweet (or any others statement) about COVID-19/vaccine is true, false or misleading. Note that since this model was trained with data up to May 2020, the most recent information may not be reflected.

How to use

You can use this model directly on this page or using transformers in python.

Load pipeline and implement with input sequence

from transformers import pipeline
pipe = pipeline("sentiment-analysis", model = "ans/vaccinating-covid-tweets")
seq = "Vaccines to prevent SARS-CoV-2 infection are considered the most promising approach for curbing the pandemic."
pipe(seq)

Expected output

  [
    {
      "label": "false",
      "score": 0.07972867041826248
    },
    {
      "label": "misleading",
      "score": 0.019911376759409904
    },
    {
      "label": "true",
      "score": 0.9003599882125854
    }
  ]

true examples

"By the end of 2020, several vaccines had become available for use in different parts of the world."
"Vaccines to prevent SARS-CoV-2 infection are considered the most promising approach for curbing the pandemic."
"RNA vaccines were the first vaccines for SARS-CoV-2 to be produced and represent an entirely new vaccine approach."

false examples

"COVID-19 vaccine caused new strain in UK."

Limitations and bias

To conservatively classify whether an input sequence is true or not, the model may have predictions biased toward false or misleading.

Training data & Procedure

Pre-trained baseline model

Pre-trained model: BERTweet
- trained based on the RoBERTa pre-training procedure
- 850M General English Tweets (Jan 2012 to Aug 2019)
- 23M COVID-19 English Tweets
- Size of the model: >134M parameters
Further training
- Pre-training with recent COVID-19/vaccine tweets and fine-tuning for fact classification

1) Pre-training language model

The model was pre-trained on COVID-19/vaccined related tweets using a masked language modeling (MLM) objective starting from BERTweet.
Following datasets on English tweets were used:
- Tweets with trending #CovidVaccine hashtag, 207,000 tweets uploaded across Aug 2020 to Apr 2021 (kaggle)
- Tweets about all COVID-19 vaccines, 78,000 tweets uploaded across Dec 2020 to May 2021 (kaggle)
- COVID-19 Twitter chatter dataset, 590,000 tweets uploaded across Mar 2021 to May 2021 (github)

2) Fine-tuning for fact classification

A fine-tuned model from pre-trained language model (1) for fact-classification task on COVID-19/vaccine.
COVID-19/vaccine-related statements were collected from Poynter and Snopes using Selenium resulting in over 14,000 fact-checked statements from Jan 2020 to May 2021.
Original labels were divided within following three categories:
- False: includes false, no evidence, manipulated, fake, not true, unproven and unverified
- Misleading: includes misleading, exaggerated, out of context and needs context
- True: includes true and correct

Evaluation results

Training loss	Validation loss	Training accuracy	Validation accuracy
0.1062	0.1006	96.3%	94.5%

Contributors

This model is a part of final team project from MLDL for DS class at SNU.
- Team BIBI - Vaccinating COVID-NineTweets
- Team members: Ahn, Hyunju; An, Jiyong; An, Seungchan; Jeong, Seokho; Kim, Jungmin; Kim, Sangbeom
- Advisor: Prof. Wen-Syan Li