---
language: en
license: apache-2.0
datasets:
- tweets
widget:
- text: >-
    Vaccines to prevent SARS-CoV-2 infection are considered the most promising
    approach for curbing the pandemic.
---
**Disclaimer**: This page is under maintenance. Please DO NOT refer to the information on this page to make any decisions yet.
# Vaccinating COVID tweets

A fine-tuned model for a fact-classification task on English tweets about COVID-19/vaccines.
## Intended uses & limitations
You can classify whether an input tweet (or any other statement) about COVID-19/vaccines is `true`, `false`, or `misleading`.
Note that since this model was trained with data up to May 2021, the most recent information may not be reflected.
## How to use
You can use this model directly through the inference widget on this page, or with the `transformers` library in Python.
- Load the pipeline and run it on an input sequence:

```python
from transformers import pipeline

pipe = pipeline("sentiment-analysis", model="ans/vaccinating-covid-tweets")
seq = "Vaccines to prevent SARS-CoV-2 infection are considered the most promising approach for curbing the pandemic."
# top_k=None returns the scores for all three labels, as shown below
pipe(seq, top_k=None)
```
- Expected output:

```json
[
  {
    "label": "false",
    "score": 0.07972867041826248
  },
  {
    "label": "misleading",
    "score": 0.019911376759409904
  },
  {
    "label": "true",
    "score": 0.9003599882125854
  }
]
```
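To keep only the most likely label, take the entry with the highest score; a minimal sketch:

```python
# Pick the label with the highest score from the full output above.
top = max(pipe(seq, top_k=None), key=lambda d: d["score"])
print(top["label"])  # -> "true"
```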
- `true` examples:
  - "By the end of 2020, several vaccines had become available for use in different parts of the world."
  - "Vaccines to prevent SARS-CoV-2 infection are considered the most promising approach for curbing the pandemic."
  - "RNA vaccines were the first vaccines for SARS-CoV-2 to be produced and represent an entirely new vaccine approach."
- `false` example:
  - "COVID-19 vaccine caused new strain in UK."
## Limitations and bias
To classify conservatively whether an input sequence is true, the model's predictions may be biased toward `false` or `misleading`.
## Training data & procedure
### Pre-trained baseline model

- Pre-trained model: BERTweet
  - trained following the RoBERTa pre-training procedure
  - 850M general English tweets (Jan 2012 to Aug 2019)
  - 23M COVID-19 English tweets
- Size of the model: >134M parameters
- Further training
  - pre-training with recent COVID-19/vaccine tweets, followed by fine-tuning for fact classification
### 1) Pre-training language model
- The model was further pre-trained on COVID-19/vaccine-related tweets using a masked language modeling (MLM) objective, starting from BERTweet (see the sketch after this list).
- The following datasets of English tweets were used:
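The card does not include the pre-training script, so the following is only a minimal sketch of MLM further pre-training from the BERTweet checkpoint; the corpus file `covid_tweets.txt` and all hyperparameters are assumptions, not the authors' actual settings.

```python
# Sketch of MLM further pre-training starting from BERTweet.
# ASSUMPTIONS: covid_tweets.txt (one tweet per line) and all hyperparameters.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

# Hypothetical corpus of COVID-19/vaccine-related tweets.
dataset = load_dataset("text", data_files={"train": "covid_tweets.txt"})

def tokenize(batch):
    # BERTweet was pre-trained with a maximum length of 128 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of the tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bertweet-covid-mlm", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model()  # writes the adapted model to bertweet-covid-mlm/
```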
### 2) Fine-tuning for fact classification
- The language model from step 1) was fine-tuned for a fact-classification task on COVID-19/vaccine statements.
- COVID-19/vaccine-related statements were collected from Poynter and Snopes using Selenium, resulting in over 14,000 fact-checked statements from Jan 2020 to May 2021.
- The original labels were grouped into the following three categories (a fine-tuning sketch follows the list):
  - `false`: includes false, no evidence, manipulated, fake, not true, unproven, and unverified
  - `misleading`: includes misleading, exaggerated, out of context, and needs context
  - `true`: includes true and correct
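For illustration, here is a minimal sketch of the fine-tuning step; the CSV file `fact_checked_statements.csv`, its column names, and the hyperparameters are assumptions, not the authors' actual setup.

```python
# Sketch of fine-tuning the MLM-adapted checkpoint for 3-way fact classification.
# ASSUMPTIONS: fact_checked_statements.csv (columns "text", "label") and hyperparameters.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

label2id = {"false": 0, "misleading": 1, "true": 2}

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "bertweet-covid-mlm",  # output of the pre-training sketch above
    num_labels=3,
    id2label={v: k for k, v in label2id.items()},
    label2id=label2id,
)

dataset = load_dataset("csv", data_files={"train": "fact_checked_statements.csv"})

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["labels"] = [label2id[label] for label in batch["label"]]
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=["text", "label"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vaccinating-covid-tweets", num_train_epochs=3),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```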
## Evaluation results
| Training loss | Validation loss | Training accuracy | Validation accuracy |
|---|---|---|---|
| 0.1062 | 0.1006 | 96.3% | 94.5% |
## Contributors
- This model is part of the final team project of the MLDL for DS class at SNU.
- Team BIBI: Vaccinating COVID-NineTweets
- Team members: Ahn, Hyunju; An, Jiyong; An, Seungchan; Jeong, Seokho; Kim, Jungmin; Kim, Sangbeom
- Advisor: Prof. Wen-Syan Li