bertweet-covid19-base-uncased-pretraining-covid-vaccine-tweets

This model is a fine-tuned version of justinqbui/bertweet-covid19-base-uncased-pretraining-covid-vaccine-tweets. It was fine-tuned on three sources: a Google Fact Check dataset (~3,000 examples), data web-scraped from PolitiFact's COVID coverage (~1,200 examples), and ~1,200 tweets from the CDC containing the words "covid" or "vaccine". It achieves the following results on the evaluation set (a randomly shuffled 20% of the dataset, held out as a test set):

Validation Loss: 0.267367
Accuracy: 91.1370%

To use the model, use the inference API.

Alternatively, to run the model locally:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("justinqbui/bertweet-covid-vaccine-tweets-finetuned")

model = AutoModelForSequenceClassification.from_pretrained("justinqbui/bertweet-covid-vaccine-tweets-finetuned")
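
Once the tokenizer and model are loaded, a single tweet can be scored roughly as follows. This is a minimal sketch rather than the author's own inference code: the helper names `softmax` and `classify_tweet` and the `max_length=128` truncation are assumptions for illustration, and the label names are whatever `model.config.id2label` contains in the published checkpoint.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def classify_tweet(text, model_name="justinqbui/bertweet-covid-vaccine-tweets-finetuned"):
    """Hypothetical helper: return {label: probability} for one tweet."""
    # Imported lazily so softmax() above can be used without these packages.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # max_length=128 is an assumption based on the BERTweet base architecture.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    probs = softmax(logits)
    return {model.config.id2label[i]: p for i, p in enumerate(probs)}
```

Calling `classify_tweet("Covid vaccines are safe and effective.")` downloads the checkpoint on first use and returns one probability per label.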

Model description

This model is a fine-tuned version of the pretrained model justinqbui/bertweet-covid19-base-uncased-pretraining-covid-vaccine-tweets. See that model's card for details on how the pre-training was done.

This model was fine-tuned on a dataset of ~5,500 examples. A web scraper was used to collect data from PolitiFact, and a script was used to pull data from the Google Fact Check API. Because ~80% of the examples in both of these datasets were labelled false or misleading, I pulled ~1,200 COVID-related tweets from the CDC and labelled them as true. In the combined dataset, ~30% of examples are labelled true and the rest false or misleading. Please see the published datasets for more detailed information.

The tokenizer requires the emoji library to be installed.

!pip install nltk emoji

Intended uses & limitations

The intended use of this model is to detect whether the contents of a COVID-related tweet are potentially false or misleading. This model is not a be-all and end-all; it has many limitations. For example, the model only sees text, so if a post has a satirical image attached, the model cannot take that into account. If a user links a website, the tokenizer replaces the link with a special token, meaning the contents of the linked website are completely lost. If someone tweets a reply, the model can't look at the parent tweets and will lack context.

This model's dataset relies on the crowd-sourced annotations being accurate. The data is only current up to early December 2021. For example, the model probably wouldn't do very well on tweets regarding the new Omicron variant.
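
To illustrate why link contents are lost, the sketch below mimics the kind of normalization BERTweet-style tokenizers apply, where every URL collapses to the placeholder `HTTPURL` and every user mention to `@USER`. The function `normalize_tweet` is a hypothetical, simplified stand-in written for this example, not the tokenizer's actual code.

```python
def normalize_tweet(text):
    """Simplified illustration: replace links and mentions with placeholders."""
    tokens = []
    for token in text.split():
        lowered = token.lower()
        if token.startswith("@"):
            tokens.append("@USER")      # the specific account is discarded
        elif lowered.startswith("http") or lowered.startswith("www"):
            tokens.append("HTTPURL")    # the target page is discarded
        else:
            tokens.append(token)
    return " ".join(tokens)


# Two tweets that differ only in their link become identical after normalization.
a = normalize_tweet("Vaccine facts: https://example.com/article")
b = normalize_tweet("Vaccine facts: https://example.org/satire")
print(a)
print(a == b)
```

Because both inputs reduce to the same string, the classifier has no way to tell a reputable source from a satirical one based on the link alone.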

Example true inputs:

Covid vaccines are safe and effective. -> 97% true
Vaccinations are safe and help prevent covid. -> 97% true

Example false inputs:

Covid vaccines will kill you. -> 97% false
covid vaccines make you infertile. -> 97% false

Training and evaluation data

This model was fine-tuned on a Google Fact Check dataset (~3,000 examples), data web-scraped from PolitiFact's COVID coverage (~1,200 examples), and ~1,200 tweets from the CDC containing the words "covid" or "vaccine".
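
The evaluation split described earlier (a randomly shuffled 20% hold-out) could be produced along these lines. This is a sketch under assumptions: the function name `shuffle_split`, the use of seed 42 for the split, and the toy data are illustrative, and the author's actual splitting code is not shown in this card.

```python
import random


def shuffle_split(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and hold out a fraction as a test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for the ~5,500 (text, label) pairs; not real data.
toy = [(f"tweet {i}", i % 2) for i in range(5500)]
train, test = shuffle_split(toy)
print(len(train), len(test))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.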

Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-5
- train_batch_size: 128
- eval_batch_size: 128
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
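
To make the `lr_scheduler_type: linear` setting concrete, the sketch below computes the learning rate at a given optimizer step under a linear decay from 2e-5 to zero. The function `linear_lr`, the absence of warmup, and the step count (roughly 4,400 training examples at batch size 128 for 3 epochs) are assumptions for illustration; the card does not state the warmup setting or the exact number of steps.

```python
import math


def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed sizes: ~80% of ~5,500 examples for training, batch size 128, 3 epochs.
steps_per_epoch = math.ceil(4400 / 128)
total_steps = steps_per_epoch * 3
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the rate starts at the configured 2e-5 and reaches zero at the final step.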

Training results

Training Loss	Epoch	Validation Loss	Accuracy
0.435500	1.0	0.401900	0.906893
0.309700	2.0	0.265500	0.907789
0.266200	3.0	0.216500	0.911370

Framework versions

Transformers 4.13.0
Pytorch 1.10.0+cu111
Datasets 1.16.1
Tokenizers 0.10.3
