---
language: en
license: apache-2.0
datasets:
- tweets
widget:
- text: "COVID-19 vaccines are safe and effective."
---
# Disclaimer: This page is under maintenance. Please DO NOT rely on the information on this page to make any decisions yet.
# Vaccinating COVID tweets
A model fine-tuned for fact classification on English tweets about COVID-19 and vaccines.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("ans/vaccinating-covid-tweets")
model = AutoModelForSequenceClassification.from_pretrained("ans/vaccinating-covid-tweets")
```
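Below is a minimal inference sketch that continues from the snippet above. The mapping from class indices to label names is an assumption; check `model.config.id2label` for the actual mapping exported with the model.

```python
import torch

# Classify a single tweet/claim; continues from the tokenizer and model loaded above.
text = "COVID-19 vaccines are safe and effective."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

pred_id = int(probs.argmax())
# id2label comes from the model config; the label names depend on how the head was exported.
print(model.config.id2label[pred_id], float(probs[pred_id]))
```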
#### Limitations and bias
The model is trained only on English tweets and on fact-checked statements collected between Jan 2020 and May 2021, so it may not generalize to other languages, platforms, or claims that emerged after that period. Predictions should not be treated as a substitute for professional fact-checking.
## Training data & procedure
#### Pre-trained baseline model
- Pre-trained model: [BERTweet](https://github.com/VinAIResearch/BERTweet)
  - Trained following the RoBERTa pre-training procedure
- 850M General English Tweets (Jan 2012 to Aug 2019)
- 23M COVID-19 English Tweets
- Size of the model: >134M parameters
- Further training
  - Continued pre-training on recent COVID-19/vaccine tweets, followed by fine-tuning for fact classification
#### 1) Pre-training language model
- Starting from BERTweet, the model was further pre-trained on COVID-19/vaccine-related tweets with a masked language modeling (MLM) objective (see the sketch after this list)
- The following datasets of English tweets were used:
  - Tweets with the trending #CovidVaccine hashtag: 207,000 tweets posted from Aug 2020 to Apr 2021 ([kaggle](https://www.kaggle.com/kaushiksuresh147/covidvaccine-tweets))
  - Tweets about all COVID-19 vaccines: 78,000 tweets posted from Dec 2020 to May 2021 ([kaggle](https://www.kaggle.com/gpreda/all-covid19-vaccines-tweets))
  - COVID-19 Twitter chatter dataset: 590,000 tweets posted from Mar 2021 to May 2021 ([github](https://github.com/thepanacealab/covid19_twitter))
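As a rough illustration, continued MLM pre-training of this kind can be set up with the Hugging Face `Trainer` as sketched below. The starting checkpoint is the public BERTweet base model, while the file name `covid_vaccine_tweets.csv`, its `text` column, and all hyperparameters are placeholder assumptions rather than the team's actual configuration.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the BERTweet checkpoint and continue MLM pre-training on COVID-19/vaccine tweets.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

# "covid_vaccine_tweets.csv" with a "text" column is a hypothetical local file.
dataset = load_dataset("csv", data_files="covid_vaccine_tweets.csv")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=dataset.column_names)

# Randomly mask 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bertweet-covid-mlm", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("bertweet-covid-mlm")        # save weights for the fine-tuning step
tokenizer.save_pretrained("bertweet-covid-mlm")
```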
#### 2) Fine-tuning for fact classification
- The language model pre-trained above (starting from [BERTweet](https://github.com/VinAIResearch/BERTweet)) was then fine-tuned for a 3-class fact-classification task on COVID-19/vaccine claims (a minimal sketch follows this list)
- Training data: 14,000 fact-checked statements from Jan 2020 to May 2021, collected from Poynter and Snopes using Selenium
- The original fact-check verdicts were grouped into 3 categories:
  - False: false, no evidence, manipulated, fake, not true, unproven, unverified
  - Misleading: misleading, exaggerated, out of context, needs context
  - True: true, correct
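A corresponding fine-tuning sketch for the 3-class classification head is shown below. The checkpoint `bertweet-covid-mlm` (the output of the pre-training sketch), the file `fact_checked_statements.csv`, its `statement`/`label` columns, and the hyperparameters are illustrative assumptions, not the team's actual setup.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

labels = ["False", "Misleading", "True"]  # the 3 categories described above
label2id = {name: i for i, name in enumerate(labels)}

# Load the MLM-pre-trained checkpoint and add a 3-way classification head.
checkpoint = "bertweet-covid-mlm"  # hypothetical path from the pre-training sketch
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=3,
    id2label={i: name for name, i in label2id.items()}, label2id=label2id)

# "fact_checked_statements.csv" with "statement" and "label" columns is a hypothetical file.
dataset = load_dataset("csv", data_files="fact_checked_statements.csv")["train"]

def preprocess(batch):
    enc = tokenizer(batch["statement"], truncation=True,
                    padding="max_length", max_length=128)
    enc["labels"] = [label2id[l] for l in batch["label"]]
    return enc

dataset = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vaccinating-covid-tweets",
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```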
## Eval results
## Contributors
- This page is part of the final team project for the MLDL for DS class at SNU
- Team BIBI - Vaccinating COVID-NineTweets
- Team members: Ahn, Hyunju; An, Jiyong; An, Seungchan; Jeong, Seokho; Kim, Jungmin; Kim, Sangbeom
- Advisor: Prof. Wen-Syan Li
<img src="https://gsds.snu.ac.kr/sites/gsds.snu.ac.kr/files/GSDS_logo.png" width="300" height="100">