---
language: en
license: apache-2.0
datasets:
- tweets
widget:
- text: "Vaccines to prevent SARS-CoV-2 infection are considered the most promising approach for curbing the pandemic."
---

# Disclaimer: This page is under maintenance. Please DO NOT refer to the information on this page to make any decision yet.

# Vaccinating COVID tweets
A model fine-tuned for fact classification of English tweets about COVID-19 and COVID-19 vaccines.

## Intended uses & limitations
You can classify whether an input tweet (or any other statement) about COVID-19 or COVID-19 vaccines is `true`, `false`, or `misleading`.
Note that the training data described below extends only to May 2021, so more recent information may not be reflected.

#### How to use
You can use this model directly on this page or with the `transformers` library in Python.

- Load the pipeline and run it on an input sequence
```python
from transformers import pipeline

# "sentiment-analysis" is the transformers alias for the text-classification pipeline
pipe = pipeline("sentiment-analysis", model="ans/vaccinating-covid-tweets")

# Statement to classify
seq = "Vaccines to prevent SARS-CoV-2 infection are considered the most promising approach for curbing the pandemic."
pipe(seq)
```

- Expected output
```python
  [
    {
      "label": "false",
      "score": 0.07972867041826248
    },
    {
      "label": "misleading",
      "score": 0.019911376759409904
    },
    {
      "label": "true",
      "score": 0.9003599882125854
    }
  ]
```

- `true` examples
```python
"By the end of 2020, several vaccines had become available for use in different parts of the world."
"Vaccines to prevent SARS-CoV-2 infection are considered the most promising approach for curbing the pandemic."
"RNA vaccines were the first vaccines for SARS-CoV-2 to be produced and represent an entirely new vaccine approach."
```

- `false` examples
```python
"COVID-19 vaccine caused new strain in UK."
```
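
The score list shown under "Expected output" can be reduced to a single prediction by taking the highest-scoring label. The helper below is a minimal sketch: `top_label` is a hypothetical name (not part of `transformers`), and it assumes the scores arrive as a flat list of `label`/`score` dictionaries, as in the example above.

```python
def top_label(scores):
    """Return the (label, score) pair with the highest score.

    `scores` is assumed to be a flat list of {"label": ..., "score": ...}
    dictionaries, like the expected output shown in this card.
    """
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# Scores copied from the expected output above.
example_scores = [
    {"label": "false", "score": 0.07972867041826248},
    {"label": "misleading", "score": 0.019911376759409904},
    {"label": "true", "score": 0.9003599882125854},
]

label, score = top_label(example_scores)
print(label, round(score, 4))
```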

#### Limitations and bias
Because the model is meant to judge conservatively whether an input sequence is true, its predictions may be biased toward `false` or `misleading`.

## Training data & Procedure

#### Pre-trained baseline model
- Pre-trained model: [BERTweet](https://github.com/VinAIResearch/BERTweet)
  - trained based on the RoBERTa pre-training procedure
  - 850M General English Tweets (Jan 2012 to Aug 2019)
  - 23M COVID-19 English Tweets
  - Size of the model: >134M parameters
- Further training
  - Pre-training with recent COVID-19/vaccine tweets and fine-tuning for fact classification

#### 1) Pre-training language model
- Starting from BERTweet, the model was further pre-trained on COVID-19/vaccine-related tweets using a masked language modeling (MLM) objective.
- The following datasets of English tweets were used:
  - Tweets with the trending #CovidVaccine hashtag: 207,000 tweets posted from Aug 2020 to Apr 2021 ([Kaggle](https://www.kaggle.com/kaushiksuresh147/covidvaccine-tweets))
  - Tweets about all COVID-19 vaccines: 78,000 tweets posted from Dec 2020 to May 2021 ([Kaggle](https://www.kaggle.com/gpreda/all-covid19-vaccines-tweets))
  - COVID-19 Twitter chatter dataset: 590,000 tweets posted from Mar 2021 to May 2021 ([GitHub](https://github.com/thepanacealab/covid19_twitter))
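
The masked language modeling objective mentioned above can be illustrated with a toy sketch. This is not the training code used for this model: the whitespace tokenization, the 15% default masking rate, the sample tweet, and the `mask_tokens` helper are all assumptions made for illustration, and real MLM implementations typically also replace some selected tokens with random or unchanged tokens.

```python
import random

MASK_TOKEN = "<mask>"  # mask token used by RoBERTa-style tokenizers


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK_TOKEN and record the targets.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    on the masked positions.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Illustrative tweet (not taken from the training data), split on whitespace.
tweet = "Got my first vaccine dose today".split()
masked, labels = mask_tokens(tweet, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During pre-training, the model is asked to predict the original token at each masked position from the surrounding context.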
  
#### 2) Fine-tuning for fact classification
- The pre-trained language model from step 1 was fine-tuned for fact classification of COVID-19/vaccine statements.
- COVID-19/vaccine-related statements were collected from [Poynter](https://www.poynter.org/ifcn-covid-19-misinformation/) and [Snopes](https://www.snopes.com/) using Selenium, resulting in over 14,000 fact-checked statements from Jan 2020 to May 2021.
- The original labels were grouped into the following three categories:
  - `False`: includes false, no evidence, manipulated, fake, not true, unproven and unverified
  - `Misleading`: includes misleading, exaggerated, out of context and needs context
  - `True`: includes true and correct
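
The grouping above can be sketched as a simple lookup table. The rating strings are taken from the list above; the lower-casing, whitespace stripping, handling of unlisted ratings, and the `to_category` helper name are assumptions, since the card does not describe the actual preprocessing.

```python
# Original fact-checker ratings grouped as listed above (lower-cased).
CATEGORY_TO_RATINGS = {
    "false": ["false", "no evidence", "manipulated", "fake",
              "not true", "unproven", "unverified"],
    "misleading": ["misleading", "exaggerated", "out of context",
                   "needs context"],
    "true": ["true", "correct"],
}

# Invert the grouping so each raw rating points to its category.
RATING_TO_CATEGORY = {
    rating: category
    for category, ratings in CATEGORY_TO_RATINGS.items()
    for rating in ratings
}


def to_category(raw_rating):
    """Map a raw rating to one of the three categories, or None if unlisted."""
    return RATING_TO_CATEGORY.get(raw_rating.strip().lower())


print(to_category("No Evidence"))
print(to_category("Needs context"))
```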

## Evaluation results
| Training loss | Validation loss | Training accuracy | Validation accuracy |
| --- | --- | --- | --- |
| 0.1062 | 0.1006 | 96.3% | 94.5% |

# Contributors
- This model is part of the final team project for the MLDL for DS class at SNU.
  - Team BIBI - Vaccinating COVID-NineTweets
  - Team members: Ahn, Hyunju; An, Jiyong; An, Seungchan; Jeong, Seokho; Kim, Jungmin; Kim, Sangbeom
  - Advisor: Prof. Wen-Syan Li

<a href="https://gsds.snu.ac.kr/"><img src="https://gsds.snu.ac.kr/wp-content/uploads/sites/50/2021/04/GSDS_logo2-e1619068952717.png" width="200" height="80"></a>