Update README.md
README.md
CHANGED
@@ -1,5 +1,4 @@
|
|
1 |
---
|
2 |
-
|
3 |
language: en
|
4 |
license: apache-2.0
|
5 |
datasets:
|
@@ -11,8 +10,7 @@ widget:
|
|
11 |
# Disclaimer: This page is under maintenance. Please DO NOT refer to the information on this page to make any decision yet.
|
12 |
|
13 |
# Vaccinating COVID tweets
|
14 |
-
|
15 |
-
Fine-tuned model on English language using a masked language modeling (MLM) objective from BERTweet in [this repository](https://github.com/VinAIResearch/BERTweet) for the classification task for false/misleading information about COVID-19 vaccines.
|
16 |
|
17 |
## Intended uses & limitations
|
18 |
|
@@ -26,32 +24,28 @@ Fine-tuned model on English language using a masked language modeling (MLM) obje
|
|
26 |
|
27 |
Provide examples of latent issues and potential remediations.
|
28 |
|
29 |
-
## Training data
|
30 |
-
|
31 |
-
#### 1) Pre-training language model
|
32 |
-
- Tweets with trending #CovidVaccine hashtag 207,000 tweets uploaded across 2020-08-18 ~ 2021-04-20 [3]
|
33 |
-
- Tweets about all COVID-19 vaccines 78,000 tweets uploaded across 2020-12-20 ~ 2021-05-13 [4]
|
34 |
-
- Covid-19 Twitter chatter dataset 590,000 tweets uploaded across 2021-03-01 ~ 2021-05-20 [5]
|
35 |
-
|
36 |
-
#### 2) Fine-tuning for fact classification
|
37 |
-
- Statements from Poynter and Snopes with Selenium 14,000 fact-checked statements from 2020-01-14 to 2021-05-13
|
38 |
-
- Divide original labels within 3 categories
|
39 |
-
False: False, no evidence, manipulated, fake, not true, unproven, unverified
|
40 |
-
Misleading: Misleading, exaggerated, out of context, needs context
|
41 |
-
True: True, correct
|
42 |
-
|
43 |
-
Describe the data you used to train the model.
|
44 |
-
If you initialized it with pre-trained weights, add a link to the pre-trained model card or repository with description of the pre-training data.
|
45 |
-
|
46 |
-
## Training procedure
|
47 |
|
48 |
-
-
|
|
|
49 |
- trained based on the RoBERTa pre-training procedure
|
50 |
-
- 850M General English Tweets (Jan 2012
|
51 |
- 23M COVID-19 English Tweets
|
52 |
- Size of the model: >134M parameters
|
53 |
- Further training
|
54 |
-
-
|
55 |
|
56 |
## Eval results
|
57 |
|
|
|
1 |
---
|
|
|
2 |
language: en
|
3 |
license: apache-2.0
|
4 |
datasets:
|
|
|
10 |
# Disclaimer: This page is under maintenance. Please DO NOT refer to the information on this page to make any decision yet.
|
11 |
|
12 |
# Vaccinating COVID tweets
|
13 |
+
An English-language model built on BERTweet ([this repository](https://github.com/VinAIResearch/BERTweet)): further pre-trained with a masked language modeling (MLM) objective, then fine-tuned to classify the factuality of statements about COVID-19 and COVID-19 vaccines.
|
|
|
14 |
|
15 |
## Intended uses & limitations
|
16 |
|
|
|
24 |
|
25 |
Provide examples of latent issues and potential remediations.
|
26 |
|
27 |
+
## Training data & Procedure
|
28 |
|
29 |
+
#### Pre-trained baseline model
|
30 |
+
- Pre-trained model: [BERTweet](https://github.com/VinAIResearch/BERTweet)
|
31 |
- trained based on the RoBERTa pre-training procedure
|
32 |
+
- 850M General English Tweets (Jan 2012 to Aug 2019)
|
33 |
- 23M COVID-19 English Tweets
|
34 |
- Size of the model: >134M parameters
|
35 |
- Further training
|
36 |
+
- Pre-training with recent COVID-19/vaccine tweets and fine-tuning for fact classification
|
37 |
+
|
38 |
+
#### 1) Pre-training language model
|
39 |
+
- Tweets with the trending #CovidVaccine hashtag: 207,000 tweets uploaded from Aug 2020 to Apr 2021 ([Kaggle](https://www.kaggle.com/kaushiksuresh147/covidvaccine-tweets))
|
40 |
+
- Tweets about all COVID-19 vaccines: 78,000 tweets uploaded from Dec 2020 to May 2021 ([Kaggle](https://www.kaggle.com/gpreda/all-covid19-vaccines-tweets))
|
41 |
+
- COVID-19 Twitter chatter dataset: 590,000 tweets uploaded from Mar 2021 to May 2021 ([GitHub](https://github.com/thepanacealab/covid19_twitter))
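
The masked language modeling objective used for this pre-training step can be illustrated with a small sketch. It is a simplification, not the procedure used for this model: the function name `mask_tokens`, the 15% masking rate, and the `<mask>` token string are assumptions for illustration, and production pipelines typically also leave some selected tokens unchanged or swap in random tokens.

```python
import random

MASK_TOKEN = "<mask>"  # assumed mask string, for illustration only


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly hide tokens for a masked language modeling example.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where position i was masked, and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # the model must predict this position
            labels.append(tok)
        else:
            masked.append(tok)         # visible context, no loss computed
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    tweet = "got my second vaccine dose today".split()
    print(mask_tokens(tweet, seed=0))
```

During training, the loss would be computed only at positions whose label is not `None`.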
|
42 |
+
|
43 |
+
#### 2) Fine-tuning for fact classification
|
44 |
+
- Statements collected from Poynter and Snopes with Selenium: 14,000 fact-checked statements from Jan 2020 to May 2021
|
45 |
+
- Original labels grouped into 3 categories
|
46 |
+
- False: false, no evidence, manipulated, fake, not true, unproven, unverified
|
47 |
+
- Misleading: misleading, exaggerated, out of context, needs context
|
48 |
+
- True: true, correct
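
The grouping of original fact-checker labels into three categories could be expressed as a lookup table. This is a minimal sketch, assuming raw labels arrive as free-text strings; the names `LABEL_GROUPS`, `RAW_TO_CATEGORY`, and `normalize_label` are hypothetical and not taken from the model's actual preprocessing code.

```python
# Raw labels per category, as listed in the model card.
LABEL_GROUPS = {
    "False": ["false", "no evidence", "manipulated", "fake",
              "not true", "unproven", "unverified"],
    "Misleading": ["misleading", "exaggerated", "out of context",
                   "needs context"],
    "True": ["true", "correct"],
}

# Invert the table so each raw label points at its category.
RAW_TO_CATEGORY = {
    raw: category
    for category, raws in LABEL_GROUPS.items()
    for raw in raws
}


def normalize_label(raw_label):
    """Map a raw fact-checker label to one of the three categories.

    Matching ignores case and surrounding whitespace. Labels that are
    not in the table return None so the caller can decide what to do.
    """
    return RAW_TO_CATEGORY.get(raw_label.strip().lower())


if __name__ == "__main__":
    for raw in ["FAKE", " Out of Context ", "correct", "satire"]:
        print(repr(raw), "->", normalize_label(raw))
```

Returning `None` for unlisted labels keeps the sketch from silently assigning a category the card does not define.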
|
49 |
|
50 |
## Eval results
|
51 |
|