ans committed on
Commit 9e7be3b
1 Parent(s): 3addc29

Update README.md

Files changed (1): README.md +18 -24
README.md CHANGED
@@ -1,5 +1,4 @@
---
-
language: en
license: apache-2.0
datasets:
@@ -11,8 +10,7 @@ widget:
# Disclaimer: This page is under maintenance. Please DO NOT refer to the information on this page to make any decision yet.

# Vaccinating COVID tweets
-
- Fine-tuned model on English language using a masked language modeling (MLM) objective from BERTweet in [this repository](https://github.com/VinAIResearch/BERTweet) for the classification task for false/misleading information about COVID-19 vaccines.
+ Fine-tuned model on English language using a masked language modeling (MLM) objective from BERTweet in [this repository](https://github.com/VinAIResearch/BERTweet) for the classification task for factual information about COVID-19/vaccine.

  ## Intended uses & limitations
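To make the intended use concrete, a minimal inference sketch with the Hugging Face Transformers pipeline is shown below; the checkpoint id `ans/vaccinating-covid-tweets` is an assumed Hub id for this card, and the returned labels come from the model configuration rather than from this commit.

```python
# Minimal inference sketch; the checkpoint id is an assumed Hub id for this card,
# and the label names come from the model config rather than from this commit.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="ans/vaccinating-covid-tweets",  # assumed checkpoint id
)

print(classifier("COVID-19 vaccines alter your DNA."))
# -> a list like [{'label': ..., 'score': ...}] with one of the three fact labels
```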
 
@@ -26,32 +24,28 @@ Fine-tuned model on English language using a masked language modeling (MLM) obje

Provide examples of latent issues and potential remediations.

- ## Training data
-
- #### 1) Pre-training language model
- - Tweets with trending #CovidVaccine hashtag 207,000 tweets uploaded across 2020-08-18 ~ 2021-04-20 [3]
- - Tweets about all COVID-19 vaccines 78,000 tweets uploaded across 2020-12-20 ~ 2021-05-13 [4]
- - Covid-19 Twitter chatter dataset 590,000 tweets uploaded across 2021-03-01 ~ 2021-05-20 [5]
-
- #### 2) Fine-tuning for fact classification
- - Statements from Poynter and Snopes with Selenium 14,000 fact-checked statements from 2020-01-14 to 2021-05-13
- - Divide original labels within 3 categories
- False: \t\tFalse, no evidence, manipulated, fake, not true, unproven, unverified
- Misleading: \tMisleading, exaggerated, out of context, needs context
- True: \t\tTrue, correct
-
- Describe the data you used to train the model.
- If you initialized it with pre-trained weights, add a link to the pre-trained model card or repository with description of the pre-training data.
-
- ## Training procedure
+ ## Training data & Procedure

- - Baseline model: [BERTweet](https://github.com/VinAIResearch/BERTweet)
+ #### Pre-trained baseline model
+ - Pre-trained model: [BERTweet](https://github.com/VinAIResearch/BERTweet)
- trained based on the RoBERTa pre-training procedure
- - 850M General English Tweets (Jan 2012 ~ Aug 2019)
+ - 850M General English Tweets (Jan 2012 to Aug 2019)
- 23M COVID-19 English Tweets
- Size of the model: >134M parameters
- Further training
- - Training with recent COVID-19 and vaccine tweets
+ - Pre-training with recent COVID-19/vaccine tweets and fine-tuning for fact classification
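For reference, the baseline described above is the public BERTweet-base checkpoint on the Hugging Face Hub (`vinai/bertweet-base`); a minimal sketch for loading it and checking the parameter count quoted in the card is shown below. The snippet is illustrative and not part of this commit.

```python
# Sketch: load the public BERTweet-base checkpoint and count its parameters
# (the card states >134M parameters for the baseline).
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 135M for BERTweet-base
```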
+
+ #### 1) Pre-training language model
+ - Tweets with trending #CovidVaccine hashtag, 207,000 tweets uploaded across Aug 2020 to Apr 2021 [kaggle](https://www.kaggle.com/kaushiksuresh147/covidvaccine-tweets)
+ - Tweets about all COVID-19 vaccines, 78,000 tweets uploaded across Dec 2020 to May 2021 [kaggle](https://www.kaggle.com/gpreda/all-covid19-vaccines-tweets)
+ - COVID-19 Twitter chatter dataset, 590,000 tweets uploaded across Mar 2021 to May 2021 [github](https://github.com/thepanacealab/covid19_twitter)
+
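A rough sketch of what continued masked-language-model pre-training on tweet corpora like those listed above can look like with the Transformers `Trainer` follows; the file path, sequence length, and hyperparameters are placeholders rather than the authors' actual setup.

```python
# Illustrative sketch only: continue MLM pre-training of BERTweet on a file of
# COVID-19/vaccine tweets (one tweet per line). Paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
model = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")

dataset = load_dataset("text", data_files={"train": "covid_vaccine_tweets.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bertweet-covid-mlm", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```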
+ #### 2) Fine-tuning for fact classification
+ - Statements from Poynter and Snopes with Selenium, 14,000 fact-checked statements from Jan 2020 to May 2021
+ - Divide original labels within 3 categories
+   - False: false, no evidence, manipulated, fake, not true, unproven, unverified
+   - Misleading: misleading, exaggerated, out of context, needs context
+   - True: true, correct
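The three-way label consolidation above maps naturally onto a lookup table; the sketch below mirrors the verdict strings listed in the card, while the helper function and its normalization are illustrative.

```python
# Sketch of the 3-way label consolidation described above; the verdict strings
# mirror the card's list, the helper itself is illustrative.
LABEL_MAP = {
    "false": "False", "no evidence": "False", "manipulated": "False",
    "fake": "False", "not true": "False", "unproven": "False", "unverified": "False",
    "misleading": "Misleading", "exaggerated": "Misleading",
    "out of context": "Misleading", "needs context": "Misleading",
    "true": "True", "correct": "True",
}

def consolidate(verdict: str) -> str:
    """Map a raw fact-check verdict onto False / Misleading / True."""
    return LABEL_MAP[verdict.strip().lower()]

print(consolidate("Out of context"))  # -> Misleading
```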

## Eval results