ans committed
Commit
d9d1bc0
1 Parent(s): caaf64f

Update README.md

Files changed (1):
1. README.md (+232 -0)
README.md CHANGED

---
language: en
tags:
- text-classification
license: apache-2.0
datasets:
- tweets
widget:
- text: "Vaccine is effective"
---

# Vaccinating COVID tweets

- A part of the MDLD for DS class at SNU

An English-language model fine-tuned from BERTweet ([this repository](https://github.com/VinAIResearch/BERTweet)), which was pretrained with a masked language modeling (MLM) objective, for the task of classifying false/misleading information about COVID-19 vaccines.

# BERT base model (uncased)

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is uncased: it does not make a difference between english and English.

## Model description

You can embed local or remote images using `![](...)`

## Intended uses & limitations

#### How to use

```python
# You can include sample code which will be formatted
```
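
For the downstream classification task, a minimal sketch could look like the following. This is an illustration only: it assumes the checkpoint `ans/vaccinating-covid-tweets` (the name used elsewhere in this card) exposes a sequence-classification head for the misinformation labels.

```python
# Hypothetical usage sketch; the presence of a classification head on this
# checkpoint is an assumption, not confirmed by this card.
from transformers import pipeline

classifier = pipeline("text-classification", model="ans/vaccinating-covid-tweets")

# Example input taken from the widget entry in the card metadata.
print(classifier("Vaccine is effective"))
# -> [{'label': ..., 'score': ...}]  (label names depend on the fine-tuned head)
```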

#### Limitations and bias

Provide examples of latent issues and potential remediations.

## Training data

Describe the data you used to train the model. If you initialized it with pre-trained weights, add a link to the pre-trained model card or repository with a description of the pre-training data.

## Training procedure

Preprocessing, hardware used, hyperparameters...

## Eval results

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2020}
}
```

------------------------

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at models like GPT2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='ans/vaccinating-covid-tweets')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers).

## Training procedure

### Preprocessing

The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two "sentences" has a combined length of less than 512 tokens.

The details of the masking procedure for each sentence are the following (a short sketch follows the list):

- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
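
The following is a rough, self-contained sketch of that 80/10/10 rule (an illustration, not the original pretraining code):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Apply the 80/10/10 masking rule described above to a list of tokens."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # None = token not selected for prediction
    for i, token in enumerate(tokens):
        if random.random() >= mask_prob:   # ~85% of tokens are left untouched
            continue
        labels[i] = token                  # the model must predict the original token
        r = random.random()
        if r < 0.8:                        # 80%: replace with [MASK]
            masked[i] = mask_token
        elif r < 0.9:                      # 10%: replace with a random, different token
            masked[i] = random.choice([t for t in vocab if t != token])
        # remaining 10%: keep the original token unchanged
    return masked, labels
```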

### Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
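
For reference, those hyperparameters can be written down with standard PyTorch and `transformers` utilities. This is only a sketch under the stated settings, using `AdamW` as a stand-in for BERT's Adam-with-weight-decay; it is not the original TPU training setup:

```python
import torch
from transformers import BertForMaskedLM, get_linear_schedule_with_warmup

model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Adam-style optimizer: lr 1e-4, betas (0.9, 0.999), weight decay 0.01.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)

# 10,000 warmup steps followed by linear decay over the 1,000,000 training steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=1_000_000
)
```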

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
| | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |

# Contributors

- Ahn, Hyunju
- An, Jiyong
- An, Seungchan
- Jeong, Seokho
- Kim, Jungmin
- Kim, Sangbeom
- Advisor: Dr. Wen-Syan Li

Disclaimer: The team releasing BERT did not write a model card for this model, so this model card has been written by the Hugging Face team.