---
language: "en"
tags:
- twitter
- masked-token-prediction
- election2020
- politics
license: "gpl-3.0"
---

# Pre-trained BERT on Twitter US Political Election 2020

Pre-trained weights for [Knowledge Enhanced Masked Language Model for Stance Detection](https://www.aclweb.org/anthology/2021.naacl-main.376), NAACL 2021.

We initialize the weights from BERT-base (uncased), i.e. `bert-base-uncased`.

# Training Data

This model is pre-trained on over 5 million English tweets about the 2020 US Presidential Election.

# Training Objective

This model is initialized with BERT-base and trained with the standard masked language modeling (MLM) objective.
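
As an illustration of this objective, here is a minimal sketch of continued MLM pre-training with Hugging Face Transformers. It is not the authors' actual training script; the corpus file `tweets.txt`, the batch size, and the other hyperparameters are assumptions.

```python
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the same initialization as this model: BERT-base (uncased)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical corpus: one tweet per line
texts = [line.strip() for line in open("tweets.txt", encoding="utf-8") if line.strip()]
encodings = tokenizer(texts, truncation=True, max_length=128)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# Standard BERT MLM setting: randomly mask 15% of tokens
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-election2020-mlm",
                           num_train_epochs=1,
                           per_device_train_batch_size=32),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```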

# Usage

This pre-trained language model **can be fine-tuned on any downstream task (e.g., classification)**.

Please see the [official repository](https://github.com/GU-DataLab/stance-detection-KE-MLM) for more details.

```python
from transformers import BertTokenizer, BertForMaskedLM, pipeline
import torch

# Choose GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Select model path here
pretrained_LM_path = "kornosk/bert-political-election2020-twitter-mlm"

# Load model
tokenizer = BertTokenizer.from_pretrained(pretrained_LM_path)
model = BertForMaskedLM.from_pretrained(pretrained_LM_path)

# Fill mask
example = "Trump is the [MASK] of USA"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# If the line above does not work, use the line below instead;
# newer versions of Hugging Face Transformers accept a model name string.
# fill_mask = pipeline('fill-mask', model=pretrained_LM_path, tokenizer=tokenizer)

outputs = fill_mask(example)
print(outputs)

# Inspect raw model outputs (MLM prediction logits) for the example
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
print(outputs)

# Or fine-tune this model on your own downstream task!
# Please consider citing our paper if you find this useful :)
```
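
For the downstream fine-tuning mentioned above, a minimal classification sketch could look like the following. The example tweets, labels, label scheme, and hyperparameters are placeholders, not the stance-detection setup from the paper.

```python
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

pretrained_LM_path = "kornosk/bert-political-election2020-twitter-mlm"

tokenizer = BertTokenizer.from_pretrained(pretrained_LM_path)
# Adds a randomly initialized classification head on top of the pre-trained encoder
model = BertForSequenceClassification.from_pretrained(pretrained_LM_path, num_labels=3)

# Placeholder labeled tweets (0 = against, 1 = favor, 2 = neutral)
texts = ["I will definitely vote for him", "Worst candidate ever", "Election day is coming"]
labels = [1, 0, 2]

encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)
dataset = [
    {"input_ids": ids, "attention_mask": mask, "labels": label}
    for ids, mask, label in zip(encodings["input_ids"], encodings["attention_mask"], labels)
]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="stance-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
```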

# Reference

- [Knowledge Enhanced Masked Language Model for Stance Detection](https://www.aclweb.org/anthology/2021.naacl-main.376), NAACL 2021.

# Citation
```bibtex
@inproceedings{kawintiranon2021knowledge,
    title={Knowledge Enhanced Masked Language Model for Stance Detection},
    author={Kawintiranon, Kornraphop and Singh, Lisa},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    year={2021},
    publisher={Association for Computational Linguistics},
    url={https://www.aclweb.org/anthology/2021.naacl-main.376}
}
```