PHS-BERT / README.md
publichealthsurveillance's picture
Update README.md
863b4b4

PHS-BERT

We present and release PHS-BERT, a transformer-based pretrained language model (PLM), to identify tasks related to public health surveillance (PHS) on social media. Compared with existing PLMs that are mainly evaluated on limited tasks, PHS-BERT achieved state-of-the-art performance on 25 tested datasets, showing that our PLM is robust and generalizable in common PHS tasks.

Usage

Load the model via Huggingface's Transformers library:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("publichealthsurveillance/PHS-BERT")
model = AutoModel.from_pretrained("publichealthsurveillance/PHS-BERT")

Training Procedure

Pretraining

We followed the standard pretraining protocols of BERT and initialized PHS-BERT with weights from BERT during the training phase instead of training from scratch and used the uncased version of the BERT model.

PHS-BERT is trained on a corpus of health-related tweets that were crawled via the Twitter API. Focusing on the tasks related to PHS, keywords used to collect pretraining corpus are set to disease, symptom, vaccine, and mental health-related words in English. Retweet tags were deleted from the raw corpus, and URLs and usernames were replaced with HTTP-URL and @USER, respectively. All emoticons were replaced with their associated meanings.

Each sequence of BERT LM inputs is converted to 50,265 vocabulary tokens. Twitter posts are restricted to 200 characters, and during the training and evaluation phase, we used a batch size of 8. Distributed training was performed on a TPU v3-8.

Fine-tuning

We used the embedding of the special token [CLS] of the last hidden layer as the final feature of the input text. We adopted the multilayer perceptron (MLP) with the hyperbolic tangent activation function and used Adam optimizer. The models are trained with a one cycle policy at a maximum learning rate of 2e-05 with momentum cycled between 0.85 and 0.95.

Societal Impact

We train and release a PLM to accelerate the automatic identification of tasks related to PHS on social media. Our work aims to develop a new computational method for screening users in need of early intervention and is not intended to use in clinical settings or as a diagnostic tool.

BibTex entry and citation info

For more details, refer to the paper Benchmarking for Public Health Surveillance tasks on Social Media with a Domain-Specific Pretrained Language Model.

@inproceedings{naseem-etal-2022-benchmarking,
    title = "Benchmarking for Public Health Surveillance tasks on Social Media with a Domain-Specific Pretrained Language Model",
    author = "Naseem, Usman  and
      Lee, Byoung Chan  and
      Khushi, Matloob  and
      Kim, Jinman  and
      Dunn, Adam",
    booktitle = "Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.nlppower-1.3",
    doi = "10.18653/v1/2022.nlppower-1.3",
    pages = "22--31",
    abstract = "A user-generated text on social media enables health workers to keep track of information, identify possible outbreaks, forecast disease trends, monitor emergency cases, and ascertain disease awareness and response to official health correspondence. This exchange of health information on social media has been regarded as an attempt to enhance public health surveillance (PHS). Despite its potential, the technology is still in its early stages and is not ready for widespread application. Advancements in pretrained language models (PLMs) have facilitated the development of several domain-specific PLMs and a variety of downstream applications. However, there are no PLMs for social media tasks involving PHS. We present and release PHS-BERT, a transformer-based PLM, to identify tasks related to public health surveillance on social media. We compared and benchmarked the performance of PHS-BERT on 25 datasets from different social medial platforms related to 7 different PHS tasks. Compared with existing PLMs that are mainly evaluated on limited tasks, PHS-BERT achieved state-of-the-art performance on all 25 tested datasets, showing that our PLM is robust and generalizable in the common PHS tasks. By making PHS-BERT available, we aim to facilitate the community to reduce the computational cost and introduce new baselines for future works across various PHS-related tasks.",
}