CoronaCentral BERT Model for Topic / Article Type Classification

This is the topic / article type multi-label classification model for the CoronaCentral website. It forms part of the pipeline for downloading and processing coronavirus literature described in the corona-ml repo, which includes step-by-step descriptions. The method is described in the preprint, and detailed performance results can be found in the machine learning details document.

This model was derived by fine-tuning the microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract model on this coronavirus sequence (document) classification task.


Below are two Google Colab notebooks with example usage of this sequence classification model using HuggingFace transformers and KTrain.

Training Data

The model is trained on approximately 3,200 manually curated articles sampled at various stages of the coronavirus pandemic. The training code is available in the category_prediction directory of the main GitHub repository, and the data is available in the annotated_documents.json.gz file.
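As a rough illustration of reading the training data, the sketch below opens a gzip-compressed JSON file with the Python standard library. It assumes that annotated_documents.json.gz holds a single JSON value; the layout of the records is not described here, so the function `load_annotated` (a hypothetical helper name) returns the parsed content without interpreting any fields.

```python
import gzip
import json


def load_annotated(path):
    """Parse a gzip-compressed JSON file and return its content unchanged.

    Assumption: the file contains one JSON value (for example a list of
    documents). No field names are assumed.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

A call such as `load_annotated("annotated_documents.json.gz")` would then give the raw annotations, which can be inspected before deciding how to feed them to a training script.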

Inputs and Outputs

The model takes a tokenized title and abstract, combined into a single string and separated by a newline. The outputs are topics and article types, which the pipeline code broadly calls categories. Because the task is multi-label, a document can receive more than one category. The categories predicted by the model are listed below. Some other categories are assigned by hand-coded rules described in the step-by-step descriptions.
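The input and output handling described above might look roughly like the following sketch using HuggingFace transformers. Several things here are assumptions rather than details confirmed by the pipeline: the helper names (`build_input`, `decode_multilabel`, `predict_categories`), the 0.5 decision threshold, the 512-token truncation limit, the availability of PyTorch weights for the `jakelever/coronabert` checkpoint (if only TensorFlow weights are published, passing `from_tf=True` to `from_pretrained` would be needed), and the presence of readable category names in the model config's `id2label` mapping.

```python
import math

MODEL_ID = "jakelever/coronabert"  # model ID as shown on the model page


def build_input(title, abstract):
    """Combine title and abstract into one string separated by a newline."""
    return f"{title.strip()}\n{abstract.strip()}"


def decode_multilabel(logits, id2label, threshold=0.5):
    """Score each label independently with a sigmoid and keep those at or
    above the threshold. The 0.5 default is an assumption, not a value
    taken from the pipeline."""
    scores = {id2label[i]: 1.0 / (1.0 + math.exp(-x)) for i, x in enumerate(logits)}
    return sorted(label for label, p in scores.items() if p >= threshold)


def predict_categories(title, abstract, threshold=0.5):
    """Run the model on one document. Requires transformers and torch,
    and downloads the weights on first use."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(
        build_input(title, abstract),
        truncation=True,
        max_length=512,  # assumed limit for a BERT-base model
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return decode_multilabel(logits, model.config.id2label, threshold)
```

The sigmoid is applied per label rather than a softmax across labels because the task is multi-label, so each category is an independent yes/no decision.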

List of Article Types

  • Comment/Editorial
  • Meta-analysis
  • News
  • Review

List of Topics

  • Clinical Reports
  • Communication
  • Contact Tracing
  • Diagnostics
  • Drug Targets
  • Education
  • Effect on Medical Specialties
  • Forecasting & Modelling
  • Health Policy
  • Healthcare Workers
  • Imaging
  • Immunology
  • Inequality
  • Infection Reports
  • Long Haul
  • Medical Devices
  • Misinformation
  • Model Systems & Tools
  • Molecular Biology
  • Non-human
  • Non-medical
  • Pediatrics
  • Prevalence
  • Prevention
  • Psychology
  • Recommendations
  • Risk Factors
  • Surveillance
  • Therapeutics
  • Transmission
  • Vaccines