jakelever committed on
Commit
f848702
1 Parent(s): aae5fbc

Added details to README

Files changed (1)
  1. README.md +59 -0
README.md CHANGED
@@ -1,2 +1,61 @@
1
  # CoronaCentral BERT Model for Topic / Article Type Classification
2
 
3
+ This is the topic / article type classification model for the [CoronaCentral website](https://coronacentral.ai). It forms part of the pipeline for downloading and processing coronavirus literature described in the [corona-ml repo](https://github.com/jakelever/corona-ml), which has [step-by-step descriptions](https://github.com/jakelever/corona-ml/blob/master/stepByStep.md). The method is described in the [preprint](https://doi.org/10.1101/2020.12.21.423860), and detailed performance results can be found in the [machine learning details](https://github.com/jakelever/corona-ml/blob/master/machineLearningDetails.md) document.
4
+
5
+ The model is derived from the [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) model and fine-tuned for sequence classification.
6
+
7
+ ## Usage
8
+
9
+ Below are two Google Colab notebooks with example usage of this sequence classification model using HuggingFace transformers and KTrain.
10
+
11
+ - [HuggingFace example on Google Colab](https://colab.research.google.com/drive/1cBNgKd4o6FNWwjKXXQQsC_SaX1kOXDa4?usp=sharing)
12
+ - [KTrain example on Google Colab](https://colab.research.google.com/drive/1h7oJa2NDjnBEoox0D5vwXrxiCHj3B1kU?usp=sharing)
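
For readers who prefer a local script to the notebooks, the following is a minimal sketch of scoring one document with HuggingFace transformers. It is not taken from the notebooks. The model identifier `jakelever/coronabert`, the function name `score_categories`, and the use of independent sigmoid scores (a multi-label head) are assumptions and should be checked against the notebooks before relying on it.

```python
# Assumed Hub identifier for this model; verify it against the model page.
MODEL_NAME = "jakelever/coronabert"


def score_categories(text, model_name=MODEL_NAME, max_length=512):
    """Return a {label: score} dict for one title-and-abstract string.

    Assumes a multi-label classification head, so each category gets an
    independent sigmoid score rather than a softmax over all categories.
    """
    # Imported lazily so the module can be loaded without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Truncate long abstracts to the maximum length BERT accepts.
    inputs = tokenizer(
        text, truncation=True, max_length=max_length, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits[0]

    scores = torch.sigmoid(logits).tolist()
    # id2label comes from the model config and maps output index to name.
    return {model.config.id2label[i]: s for i, s in enumerate(scores)}
```

Calling `score_categories("Some title\nSome abstract text")` downloads the weights on first use, so it needs network access and the `torch` and `transformers` packages.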
13
+
14
+ ## Training Data
15
+
16
+ The model is trained on ~3200 manually curated articles sampled at various stages of the coronavirus pandemic. The training code is available in the [category\_prediction](https://github.com/jakelever/corona-ml/tree/master/category_prediction) directory of the main GitHub repo. The data is available in the [annotated_documents.json.gz](https://github.com/jakelever/corona-ml/blob/master/category_prediction/annotated_documents.json.gz) file.
17
+
18
+ ## Inputs and Outputs
19
+
20
+ The model takes in a tokenized title and abstract (combined into a single string and separated by a newline). The outputs are topics and article types, broadly called categories in the pipeline code. The article types and topics predicted by the model are listed below. Some other categories are assigned by hand-coded rules described in the [step-by-step descriptions](https://github.com/jakelever/corona-ml/blob/master/stepByStep.md).
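
The input and output handling described above can be sketched in plain Python. The helper names `build_input` and `select_categories`, the 0.5 threshold, and the sigmoid scoring are illustrative assumptions, not the pipeline's actual code. The example logits are made up.

```python
import math


def build_input(title, abstract):
    """Join title and abstract into one string separated by a newline."""
    return f"{title.strip()}\n{abstract.strip()}"


def select_categories(logits, labels, threshold=0.5):
    """Keep every category whose sigmoid score reaches the threshold.

    Assumes one logit per label and a multi-label setup, so a document
    can receive several topics and article types at once.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    selected = []
    for label, logit in zip(labels, logits):
        score = 1.0 / (1.0 + math.exp(-logit))
        if score >= threshold:
            selected.append((label, score))
    # Highest-scoring categories first.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Hypothetical example with three of the categories listed in this README.
text = build_input("Vaccine candidates: an overview", "We summarise recent trials.")
labels = ["Review", "Vaccines", "News"]
picked = select_categories([2.0, 0.5, -3.0], labels)
print([label for label, _ in picked])
```

The threshold trades precision against recall. A value tuned on held-out annotated documents would be a more principled choice than the 0.5 used here.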
21
+
22
+ ### List of Article Types
23
+
24
+ - Comment/Editorial
25
+ - Meta-analysis
26
+ - News
27
+ - Review
28
+
29
+ ### List of Topics
30
+
31
+ - Clinical Reports
32
+ - Communication
33
+ - Contact Tracing
34
+ - Diagnostics
35
+ - Drug Targets
36
+ - Education
37
+ - Effect on Medical Specialties
38
+ - Forecasting & Modelling
39
+ - Health Policy
40
+ - Healthcare Workers
41
+ - Imaging
42
+ - Immunology
43
+ - Inequality
44
+ - Infection Reports
45
+ - Long Haul
46
+ - Medical Devices
47
+ - Misinformation
48
+ - Model Systems & Tools
49
+ - Molecular Biology
50
+ - Non-human
51
+ - Non-medical
52
+ - Pediatrics
53
+ - Prevalence
54
+ - Prevention
55
+ - Psychology
56
+ - Recommendations
57
+ - Risk Factors
58
+ - Surveillance
59
+ - Therapeutics
60
+ - Transmission
61
+ - Vaccines