---
license: mit
pipeline_tag: token-classification
tags:
- BERT
- bioBERT
- NER
- medical
metrics:
- f1
language:
- en
---
# Model
NER model for disease/treatment/technology entity recognition. The model and the data are intended for educational use.

The original dataset tags have been augmented with "inside" (I-) tags so that sub-tokens produced by the WordPiece tokenizer can also be labelled; see the sketch after the tag list below. The following NER tags are used:
- `B-DISEASE`, `I-DISEASE`: begin and inside tags for disease
- `B-TREATMENT`, `I-TREATMENT`: begin and inside tags for treatment
- `B-TECHNOLOGY`, `I-TECHNOLOGY`: begin and inside tags for technology
- `O`: outside entities (irrelevant)
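
A minimal sketch of this alignment step (assumed, since the training script is not part of this card): the first sub-token of a word keeps its original tag, and every following sub-token receives the corresponding `I-` tag.

```python
from transformers import AutoTokenizer

# Sketch only: expand word-level tags to sub-token level.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.2")

words = ["hydrocephalus"]
word_labels = ["B-DISEASE"]

aligned = []
for word, label in zip(words, word_labels):
    pieces = tokenizer.tokenize(word)  # e.g. ['h', '##ydro', '##ce', '##pha', '##lus']
    aligned.append(label)              # the first piece keeps the original tag
    aligned.extend([label.replace("B-", "I-")] * (len(pieces) - 1))  # remaining pieces get I- tags

print(aligned)  # e.g. ['B-DISEASE', 'I-DISEASE', 'I-DISEASE', 'I-DISEASE', 'I-DISEASE']
```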
## Example

Text: Acute obstructive hydrocephalus complicating bacterial meningitis in childhood

Gold labels (word level):

- Acute -> DISEASE
- obstructive -> DISEASE
- hydrocephalus -> DISEASE
- bacterial -> DISEASE
- meningitis -> DISEASE

Predictions (sub-token level):

- o##bs##truct##ive -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
- h##ydro##ce##pha##lus -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
- bacterial -> B-DISEASE
- men##ing##itis -> B-DISEASE + I-DISEASE + I-DISEASE
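
The model can be used with the Hugging Face Transformers `pipeline` API. A minimal sketch (the repository id is a placeholder, and the aggregation strategy is an assumption, not part of the original setup):

```python
from transformers import pipeline

# Placeholder repository id; replace with the actual model id on the Hub.
model_id = "<user>/biobert-disease-treatment-technology-ner"

ner = pipeline(
    "token-classification",
    model=model_id,
    aggregation_strategy="simple",  # merge sub-tokens such as 'h', '##ydro', ... back into words
)

print(ner("Acute obstructive hydrocephalus complicating bacterial meningitis in childhood"))
```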
# Sources
This pipeline is based on the pretrained `dmis-lab/biobert-base-cased-v1.2` model, fine-tuned on the relatively small BeHealthy Medical Entity dataset (1,550 training samples). An initial version of this model was then used to augment a medical technology dataset, and both datasets were used to train the final model.
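
As an illustration of the starting point (not the original training script), the base checkpoint can be loaded for token classification with the tag set described above; the label order shown here is an assumption:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed label order; only the tag set itself comes from this card.
labels = ["O", "B-DISEASE", "I-DISEASE", "B-TREATMENT", "I-TREATMENT", "B-TECHNOLOGY", "I-TECHNOLOGY"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.2")
model = AutoModelForTokenClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.2",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```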
# Performance
The model has not been extensively tuned. The quality of the dataset is unclear, since the origin of the data and the annotation process are unknown.
| Metric    | Score    |
|-----------|----------|
| Precision | 0.836892 |
| Recall    | 0.766610 |
| F1        | 0.800211 |
| Accuracy  | 0.935253 |
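
Entity-level precision/recall/F1 of this kind are commonly computed with `seqeval`; below is a minimal sketch with toy tag sequences (the actual evaluation data and script are not part of this card):

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy sequences for illustration only.
y_true = [["B-DISEASE", "I-DISEASE", "O", "B-TREATMENT", "O"]]
y_pred = [["B-DISEASE", "I-DISEASE", "O", "O", "O"]]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
```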