IMPORTANT

>>> This is an outdated model, please see my space for a more updated version. <<<

Background--

This model was built on Microsoft's BERT trained on PubMed uncased database (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext). A number of (~500) radiology reports for staging nasopharyngeal carcinoma (NPC) written in our center by board-certified radiologist were retrospectively retrieved with ethics approval . To focus on NPC, incidental findings and unrelated observations are removed prior to training. In addition, the abbreviations for structures were replaced by the original words to facilitate the model of learning suffixes and prefixes that might indicate geographical locations (e.g. L neck -> left neck, IJC -> internal jugular chain).

A tokenizer was trained based on the original PubMed version, and the radiology reports were used to fine tune the PubMedBert. This fine tuned model has the weakness of unable to identify phrase or multi-word nouns, e.g. "nodal metastatases" is considered two separate words such that the BERT module tends to fill "nodes" when these two words are masked.

This model serve as a pilot analysis of whether it is possible to adopt a transformer based deep learning for radiology report corpus of NPC.

Affiliations

Imaging and Interventional Radiology,

Chinese University of Hong Kong