PathologyBERT - Masked Language Model with Breast Pathology Specimens.

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Recently, several studies have explored the utility and efficacy of contextual models in the clinical, medical, and biomedical domains (BioBERT, ClinicalBERT, SciBERT, BlueBERT). However, while interest in language models for more specific domains is growing, the current trend is to re-train general-domain models on specialized corpora rather than to develop models from the ground up with a specialized vocabulary. A prevailing assumption is that even domain-specific pretraining benefits from starting with a general-domain language model. In fields that require specialized terminology, such as pathology, these models often fail to perform adequately.

One major reason for this limitation is that BERT uses WordPiece for unsupervised input tokenization, a technique that relies on a predetermined vocabulary of word pieces. The vocabulary is built to contain the most commonly used words and subword units, so any new word can be represented by frequent subwords. Although WordPiece was designed to handle suffixes and complex compound words, it often fails on domain-specific terms. For example, ClinicalBERT successfully tokenizes the word 'endpoint' as ['end', '##point'], but it tokenizes the word 'carcinoma' as ['car', '##cin', '##oma'], so the word loses its actual meaning and is replaced by irrelevant junk pieces such as 'car'. A word that has been replaced by junk pieces may not play its original role in deriving the contextual representation of the sentence or paragraph, even when analyzed by powerful transformer models.
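The splitting behaviour described above can be reproduced with a minimal sketch of greedy longest-match-first WordPiece splitting. The function name `wordpiece_tokenize` and both vocabularies are invented for illustration; they are tiny toy sets, not the real ClinicalBERT or PathologyBERT vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies (assumptions for this sketch only).
general_vocab = {"end", "##point", "car", "##cin", "##oma"}
pathology_vocab = general_vocab | {"carcinoma"}

print(wordpiece_tokenize("endpoint", general_vocab))     # ['end', '##point']
print(wordpiece_tokenize("carcinoma", general_vocab))    # ['car', '##cin', '##oma']
print(wordpiece_tokenize("carcinoma", pathology_vocab))  # ['carcinoma']
```

Once the vocabulary contains the domain term as a whole word, the same algorithm keeps it intact, which is the motivation for building a vocabulary from pathology text.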

To facilitate research on language representations in the pathology domain, and to help researchers address the current limitations and advance cancer research, we present PathologyBERT, a pre-trained masked language model trained on Histopathology Specimen Reports.

Pretraining Hyperparameters

We pre-trained the language model with a batch size of 32, a maximum sequence length of 64 (the mean report size is 42±26), a masked language model probability of 0.15, and a learning rate of 2e-5. The model was trained for 300,000 steps. All other parameters were left at their BERT defaults.
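To put these numbers in perspective, the sketch below collects them in one place and derives two rough quantities from them. The class name `PretrainingConfig` and its field names are hypothetical and are not taken from the training code. The calculation also assumes that the mean report size of 42 is measured in tokens, which the text above does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainingConfig:
    # Values copied from the paragraph above; field names are illustrative.
    batch_size: int = 32
    max_seq_length: int = 64
    mlm_probability: float = 0.15
    learning_rate: float = 2e-5
    train_steps: int = 300_000


config = PretrainingConfig()

# Number of sequences drawn over the whole run (reports may repeat across epochs).
sequences_seen = config.batch_size * config.train_steps

# Rough expected number of tokens selected for masking in an average report,
# assuming the mean report size (42) is a token count.
mean_report_length = 42
expected_masked = config.mlm_probability * mean_report_length

print(sequences_seen)             # 9600000
print(round(expected_masked, 1))  # 6.3
```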

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> language_model = pipeline('fill-mask', model='tsantos/PathologyBERT')
>>> language_model("intraductal papilloma with [MASK] AND MICRO calcifications")

[{'sequence': '[CLS] intraductal papilloma with sclerosis and micro calcifications [SEP]',
  'score': 0.871,
  'token': 2364,
  'token_str': 'sclerosis'},
 {'sequence': '[CLS] intraductal papilloma with hyalinization and micro calcifications [SEP]',
  'score': 0.032,
  'token': 4046,
  'token_str': 'hyalinization'},
 {'sequence': '[CLS] intraductal papilloma with atypia and micro calcifications [SEP]',
  'score': 0.013,
  'token': 652,
  'token_str': 'atypia'},
 {'sequence': '[CLS] intraductal papilloma with sclerosing and micro calcifications [SEP]',
  'score': 0.006,
  'token': 923,
  'token_str': 'sclerosing'},
 {'sequence': '[CLS] intraductal papilloma with calcifications and micro calcifications [SEP]',
  'score': 0.004,
  'token': 614,
  'token_str': 'calcifications'}]


>>> language_model("micro calcifications with usual ductal hyperplasia and no [MASK] lesions identified.")

[{'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no atypical lesions identified. [SEP]',
  'score': 0.902,
  'token': 472,
  'token_str': 'atypical'},
 {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no proliferative lesions identified. [SEP]',
  'score': 0.054,
  'token': 667,
  'token_str': 'proliferative'},
 {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no papillary lesions identified. [SEP]',
  'score': 0.009,
  'token': 1177,
  'token_str': 'papillary'},
 {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no invasive lesions identified. [SEP]',
  'score': 0.003,
  'token': 385,
  'token_str': 'invasive'},
 {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no malignant lesions identified. [SEP]',
  'score': 0.003,
  'token': 581,
  'token_str': 'malignant'}]
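The pipeline returns a list of dictionaries, so the candidates can be post-processed with plain Python. The sketch below keeps only fills above a score threshold. The helper name `confident_fills` and the 0.05 cutoff are arbitrary choices for illustration, and the input is a trimmed copy of the second output above rather than a fresh model call.

```python
def confident_fills(predictions, min_score=0.05):
    """Return (token_str, score) pairs scoring at least min_score, best first."""
    kept = [(p["token_str"], p["score"]) for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Tokens and scores copied from the second example output above.
predictions = [
    {"token_str": "atypical", "score": 0.902},
    {"token_str": "proliferative", "score": 0.054},
    {"token_str": "papillary", "score": 0.009},
    {"token_str": "invasive", "score": 0.003},
    {"token_str": "malignant", "score": 0.003},
]

print(confident_fills(predictions))
# [('atypical', 0.902), ('proliferative', 0.054)]
```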

More Information

Refer to the original paper, "Pre-trained Vs. A New Transformer Language Model for A Specific Domain - Breast Pathology Use-case", for additional details and masked language modeling performance on Pathology Specimen Reports.

Hierarchical BERT Classification For Breast Cancer

We also provide a Hierarchical Classification System that uses PathologyBERT as its base to classify Pathology Specimen Reports.

You can test the system directly on HuggingFace: https://huggingface.co/spaces/tsantos/Hierarchical-Classification-System-for-Breast-Cancer

GitHub: https://github.com/thiagosantos1/HCSBC

Questions?

If you have any questions, please email thiagogyn.maia@gmail.com