Thiago committed on
Commit 5ad0ba9
1 Parent(s): da77ac2

Pathology BERT, a Masked Language Model with Breast Pathology Specimens

README.md ADDED
@@ -0,0 +1,88 @@
+ ---
+ language: "en"
+ tags:
+ - fill-mask
+
+ ---
+
+ # PathologyBERT - Masked Language Model with Breast Pathology Specimens
+
+ Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Recently, several studies have explored the utility and efficacy of contextual models in the clinical, medical, and biomedical domains ([BioBERT](https://arxiv.org/pdf/1901.08746.pdf), [ClinicalBERT](https://aclanthology.org/W19-1909/), [SciBERT](https://arxiv.org/abs/1903.10676), [BlueBERT](https://arxiv.org/abs/1906.05474)).
+ However, while there is growing interest in developing language models for more specific domains, the current trend appears to favor re-training general-domain models on specialized corpora rather than developing models from the ground up with a specialized vocabulary. A prevailing assumption is that even domain-specific pretraining benefits from starting with a general-domain language model. However, in fields that require specialized terminology, such as pathology, these models often fail to perform adequately. One major reason for this limitation is that BERT uses [Word-Pieces](https://www.semanticscholar.org/paper/Google%27s-Neural-Machine-Translation-System%3A-the-Gap-Wu-Schuster/dbde7dfa6cae81df8ac19ef500c42db96c3d1edd) for unsupervised input tokenization, a technique that relies on a predetermined set of Word-Pieces. The vocabulary is built to contain the most commonly used words or subword units, so that any new word can be represented by frequent subwords. Although WordPiece was designed to handle suffixes and complex compound words, it often fails with domain-specific terms. For example, while [ClinicalBERT](https://aclanthology.org/W19-1909/) successfully tokenizes the word "endpoint" as [end \#\#point], it breaks the word "histopathology" into the uninformative pieces [his \#\#top \#\#ath \#\#olo \#\#gy].
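+
+ To illustrate this tokenization gap, here is a minimal sketch using the Transformers tokenizer API. The ClinicalBERT checkpoint name below is an assumption made for illustration, and the expected splits follow the example above:
+
+ ```python
+ >>> from transformers import AutoTokenizer
+
+ >>> # General-domain clinical vocabulary (checkpoint name assumed for illustration)
+ >>> clinical_tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
+ >>> clinical_tokenizer.tokenize("histopathology")
+ ['his', '##top', '##ath', '##olo', '##gy']
+
+ >>> # PathologyBERT vocabulary, built directly from pathology reports
+ >>> pathology_tokenizer = AutoTokenizer.from_pretrained('tsantos/PathologyBERT')
+ >>> pathology_tokenizer.tokenize("histopathology")  # expected to remain largely intact as a domain term
+ ```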
+
+
+
+ To facilitate research on language representations in the pathology domain, and to assist researchers in addressing the current limitations and advancing cancer research, we present PathologyBERT, a masked language model pre-trained on breast cancer histopathology specimen reports.
+
+
+ ### Pretraining Hyperparameters
+ We used a batch size of 32, a maximum sequence length of 64 (the mean report length is 42 ± 26 tokens), a masked language model probability of 0.15, and a learning rate of 2e-5 to pre-train the language model. The model was trained for 300,000 steps. All other BERT default parameters were used.
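+
+ For reference, the `config.json` in this commit corresponds to a standard BERT-base encoder (12 layers, hidden size 768) with a custom 13,000-entry WordPiece vocabulary. Below is a minimal sketch of an equivalent masked-language-modeling data setup using these hyperparameters; it is an illustration only, not the original training script (the full training configuration is recorded in `model_args.json`):
+
+ ```python
+ from transformers import AutoTokenizer, DataCollatorForLanguageModeling
+
+ tokenizer = AutoTokenizer.from_pretrained('tsantos/PathologyBERT')
+
+ # 15% of tokens are masked for the MLM objective, matching the pre-training setup
+ collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
+
+ # Reports are truncated to the 64-token maximum sequence length used during pre-training
+ encoding = tokenizer("intraductal papilloma with micro calcifications",
+                      truncation=True, max_length=64)
+ batch = collator([encoding])  # holds randomly masked input_ids and the corresponding MLM labels
+ ```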
+
+
+
+ ### How to use
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> language_model = pipeline('fill-mask', model='tsantos/PathologyBERT')
+ >>> language_model("intraductal papilloma with [MASK] AND MICRO calcifications")
+
+ [{'sequence': '[CLS] intraductal papilloma with sclerosis and micro calcifications [SEP]',
+   'score': 0.871,
+   'token': 2364,
+   'token_str': 'sclerosis'},
+  {'sequence': '[CLS] intraductal papilloma with hyalinization and micro calcifications [SEP]',
+   'score': 0.032,
+   'token': 4046,
+   'token_str': 'hyalinization'},
+  {'sequence': '[CLS] intraductal papilloma with atypia and micro calcifications [SEP]',
+   'score': 0.013,
+   'token': 652,
+   'token_str': 'atypia'},
+  {'sequence': '[CLS] intraductal papilloma with sclerosing and micro calcifications [SEP]',
+   'score': 0.006,
+   'token': 923,
+   'token_str': 'sclerosing'},
+  {'sequence': '[CLS] intraductal papilloma with calcifications and micro calcifications [SEP]',
+   'score': 0.004,
+   'token': 614,
+   'token_str': 'calcifications'}]
+
+
+ >>> language_model("micro calcifications with usual ductal hyperplasia and no [MASK] lesions identified.")
+
+ [{'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no atypical lesions identified. [SEP]',
+   'score': 0.902,
+   'token': 472,
+   'token_str': 'atypical'},
+  {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no proliferative lesions identified. [SEP]',
+   'score': 0.054,
+   'token': 667,
+   'token_str': 'proliferative'},
+  {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no papillary lesions identified. [SEP]',
+   'score': 0.009,
+   'token': 1177,
+   'token_str': 'papillary'},
+  {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no invasive lesions identified. [SEP]',
+   'score': 0.003,
+   'token': 385,
+   'token_str': 'invasive'},
+  {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no malignant lesions identified. [SEP]',
+   'score': 0.003,
+   'token': 581,
+   'token_str': 'malignant'}]
+ ```
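+
+ Alternatively, you can load the tokenizer and model weights directly and inspect the top predictions for the masked position yourself. A minimal sketch (assuming PyTorch):
+
+ ```python
+ >>> import torch
+ >>> from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ >>> tokenizer = AutoTokenizer.from_pretrained('tsantos/PathologyBERT')
+ >>> model = AutoModelForMaskedLM.from_pretrained('tsantos/PathologyBERT')
+
+ >>> text = "intraductal papilloma with [MASK] and micro calcifications"
+ >>> inputs = tokenizer(text, return_tensors='pt')
+ >>> with torch.no_grad():
+ ...     logits = model(**inputs).logits
+
+ >>> # Locate the [MASK] position and list the five highest-scoring replacement tokens
+ >>> mask_position = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
+ >>> top_ids = logits[0, mask_position].topk(5).indices
+ >>> tokenizer.convert_ids_to_tokens(top_ids.tolist())
+ ```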
+
+
+ ## More Information
+
+ Refer to the original paper, [Pre-trained Vs. A New Transformer Language Model for A Specific Domain - Breast Pathology Use-case](https://arxiv.org/), for additional details and for masked language modeling performance on pathology specimen reports.
+
+
+ ## Questions?
+
+ If you have any questions, please email tpsanto@emory.edu
+
+
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.16.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 13000
+ }
model_args.json ADDED
@@ -0,0 +1 @@
+ {"adafactor_beta1": null, "adafactor_clip_threshold": 1.0, "adafactor_decay_rate": -0.8, "adafactor_eps": [1e-30, 0.001], "adafactor_relative_step": true, "adafactor_scale_parameter": true, "adafactor_warmup_init": true, "adam_epsilon": 1e-08, "best_model_dir": "../checkpoints/bert/best_model", "cache_dir": "cache_dir/", "config": {}, "cosine_schedule_num_cycles": 0.5, "custom_layer_parameters": [], "custom_parameter_groups": [], "dataloader_num_workers": 0, "do_lower_case": false, "dynamic_quantize": false, "early_stopping_consider_epochs": false, "early_stopping_delta": 0, "early_stopping_metric": "eval_loss", "early_stopping_metric_minimize": true, "early_stopping_patience": 3, "encoding": null, "eval_batch_size": 8, "evaluate_during_training": false, "evaluate_during_training_silent": true, "evaluate_during_training_steps": 2000, "evaluate_during_training_verbose": false, "evaluate_each_epoch": true, "fp16": true, "gradient_accumulation_steps": 1, "learning_rate": 2e-05, "local_rank": -1, "logging_steps": 50, "loss_type": null, "loss_args": {}, "manual_seed": null, "max_grad_norm": 1.0, "max_seq_length": 64, "model_name": null, "model_type": "bert", "multiprocessing_chunksize": -1, "n_gpu": 1, "no_cache": false, "no_save": false, "not_saved_args": [], "num_train_epochs": 100, "optimizer": "AdamW", "output_dir": "../checkpoints/bert", "overwrite_output_dir": true, "polynomial_decay_schedule_lr_end": 1e-07, "polynomial_decay_schedule_power": 1.0, "process_count": 62, "quantized_model": false, "reprocess_input_data": true, "save_best_model": true, "save_eval_checkpoints": true, "save_model_every_epoch": false, "save_optimizer_and_scheduler": true, "save_steps": 2000, "scheduler": "linear_schedule_with_warmup", "silent": false, "skip_special_tokens": true, "tensorboard_dir": null, "thread_count": null, "tokenizer_name": "../checkpoints/bert", "tokenizer_type": null, "train_batch_size": 32, "train_custom_parameters_only": false, "use_cached_eval_features": false, "use_early_stopping": false, "use_hf_datasets": false, "use_multiprocessing": true, "use_multiprocessing_for_evaluation": true, "wandb_kwargs": {}, "wandb_project": null, "warmup_ratio": 0.06, "warmup_steps": 32694, "weight_decay": 1e-07, "model_class": "LanguageModelingModel", "block_size": 64, "config_name": null, "dataset_class": null, "dataset_type": "simple", "discriminator_config": {}, "discriminator_loss_weight": 50.0, "generator_config": {}, "max_steps": -1, "min_frequency": 2, "mlm": true, "mlm_probability": 0.15, "sliding_window": false, "special_tokens": ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"], "stride": 0.8, "tie_generator_and_discriminator_embeddings": true, "vocab_size": 13000, "clean_text": true, "handle_chinese_chars": true, "special_tokens_list": [], "strip_accents": true}
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "../checkpoints/bert", "tokenizer_class": "BertTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0491259429e6ea65edd8e5ad55a4e63b4978f204cf7bb8261aad89a0cea6ec62
+ size 3375
vocab.txt ADDED
The diff for this file is too large to render.