Thiago committed on
Commit 5ad0ba9
1 Parent(s): da77ac2

Pathology BERT, a Masked Language Model with Breast Pathology Specimens

README.md ADDED
@@ -0,0 +1,88 @@
+ ---
+ language: "en"
+ tags:
+ - fill-mask
+
+ ---
+
+ # PathologyBERT - Masked Language Model with Breast Pathology Specimens
+
+ Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Recently, several studies have explored the utility and efficacy of contextual models in the clinical, medical, and biomedical domains ([BioBERT](https://arxiv.org/pdf/1901.08746.pdf), [ClinicalBERT](https://aclanthology.org/W19-1909/), [SciBERT](https://arxiv.org/abs/1903.10676), [BlueBERT](https://arxiv.org/abs/1906.05474)).
+ However, while there is growing interest in developing language models for more specific domains, the current trend appears to favor re-training general-domain models on specialized corpora rather than developing models from the ground up with a specialized vocabulary. A prevailing assumption is that even domain-specific pretraining benefits from starting with a general-domain language model. However, in fields that require specialized terminology, such as pathology, these models often fail to perform adequately. One major reason for this limitation is that BERT uses [Word-Pieces](https://www.semanticscholar.org/paper/Google%27s-Neural-Machine-Translation-System%3A-the-Gap-Wu-Schuster/dbde7dfa6cae81df8ac19ef500c42db96c3d1edd) for unsupervised input tokenization, a technique that relies on a predetermined set of Word-Pieces. The vocabulary is built to contain the most commonly used words or subword units, so that any new word can be represented by frequent subwords. Although WordPiece was designed to handle suffixes and complex compound words, it often fails with domain-specific terms. For example, while [ClinicalBERT](https://aclanthology.org/W19-1909/) successfully tokenizes the word "endpoint" as [end \#\#point], it breaks the word "histopathology" into the uninformative pieces [his \#\#top \#\#ath \#\#olo \#\#gy].
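+
+ To illustrate this tokenization gap, here is a minimal sketch using the Transformers tokenizer API. The ClinicalBERT checkpoint name below is an assumption made for illustration, and the expected splits follow the example above:
+
+ ```python
+ >>> from transformers import AutoTokenizer
+
+ >>> # General-domain clinical vocabulary (checkpoint name assumed for illustration)
+ >>> clinical_tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
+ >>> clinical_tokenizer.tokenize("histopathology")
+ ['his', '##top', '##ath', '##olo', '##gy']
+
+ >>> # PathologyBERT vocabulary, built directly from pathology reports
+ >>> pathology_tokenizer = AutoTokenizer.from_pretrained('tsantos/PathologyBERT')
+ >>> pathology_tokenizer.tokenize("histopathology")  # expected to remain largely intact as a domain term
+ ```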
+
+
+
+ To facilitate research on language representations in the pathology domain, and to assist researchers in addressing the current limitations and advancing cancer research, we present PathologyBERT, a masked language model pre-trained on breast cancer histopathology specimen reports.
+
+
+ ### Pretraining Hyperparameters
+ We used a batch size of 32, a maximum sequence length of 64 (the mean report length is 42 ± 26 tokens), a masked language model probability of 0.15, and a learning rate of 2e-5 to pre-train the language model. The model was trained for 300,000 steps. All other BERT default parameters were used.
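+
+ For reference, the `config.json` in this commit corresponds to a standard BERT-base encoder (12 layers, hidden size 768) with a custom 13,000-entry WordPiece vocabulary. Below is a minimal sketch of an equivalent masked-language-modeling data setup using these hyperparameters; it is an illustration only, not the original training script (the full training configuration is recorded in `model_args.json`):
+
+ ```python
+ from transformers import AutoTokenizer, DataCollatorForLanguageModeling
+
+ tokenizer = AutoTokenizer.from_pretrained('tsantos/PathologyBERT')
+
+ # 15% of tokens are masked for the MLM objective, matching the pre-training setup
+ collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
+
+ # Reports are truncated to the 64-token maximum sequence length used during pre-training
+ encoding = tokenizer("intraductal papilloma with micro calcifications",
+                      truncation=True, max_length=64)
+ batch = collator([encoding])  # holds randomly masked input_ids and the corresponding MLM labels
+ ```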
+
+
+
+ ### How to use
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> language_model = pipeline('fill-mask', model='tsantos/PathologyBERT')
+ >>> language_model("intraductal papilloma with [MASK] AND MICRO calcifications")
+
+ [{'sequence': '[CLS] intraductal papilloma with sclerosis and micro calcifications [SEP]',
+   'score': 0.871,
+   'token': 2364,
+   'token_str': 'sclerosis'},
+  {'sequence': '[CLS] intraductal papilloma with hyalinization and micro calcifications [SEP]',
+   'score': 0.032,
+   'token': 4046,
+   'token_str': 'hyalinization'},
+  {'sequence': '[CLS] intraductal papilloma with atypia and micro calcifications [SEP]',
+   'score': 0.013,
+   'token': 652,
+   'token_str': 'atypia'},
+  {'sequence': '[CLS] intraductal papilloma with sclerosing and micro calcifications [SEP]',
+   'score': 0.006,
+   'token': 923,
+   'token_str': 'sclerosing'},
+  {'sequence': '[CLS] intraductal papilloma with calcifications and micro calcifications [SEP]',
+   'score': 0.004,
+   'token': 614,
+   'token_str': 'calcifications'}]
+
+
+ >>> language_model("micro calcifications with usual ductal hyperplasia and no [MASK] lesions identified.")
+
+ [{'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no atypical lesions identified. [SEP]',
+   'score': 0.902,
+   'token': 472,
+   'token_str': 'atypical'},
+  {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no proliferative lesions identified. [SEP]',
+   'score': 0.054,
+   'token': 667,
+   'token_str': 'proliferative'},
+  {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no papillary lesions identified. [SEP]',
+   'score': 0.009,
+   'token': 1177,
+   'token_str': 'papillary'},
+  {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no invasive lesions identified. [SEP]',
+   'score': 0.003,
+   'token': 385,
+   'token_str': 'invasive'},
+  {'sequence': '[CLS] micro calcifications with usual ductal hyperplasia and no malignant lesions identified. [SEP]',
+   'score': 0.003,
+   'token': 581,
+   'token_str': 'malignant'}]
+ ```
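+
+ Alternatively, you can load the tokenizer and model weights directly and inspect the top predictions for the masked position yourself. A minimal sketch (assuming PyTorch):
+
+ ```python
+ >>> import torch
+ >>> from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ >>> tokenizer = AutoTokenizer.from_pretrained('tsantos/PathologyBERT')
+ >>> model = AutoModelForMaskedLM.from_pretrained('tsantos/PathologyBERT')
+
+ >>> text = "intraductal papilloma with [MASK] and micro calcifications"
+ >>> inputs = tokenizer(text, return_tensors='pt')
+ >>> with torch.no_grad():
+ ...     logits = model(**inputs).logits
+
+ >>> # Locate the [MASK] position and list the five highest-scoring replacement tokens
+ >>> mask_position = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
+ >>> top_ids = logits[0, mask_position].topk(5).indices
+ >>> tokenizer.convert_ids_to_tokens(top_ids.tolist())
+ ```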
+
+
+ ## More Information
+
+ Refer to the original paper, [Pre-trained Vs. A New Transformer Language Model for A Specific Domain - Breast Pathology Use-case](https://arxiv.org/), for additional details and for masked language modeling performance on pathology specimen reports.
+
+
+ ## Questions?
+
+ If you have any questions, please email tpsanto@emory.edu
+
+
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.16.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 13000
+ }
model_args.json ADDED
@@ -0,0 +1 @@
+ {"adafactor_beta1": null, "adafactor_clip_threshold": 1.0, "adafactor_decay_rate": -0.8, "adafactor_eps": [1e-30, 0.001], "adafactor_relative_step": true, "adafactor_scale_parameter": true, "adafactor_warmup_init": true, "adam_epsilon": 1e-08, "best_model_dir": "../checkpoints/bert/best_model", "cache_dir": "cache_dir/", "config": {}, "cosine_schedule_num_cycles": 0.5, "custom_layer_parameters": [], "custom_parameter_groups": [], "dataloader_num_workers": 0, "do_lower_case": false, "dynamic_quantize": false, "early_stopping_consider_epochs": false, "early_stopping_delta": 0, "early_stopping_metric": "eval_loss", "early_stopping_metric_minimize": true, "early_stopping_patience": 3, "encoding": null, "eval_batch_size": 8, "evaluate_during_training": false, "evaluate_during_training_silent": true, "evaluate_during_training_steps": 2000, "evaluate_during_training_verbose": false, "evaluate_each_epoch": true, "fp16": true, "gradient_accumulation_steps": 1, "learning_rate": 2e-05, "local_rank": -1, "logging_steps": 50, "loss_type": null, "loss_args": {}, "manual_seed": null, "max_grad_norm": 1.0, "max_seq_length": 64, "model_name": null, "model_type": "bert", "multiprocessing_chunksize": -1, "n_gpu": 1, "no_cache": false, "no_save": false, "not_saved_args": [], "num_train_epochs": 100, "optimizer": "AdamW", "output_dir": "../checkpoints/bert", "overwrite_output_dir": true, "polynomial_decay_schedule_lr_end": 1e-07, "polynomial_decay_schedule_power": 1.0, "process_count": 62, "quantized_model": false, "reprocess_input_data": true, "save_best_model": true, "save_eval_checkpoints": true, "save_model_every_epoch": false, "save_optimizer_and_scheduler": true, "save_steps": 2000, "scheduler": "linear_schedule_with_warmup", "silent": false, "skip_special_tokens": true, "tensorboard_dir": null, "thread_count": null, "tokenizer_name": "../checkpoints/bert", "tokenizer_type": null, "train_batch_size": 32, "train_custom_parameters_only": false, "use_cached_eval_features": false, "use_early_stopping": false, "use_hf_datasets": false, "use_multiprocessing": true, "use_multiprocessing_for_evaluation": true, "wandb_kwargs": {}, "wandb_project": null, "warmup_ratio": 0.06, "warmup_steps": 32694, "weight_decay": 1e-07, "model_class": "LanguageModelingModel", "block_size": 64, "config_name": null, "dataset_class": null, "dataset_type": "simple", "discriminator_config": {}, "discriminator_loss_weight": 50.0, "generator_config": {}, "max_steps": -1, "min_frequency": 2, "mlm": true, "mlm_probability": 0.15, "sliding_window": false, "special_tokens": ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"], "stride": 0.8, "tie_generator_and_discriminator_embeddings": true, "vocab_size": 13000, "clean_text": true, "handle_chinese_chars": true, "special_tokens_list": [], "strip_accents": true}
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "../checkpoints/bert", "tokenizer_class": "BertTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0491259429e6ea65edd8e5ad55a4e63b4978f204cf7bb8261aad89a0cea6ec62
+ size 3375
vocab.txt ADDED
The diff for this file is too large to render.