---
license: mit
---

# Clinical-T5 Models

We train four T5 variants on the union of MIMIC-III and MIMIC-IV: (1) initialized from T5-Base, (2) initialized from SciFive-Base, (3) T5-Base trained from scratch, and (4) T5-Large trained from scratch.

This model card describes the T5-Large model trained from scratch on MIMIC notes.

# Model Pretraining

In this section, we describe the pretraining procedure.

### Pretraining Data

We train on the union of MIMIC-III and MIMIC-IV. MIMIC-III contains a wide variety of note types, whereas MIMIC-IV contains only radiology reports and discharge summaries. We remove duplicate notes. This results in ~1.2B words.

### Note Preprocessing

We perform two important preprocessing steps:

* We replace all DEID tags with special tokens. For example, `"The patient, [**First Name 123**], has a history of high blood pressure"` becomes `"The patient, [NAME], has a history of high blood pressure"`.
* We remove duplicate notes based on edit times. Roughly 300M of the 800M words from MIMIC-III are repeats of the same note with only a few words changed! This happens because a nurse might save a note and then edit it 10 minutes later; both versions appear in the data.

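The DEID replacement step can be sketched as below. Note that the regex and the tag-to-token mapping here are illustrative assumptions for demonstration, not the authors' exact preprocessing script:

```
import re

# MIMIC DEID placeholders look like [**First Name 123**] or [**2145-3-1**].
DEID_PATTERN = re.compile(r"\[\*\*(.*?)\*\*\]")

def replace_deid(text: str) -> str:
    """Map each DEID placeholder to a coarse special token (assumed mapping)."""
    def to_token(match: re.Match) -> str:
        tag = match.group(1).lower()
        if "name" in tag:
            return "[NAME]"
        if "date" in tag or "month" in tag or "year" in tag:
            return "[DATE]"
        return "[DEID]"  # fallback for tags not covered above
    return DEID_PATTERN.sub(to_token, text)

note = "The patient, [**First Name 123**], has a history of high blood pressure"
print(replace_deid(note))
# → The patient, [NAME], has a history of high blood pressure
```
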
### Pretraining Procedures

We train the Clinical-T5-Large model from scratch using a cased vocabulary of 32,000. We train it for 780,000 steps, using a batch size of 12 sequences per TPU pod (8 pods total) and a sequence length of 512. This results in a batch size of 49,152 tokens per step. Accounting for the number of steps, this equates to ~38B tokens. We were aiming for 40B, but our Google Cloud instance broke!

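The token-count arithmetic above can be checked directly (interpreting the 49,152 figure as tokens per step, which is what 12 × 8 × 512 gives):

```
# Sanity-check the batch-size and token-count figures quoted above.
sequences_per_step = 12 * 8                    # 12 per pod × 8 pods
tokens_per_step = sequences_per_step * 512     # × sequence length
total_tokens = tokens_per_step * 780_000       # × number of steps

print(tokens_per_step)  # 49152
print(total_tokens)     # 38338560000, i.e. ~38.3B tokens
```
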
# How to use the Model

You will first need credentialed PhysioNet access to use the model. Why? There is reasonable evidence that these models leak information from their training notes, especially the larger ones. Releasing a model that leaks these notes would violate the data-use agreement. To get PhysioNet access, you must pass the CITI training.

Once you have PhysioNet access, download the model by running:

```
wget -r -N -c -np --user "INSERT_USER" --ask-password https://physionet.org/files/clinical-t5/1.0.0/
```

Then, you can load the model and tokenizer:

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(PATH_TO_MODEL_FOLDER)
model = AutoModelForSeq2SeqLM.from_pretrained(PATH_TO_MODEL_FOLDER)
```

# Questions?

If you have any questions about using the models, please email eric@xyla.com.