---
license: mit
---

# Clinical-T5 Models

We train four T5 variants on the union of MIMIC-III and MIMIC-IV: (1) initialized from T5-Base, (2) initialized from SciFive-Base, (3) T5-Base trained from scratch, and (4) T5-Large trained from scratch.

This model card describes the T5-Large model trained from scratch on MIMIC notes.

# Model Pretraining

In this section, we describe the pretraining procedure.

### Pretraining Data

We train on the union of MIMIC-III and MIMIC-IV. MIMIC-III contains a wide variety of note types, whereas MIMIC-IV contains only radiology reports and discharge summaries. We remove duplicate notes. This results in ~1.2B words.

### Note Preprocessing

We perform two important preprocessing steps:

* We replace all DEID tags with special tokens. For example, `"The patient, [**First Name 123**], has a history of high blood pressure"` becomes `"The patient, [NAME], has a history of high blood pressure"`.
* We remove duplicate notes based on edit times. Roughly 300M of the 800M words from MIMIC-III are repeats of the same note with only a few words changed! This happens because a nurse might save a note and then edit it 10 minutes later; both versions appear in the data.

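The DEID replacement step can be sketched as below. Note that the regex and the tag-to-token mapping here are illustrative assumptions for demonstration, not the authors' exact preprocessing script:

```
import re

# MIMIC DEID placeholders look like [**First Name 123**] or [**2145-3-1**].
DEID_PATTERN = re.compile(r"\[\*\*(.*?)\*\*\]")

def replace_deid(text: str) -> str:
    """Map each DEID placeholder to a coarse special token (assumed mapping)."""
    def to_token(match: re.Match) -> str:
        tag = match.group(1).lower()
        if "name" in tag:
            return "[NAME]"
        if "date" in tag or "month" in tag or "year" in tag:
            return "[DATE]"
        return "[DEID]"  # fallback for tags not covered above
    return DEID_PATTERN.sub(to_token, text)

note = "The patient, [**First Name 123**], has a history of high blood pressure"
print(replace_deid(note))
# → The patient, [NAME], has a history of high blood pressure
```
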
### Pretraining Procedures

We train the Clinical-T5-Large model from scratch using a cased vocabulary of 32,000. We train it for 780,000 steps, using a batch size of 12 sequences per TPU pod (8 pods total) and a sequence length of 512. This results in a batch size of 49,152 tokens per step. Accounting for the number of steps, this equates to ~38B tokens. We were aiming for 40B, but our Google Cloud instance broke!

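The token-count arithmetic above can be checked directly (interpreting the 49,152 figure as tokens per step, which is what 12 × 8 × 512 gives):

```
# Sanity-check the batch-size and token-count figures quoted above.
sequences_per_step = 12 * 8                    # 12 per pod × 8 pods
tokens_per_step = sequences_per_step * 512     # × sequence length
total_tokens = tokens_per_step * 780_000       # × number of steps

print(tokens_per_step)  # 49152
print(total_tokens)     # 38338560000, i.e. ~38.3B tokens
```
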
# How to use the Model

You will first need credentialed PhysioNet access to use the model. Why? There is reasonable evidence that these models leak information from their training notes, especially the larger ones. Releasing a model that leaks these notes would violate the data-use agreement. To get PhysioNet access, you must pass the CITI training.

Once you have PhysioNet access, download the model by running:

```
wget -r -N -c -np --user "INSERT_USER" --ask-password https://physionet.org/files/clinical-t5/1.0.0/
```

Then, you can load the model and tokenizer:

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(PATH_TO_MODEL_FOLDER)
model = AutoModelForSeq2SeqLM.from_pretrained(PATH_TO_MODEL_FOLDER)
```

# Questions?

If you have any questions about using the models, please email eric@xyla.com.