obi/deid_roberta_i2b2

Token Classification

deidentification

Model Description

A RoBERTa [Liu et al., 2019] model fine-tuned for de-identification of medical notes.
Sequence labeling (token classification): the model was trained to predict protected health information (PHI/PII) entity spans. HIPAA defines the list of protected health information categories.
Each token is classified either as non-PHI or as one of the 11 PHI types. Token-level predictions are aggregated into spans using BILOU tagging.
The PHI labels that were used for training and other details can be found here: Annotation Guidelines
More details on how to use this model, the data format, and other useful information are available in the GitHub repo: Robust DeID.
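
To illustrate how BILOU-tagged token predictions can be collapsed into spans, here is a minimal sketch. It is not the code from the Robust DeID repo: the `PREFIX-LABEL` tag format (e.g. `B-STAFF`), the function name `bilou_to_spans`, and the choice to drop malformed tag sequences are all assumptions made for illustration.

```python
from typing import List, Tuple


def bilou_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse BILOU token tags into (start, end_exclusive, label) spans.

    Assumes tags look like "B-STAFF", "I-STAFF", "L-STAFF", "U-DATE", or "O".
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "U":
            # Unit-length entity: a span of exactly one token.
            spans.append((i, i + 1, kind))
            start, label = None, None
        elif prefix == "B":
            # Beginning of a multi-token entity.
            start, label = i, kind
        elif prefix == "I" and start is not None and kind == label:
            # Inside the entity that is currently open.
            continue
        elif prefix == "L" and start is not None and kind == label:
            # Last token of the open entity: close the span.
            spans.append((start, i + 1, kind))
            start, label = None, None
        else:
            # "O" or an inconsistent tag: discard any span in progress
            # (an assumption; other recovery strategies are possible).
            start, label = None, None
    return spans


tokens = ["Seen", "by", "Dr.", "John", "Smith", "on", "01/02/2014"]
tags = ["O", "O", "O", "B-STAFF", "L-STAFF", "O", "U-DATE"]
spans = bilou_to_spans(tags)
for start, end, label in spans:
    print(label, tokens[start:end])
```

The spans are expressed as token indices; mapping them back to character offsets in the note would require the tokenizer's offsets, which this sketch does not cover.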

How to use

A demo of how the model works (using model predictions to de-identify a medical note) is available in this Space: Medical-Note-Deidentification.
Steps for running a forward pass with this model can be found here: Forward Pass
In brief, the steps are:
- Sentencize (the model aggregates the sentences back to the note level) and tokenize the dataset.
- Use the predict function of this model to gather the predictions (i.e., predictions for each token).
- Additionally, the model predictions can be used to remove PHI from the original note/text.
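
The last step, removing PHI from the original text, can be sketched as follows. This is a hypothetical helper, not the repo's implementation: it assumes predictions have already been converted to non-overlapping character-offset spans of the form `(start, end, label)`, and the `<<LABEL>>` placeholder format is an arbitrary choice.

```python
from typing import List, Tuple


def redact(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end_exclusive, label) character span with <<LABEL>>.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"<<{label}>>")      # placeholder for the PHI span
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last span
    return "".join(pieces)


note = "Pt seen by Dr. Smith on 01/02/2014."
name_start = note.index("Smith")
date_start = note.index("01/02/2014")
spans = [
    (name_start, name_start + len("Smith"), "STAFF"),
    (date_start, date_start + len("01/02/2014"), "DATE"),
]
redacted = redact(note, spans)
print(redacted)
```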

Dataset

The I2B2 2014 [Stubbs and Uzuner, 2015] dataset was used to train this model.

I2B2 2014 PHI label distribution (train set: 790 notes; test set: 514 notes)
PHI LABEL	TRAIN COUNT	TRAIN PERCENTAGE	TEST COUNT	TEST PERCENTAGE
DATE	7502	43.69	4980	44.14
STAFF	3149	18.34	2004	17.76
HOSP	1437	8.37	875	7.76
AGE	1233	7.18	764	6.77
LOC	1206	7.02	856	7.59
PATIENT	1316	7.66	879	7.79
PHONE	317	1.85	217	1.92
ID	881	5.13	625	5.54
PATORG	124	0.72	82	0.73
EMAIL	4	0.02	1	0.01
OTHERPHI	2	0.01	0	0
TOTAL	17171	100	11283	100

Training procedure

Steps on how this model was trained can be found here: Training. The "model_name_or_path" was set to: "roberta-large".
- The dataset was sentencized with the en_core_sci_sm sentencizer (a scispaCy model for spaCy).
- The dataset was then tokenized with a custom tokenizer built on top of the en_core_sci_sm tokenizer.
- For each sentence we added 32 tokens on the left (from previous sentences) and 32 tokens on the right (from the next sentences).
- The added tokens are not used for learning (i.e., the loss is not computed on them); they serve only as additional context.
- Each sequence contained a maximum of 128 tokens (including the added context tokens). Longer sequences were split.
- The sentencized and tokenized dataset with the token level labels based on the BILOU notation was used to train the model.
- The model is fine-tuned from a pre-trained RoBERTa model.
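
The context-window step above can be sketched roughly as follows. This is an illustrative reconstruction, not the training code: the function name `add_context` and the boolean loss mask are assumptions, and the sketch omits the splitting of sequences longer than 128 tokens. The example uses a context of 2 tokens instead of 32 to keep the output readable.

```python
from typing import List, Tuple


def add_context(
    sentences: List[List[str]], index: int, context: int = 32
) -> Tuple[List[str], List[bool]]:
    """Return sentence `index` padded with up to `context` tokens on each side.

    The second return value marks which tokens would contribute to the loss:
    True for the sentence's own tokens, False for the added context tokens.
    """
    before = [tok for sent in sentences[:index] for tok in sent]
    after = [tok for sent in sentences[index + 1:] for tok in sent]
    left = before[max(0, len(before) - context):]  # last `context` tokens
    right = after[:context]                        # first `context` tokens
    current = sentences[index]
    tokens = left + current + right
    loss_mask = [False] * len(left) + [True] * len(current) + [False] * len(right)
    return tokens, loss_mask


sentences = [
    ["Pt", "is", "stable", "."],
    ["Seen", "by", "Dr.", "Smith", "."],
    ["Follow", "up", "in", "May", "."],
]
tokens, loss_mask = add_context(sentences, index=1, context=2)
print(tokens)
print(loss_mask)
```

In a PyTorch training loop, one common way to apply such a mask is to set the labels of the context tokens to -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`; whether the original training code does exactly this is not stated here.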
Training details:
- Input sequence length: 128
- Batch size: 32 (16 with 2 gradient accumulation steps)
- Optimizer: AdamW
- Learning rate: 5e-5
- Dropout: 0.1
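
For reference, the hyperparameters above can be collected into a single configuration, sketched here as a Python dictionary. Only "model_name_or_path" is a key named in this card; every other key name is a hypothetical choice for illustration and may not match the configuration files in the repo. The effective batch size calculation assumes a single device.

```python
# Hyperparameters from the model card; key names other than
# "model_name_or_path" are assumed for illustration.
training_config = {
    "model_name_or_path": "roberta-large",
    "max_seq_length": 128,               # input sequence length
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
    "dropout": 0.1,
}

# Gradients are accumulated over 2 steps of 16 examples before each
# optimizer update, giving the batch size of 32 stated above
# (assuming training on a single device).
effective_batch_size = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
print(effective_batch_size)
```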

Results

Questions?

Post a GitHub issue on the repo: Robust DeID.

Model size: 354M params (Safetensors; tensor types: I64, F32)
