CLMBR-T-Base

CLMBR-T-Base (CLMBR-Transformer-Base) is a 141 million parameter autoregressive foundation model pretrained on 2.57 million deidentified EHRs from Stanford Medicine.

This is the model from Wornow et al. 2023. It is based on the CLMBR architecture originally described in Steinberg et al. 2021, with the original GRU replaced by a Transformer.

As input, this model expects a sequence of coded medical events that have been mapped to Standard Concepts within the OMOP-CDM vocabulary. The model generates representations of patients which can then be used for downstream prediction tasks.

Input patients should be provided in the MEDS schema.
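
The following is a minimal sketch of assembling a patient in the nested layout used by the example patient later in this card (a `patient_id` plus a list of timestamped `events`, each holding `measurements` with a `code`). The helper name `build_patient` and the flat `(time, code)` input rows are hypothetical and only serve the illustration; consult the MEDS documentation for the authoritative schema.

```python
import datetime
from collections import defaultdict


def build_patient(patient_id, rows):
    """Group flat (time, code) rows into a nested patient dictionary.

    `rows` is a hypothetical list of (datetime, code) tuples, e.g. the
    result of a query against an OMOP-formatted table.
    """
    by_time = defaultdict(list)
    for time, code in rows:
        by_time[time].append({'code': code})

    # Events are emitted in chronological order.
    events = [
        {'time': time, 'measurements': by_time[time]}
        for time in sorted(by_time)
    ]
    return {'patient_id': patient_id, 'events': events}


rows = [
    (datetime.datetime(2012, 6, 9), 'Visit/OP'),
    (datetime.datetime(2011, 5, 8), 'SNOMED/184099003'),
    (datetime.datetime(2011, 5, 8), 'Visit/IP'),
    (datetime.datetime(2012, 6, 9), 'SNOMED/3950001'),
]
patient = build_patient(30, rows)
```

Codes recorded at the same timestamp end up in a single event, in the order they were supplied.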

Model Details

Model Description

Developed by: Shah lab @ Stanford University
Funded by: Stanford Healthcare
Shared by: Shah lab @ Stanford University
Model type: CLMBR (Steinberg et al. 2021)
Language(s) (NLP): Electronic health record codes
License: CC BY-NC 4.0
Finetuned from model: N/A -- trained from scratch

Model Sources

Website: https://ehrshot.stanford.edu/
GitHub: https://github.com/som-shahlab/ehrshot-benchmark/
Paper: EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models

Uses

This model is intended to generate representations for patients based on the structured data within their electronic health record. These representations can then be used for downstream tasks such as predicting diagnoses, detecting anomalies, or doing propensity score matching for causal inference.
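
As one illustration of a downstream task, the sketch below fits a small logistic-regression probe on frozen patient representations. The vectors are synthetic 8-dimensional stand-ins (not model outputs), and the helper names `train_logistic_probe` and `predict_proba` are hypothetical; in practice a library classifier would likely replace this hand-written loop.

```python
import math
import random


def train_logistic_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(weights, bias, x):
    """Probability of the positive label for one representation."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic stand-ins for patient representations (assumption: real
# representations would come from the model and be much wider).
random.seed(0)
dim = 8
positives = [[random.gauss(1.0, 0.5) for _ in range(dim)] for _ in range(20)]
negatives = [[random.gauss(-1.0, 0.5) for _ in range(dim)] for _ in range(20)]
features = positives + negatives
labels = [1] * 20 + [0] * 20

weights, bias = train_logistic_probe(features, labels)
```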

Direct Use

You will likely want to tune the model for your downstream use case.

Out-of-Scope Use

This model is for research purposes only. It is not for use in any real-world decision making that impacts patients, providers, or hospital operations.

Bias, Risks, and Limitations

This model was trained on a corpus of 2.57 million patients from Stanford Medicine. The model will thus reflect the patterns of how care is delivered at Stanford Medicine, in addition to the racial and socioeconomic makeup of Stanford Medicine's patient base. This model may not generalize well to other hospitals and demographic mixes.

While this is technically a generative model, we have not tested its generative abilities and thus do not anticipate it being used to generate synthetic EHR records. We aim to explore its generative abilities in future work.

How to Get Started with the Model

Use the code below to get started with the model.

First, install the necessary libraries.

pip install torch==2.1.1 femr==0.2.3 datasets==2.15.0 xformers transformers==4.35.2

Second, use the following Python script to run inference on a single patient:

import femr.models.transformer
import torch
import femr.models.tokenizer
import femr.models.processor
import datetime

model_name = "StanfordShahLab/clmbr-t-base"

# Load tokenizer / batch loader
tokenizer = femr.models.tokenizer.FEMRTokenizer.from_pretrained(model_name)
batch_processor = femr.models.processor.FEMRBatchProcessor(tokenizer)

# Load model
model = femr.models.transformer.FEMRModel.from_pretrained(model_name)

# Create an example patient to run inference on
# This patient follows the MEDS schema: https://github.com/Medical-Event-Data-Standard
example_patient = {
    'patient_id': 30,
    'events': [{
        'time': datetime.datetime(2011, 5, 8),
        'measurements': [
            {'code': 'SNOMED/184099003'},
            {'code': 'Visit/IP'},
        ],
    },
    {
        'time': datetime.datetime(2012, 6, 9),
        'measurements': [
            {'code': 'Visit/OP'},
            {'code': 'SNOMED/3950001'}
        ],
    }]
}

raw_batch = batch_processor.convert_patient(example_patient, tensor_type="pt")
batch = batch_processor.collate([raw_batch])

# Run model
with torch.no_grad():
    _, result = model(**batch)
    print(result['timestamps'].cpu().numpy().astype('datetime64[s]'))
    print(result['patient_ids'])
    print(result['representations'])
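
For a downstream prediction made at a given time, one common need is the most recent representation available for a patient at or before that time. The sketch below shows that selection on plain Python lists. It assumes the three printed outputs are parallel sequences (one patient id, timestamp, and representation per position), which is an assumption about the output layout rather than something this card states; `latest_representation` is a hypothetical helper, not part of FEMR.

```python
import datetime


def latest_representation(patient_ids, timestamps, representations,
                          patient_id, as_of):
    """Return the newest representation for `patient_id` at or before `as_of`.

    Returns None if the patient has no representation by that time.
    """
    best = None
    for pid, ts, rep in zip(patient_ids, timestamps, representations):
        if pid != patient_id or ts > as_of:
            continue
        if best is None or ts >= best[0]:
            best = (ts, rep)
    return None if best is None else best[1]


# Toy values standing in for converted model outputs.
patient_ids = [30, 30, 31]
timestamps = [
    datetime.datetime(2011, 5, 8),
    datetime.datetime(2012, 6, 9),
    datetime.datetime(2012, 1, 1),
]
representations = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

rep = latest_representation(
    patient_ids, timestamps, representations,
    patient_id=30, as_of=datetime.datetime(2012, 1, 1),
)
```

With these toy values, `rep` is the 2011 representation, because the 2012 one falls after the requested time.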

Training Details

Full training details are provided in our accompanying paper, EHRSHOT (Wornow et al. 2023).

Training Data

The model is trained on 2.57 million patients from the Stanford Medicine Research Data Repository (STARR), which contains EHR data from both Stanford Health Care (primarily adult care) and Lucile Packard Children’s Hospital (primarily pediatric care). The dataset contains only structured data (i.e. no clinical text or images) and covers demographics (e.g. age, sex, race), diagnoses, procedures, laboratory results, medication prescriptions, and other coded clinical observations. The data is formatted according to the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM). All data that we work with is deidentified.

Training Procedure

We train our model using an autoregressive next-code prediction objective, i.e. the model predicts the next code in a patient's timeline given their previous codes.
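
To make the objective concrete, the sketch below turns one patient's code timeline into (context, target) training pairs and scores a single prediction with cross-entropy. It illustrates the general technique only, not the FEMR training code; `next_code_examples` and `cross_entropy` are hypothetical helpers, and the probabilities are made up.

```python
import math


def next_code_examples(codes):
    """Pair every prefix of the timeline with the code that follows it."""
    return [(codes[:i], codes[i]) for i in range(1, len(codes))]


def cross_entropy(predicted_probs, target):
    """Negative log-probability assigned to the true next code."""
    return -math.log(predicted_probs[target])


timeline = ['SNOMED/184099003', 'Visit/IP', 'Visit/OP', 'SNOMED/3950001']
examples = next_code_examples(timeline)

# A made-up predictive distribution for the first context.
probs = {'Visit/IP': 0.5, 'Visit/OP': 0.25, 'SNOMED/3950001': 0.25}
loss = cross_entropy(probs, examples[0][1])
```

A timeline of four codes yields three training pairs, and the loss shrinks as the model puts more probability on the code that actually comes next.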

Preprocessing

We use the FEMR Python library for data preprocessing.

Training Hyperparameters

Learning rate: 0.00001
Context window size: 496
Internal dropout: 0
Layers: 12
Hidden dimension: 768

Evaluation

We evaluate this model on the EHRSHOT benchmark.

The benchmark, its tasks, and the results are detailed in Wornow et al. 2023.

Technical Specifications

This model uses the CLMBR architecture from (Steinberg et al. 2021). The objective is an autoregressive next-token prediction task. Please see Wornow et al. 2023 for more details on the specific model architecture.

Vocabulary

CLMBR is a language model and requires defining a token vocabulary V. However, unlike natural languages, the vocabulary of a structured EHR language model is defined by medical codes. Here tokens map to standardized concepts in medical ontologies. Since the union of all tokens from all ontologies, V_all, results in a prohibitively large vocabulary, we derive ~V by filtering to the top k most frequent codes as follows:

Knowledge Graphs (G): A set of n medical ontologies (knowledge graphs), G = ({G_1, G_2, ..., G_n}), defined by Athena's OMOP Vocabulary List.
Medical Codes as Tokens: Each knowledge graph G_i has a set of unique medical codes M_i. The union of all these codes serves as the set of tokens in our complete vocabulary V_all = M_1 ∪ M_2 ∪ ... ∪ M_n. Our final, filtered vocabulary is then ~V = sort_freq(V_all)[1:k], where frequency is calculated over our STARR EHR OMOP dataset.
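
The filtering step ~V = sort_freq(V_all)[1:k] can be sketched as a frequency count followed by a top-k cut. This is an illustration under simple assumptions (codes are counted once per occurrence across all patients, ties keep first-seen order), not the procedure used to build the released vocabulary; `build_vocabulary` is a hypothetical helper.

```python
from collections import Counter


def build_vocabulary(patient_code_sequences, k):
    """Keep the k most frequent codes and assign each a token id."""
    counts = Counter(
        code for sequence in patient_code_sequences for code in sequence
    )
    top_codes = [code for code, _ in counts.most_common(k)]
    return {code: token_id for token_id, code in enumerate(top_codes)}


# Toy corpus: one list of codes per patient.
corpus = [
    ['SNOMED/184099003', 'Visit/IP', 'SNOMED/184099003'],
    ['SNOMED/184099003', 'Visit/OP', 'Visit/IP'],
    ['LOINC/31790-9'],
]
vocab = build_vocabulary(corpus, k=2)
```

Codes outside the top k (here `Visit/OP` and `LOINC/31790-9`) receive no token id and would have to be dropped or mapped elsewhere before being fed to a model.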

CLMBR Vocabulary Summary

21 Source Ontologies/Knowledge Graphs
65,536 tokens (2^16, the number of distinct values a uint16_t can hold)

PREFIX	SOURCE	SIZE	EXAMPLE TOKENS
LOINC	Logical Observation Identifiers Names and Codes (Regenstrief Institute)	37,590	31790-9, 20449-5
SNOMED	Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO)	18,174	105013009, 200755008
RxNorm	RxNorm (NLM)	4,678	2375327, 372375
CPT4	Current Procedural Terminology version 4 (AMA)	3,730	00790, 36818
RxNorm Extension	OMOP RxNorm Extension	255	OMOP358911, OMOP2153393
ICD10PCS	ICD-10 Procedure Coding System (CMS)	233	10907ZC, 4A0234Z
ICD9Proc	International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS)	196	68.29, 03.93
Cancer Modifier	Diagnostic Modifiers of Cancer (OMOP)	88	c-8th_AJCC/UICC-Stage-2C, p-7th_AJCC/UICC-Stage-3B
HCPCS	Healthcare Common Procedure Coding System (CMS)	54	C1878, P7001
ICDO3	International Classification of Diseases for Oncology, Third Edition (WHO)	52	NULL-C34.8, C56.9
CVX	CDC Vaccine Administered CVX (NCIRD)	41	151, 158
Domain	OMOP	27	OMOP generated
Race	Race and Ethnicity Code Set (USBC)	5	5, 4
OMOP Extension	OMOP Extension (OHDSI)	3	OMOP5160861, OMOP4912978
Gender	OMOP Gender	2	F, M
Ethnicity	OMOP Ethnicity	2	Not Hispanic, Hispanic
CMS Place of Service	Place of Service Codes for Professional Claims (CMS)	2	OMOP4822036, 02
Medicare Specialty	Medicare provider/supplier specialty codes (CMS)	1	A0
Condition Type	OMOP	1	OMOP4822053
CARE_SITE	STANFORD_CUSTOM	396	7930934, 7929373
Visit	STANFORD_CUSTOM	6	ERIP, ER
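
As a quick consistency check, the per-ontology sizes in the table above can be summed to confirm they match the stated totals of 21 sources and 65,536 tokens. The numbers below are copied from the table; nothing else is assumed.

```python
# Vocabulary sizes per source ontology, copied from the table above.
sizes = {
    'LOINC': 37590,
    'SNOMED': 18174,
    'RxNorm': 4678,
    'CPT4': 3730,
    'RxNorm Extension': 255,
    'ICD10PCS': 233,
    'ICD9Proc': 196,
    'Cancer Modifier': 88,
    'HCPCS': 54,
    'ICDO3': 52,
    'CVX': 41,
    'Domain': 27,
    'Race': 5,
    'OMOP Extension': 3,
    'Gender': 2,
    'Ethnicity': 2,
    'CMS Place of Service': 2,
    'Medicare Specialty': 1,
    'Condition Type': 1,
    'CARE_SITE': 396,
    'Visit': 6,
}

num_sources = len(sizes)
total_tokens = sum(sizes.values())

print(num_sources)   # 21
print(total_tokens)  # 65536
```

The total equals 2^16, so every token id fits in an unsigned 16-bit integer.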

Citation

BibTeX:

Please cite the following papers if you use CLMBR-T-Base in your work.

@inproceedings{wornow2023ehrshot,
  title={EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models},
  author={Michael Wornow and Rahul Thapa and Ethan Steinberg and Jason Fries and Nigam Shah},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}

@article{guo2024multi,
  title={A multi-center study on the adaptability of a shared foundation model for electronic health records},
  author={Guo, Lin Lawrence and Fries, Jason and Steinberg, Ethan and Fleming, Scott Lanyon and Morse, Keith and Aftandilian, Catherine and Posada, Jose and Shah, Nigam and Sung, Lillian},
  journal={NPJ Digital Medicine},
  volume={7},
  number={1},
  pages={171},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

Model Card Authors

Michael Wornow, Ethan Steinberg, Rahul Thapa, Jason Fries, Nigam H. Shah

Model Card Contact

Michael Wornow (mwornow@stanford.edu)
