RD-Embed
RD-Embed is a rare-disease representation model that maps clinical narratives, HPO phenotype profiles, SNOMED-CT codes, and disease/gene descriptions into a shared 512-dimensional embedding space, enabling diagnosis retrieval, gene prioritization, differential ranking, and phenotype inference by simple cosine similarity. It is built in three progressive stages on top of MedEmbed-large-v0.1: Stage 1 applies ontology-aware contrastive learning over HPO, OMIM, and Orphanet to align entities with biomedical structure; Stage 2 aligns a GatorTron clinical-text encoder (with an optional SNOMED-CT bridge) to the Stage 1 entity space so that free-text and EHR notes can be matched against curated disease and gene concepts; and Stage 3 applies a Heterogeneous Graph Transformer (HGT) over the disease–phenotype–gene knowledge graph to refine embeddings with multi-hop relational context. The released checkpoints expose each configuration individually: Stage 1, HGT (to produce Stage 3 from Stage 1 + HGT), Stage 2, and the Stage 2+3 aligned (Stage 2 + HGT) model.
Across seven evaluations spanning eleven clinical corpora, RD-Embed substantially outperforms general biomedical language models. Full details on the evaluation protocol and results are provided in the paper. The model stages are complementary: Stage 1+3 is best for graph-based reasoning (diagnosis, gene identification, zero-shot transfer), while Stage 1's linear space is better suited to incremental phenotype aggregation, so the choice of model is task-dependent.
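As an illustration of what incremental phenotype aggregation can look like in a linear embedding space, the sketch below keeps a running mean of phenotype vectors as terms are added one at a time. This card does not specify the aggregation method, so mean pooling is an assumption here, and `RunningProfile` is a hypothetical helper rather than part of the released code. Toy 3-dimensional vectors stand in for the 512-dimensional embeddings.

```python
class RunningProfile:
    """Running mean of phenotype embeddings, updated one term at a time.

    Assumption: a patient profile is the mean of its phenotype vectors.
    The model card does not state the aggregation actually used.
    """

    def __init__(self, dim):
        self.mean = [0.0] * dim
        self.count = 0

    def add(self, vec):
        """Fold one more phenotype embedding into the running mean."""
        if len(vec) != len(self.mean):
            raise ValueError("embedding has the wrong dimensionality")
        self.count += 1
        # Incremental mean update: m_new = m_old + (x - m_old) / n
        self.mean = [m + (x - m) / self.count for m, x in zip(self.mean, vec)]
        return self.mean


# Placeholder vectors, not real model output.
profile = RunningProfile(dim=3)
profile.add([1.0, 0.0, 0.0])  # first observed phenotype
profile.add([0.0, 1.0, 0.0])  # second observed phenotype
```

After each `add`, the updated `profile.mean` can be compared against disease or gene embeddings, so the ranking can be refreshed as new phenotypes are recorded without re-encoding the earlier ones.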
Model Details
Model Description
- Language(s) (NLP): English
- License: apache-2.0
- Finetuned from model: abhinand/MedEmbed-large-v0.1 and UFNLP/gatortron-base
Model Sources
Uses
The model is intended for use in medical and clinical contexts to improve information retrieval, question answering, and semantic search tasks. It can be integrated into healthcare systems, research tools, and medical literature databases to enhance search capabilities and information access.
Direct Use
Ranking rare diseases and genes from free-text input, HPO terms, or SNOMED-CT codes.
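Because queries and candidates live in the same embedding space, ranking reduces to sorting candidates by cosine similarity to the query vector. The following is a minimal sketch of that step only. It assumes the embeddings have already been computed, and it does not show how to load or call the model. The function names, labels, and toy 3-dimensional vectors are hypothetical placeholders for the real 512-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_by_cosine(query, candidates, top_k=None):
    """Return (label, score) pairs sorted by descending cosine similarity."""
    scored = [(label, cosine(query, vec)) for label, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Placeholder data: in practice the query would be the embedding of a
# clinical note or phenotype profile, and the candidates would be
# disease or gene embeddings.
query_vec = [1.0, 0.0, 1.0]
disease_vecs = {
    "disease_A": [1.0, 0.1, 0.9],
    "disease_B": [0.0, 1.0, 0.0],
    "disease_C": [-1.0, 0.0, -1.0],
}
ranking = rank_by_cosine(query_vec, disease_vecs)
```

The same function applies unchanged to gene prioritization by passing gene embeddings as `candidates`. For large candidate sets, a vectorized or approximate nearest-neighbour search would be the more practical choice.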
Bias, Risks, and Limitations
Users should be aware of potential biases in medical data and the ethical implications of AI in healthcare. This model should be used as a tool to assist, not replace, human expertise in medical decision-making.
Training Details
Training data and regime are provided in the publication (https://www.medrxiv.org/content/10.64898/2026.04.02.26350083v1).
Evaluation
Complete evaluation protocol and results across 9 corpora and several tasks are provided in the paper and its Supplementary Materials (https://www.medrxiv.org/content/10.64898/2026.04.02.26350083v1).
Citation
BibTeX:
@article{groza2026rdembed,
title={RD-Embed: Unified representations of rare-disease knowledge from clinical records},
author={Tudor Groza and Freddie Tan and Noah Tian Run Lim and Maheshwaran Windersalam Shanmugasundar and Jhanvi Kappaganthu and Jane Andrea Lieviant and Neerja Karnani and Haichao Chen and Tien Y Wong and Saumya Shekhar Jamuar},
journal={medRxiv},
pages={2026.04.02.26350083},
year={2026},
doi={10.64898/2026.04.02.26350083}
}
Model tree for tudorgroza/rd-embed
Base model
UFNLP/gatortron-base