RD-Embed

RD-Embed is a rare-disease representation model that maps clinical narratives, HPO phenotype profiles, SNOMED-CT codes, and disease/gene descriptions into a shared 512-dimensional embedding space, enabling diagnosis retrieval, gene prioritization, differential ranking, and phenotype inference by simple cosine similarity. It is built in three progressive stages on top of MedEmbed-large-v0.1: Stage 1 applies ontology-aware contrastive learning over HPO, OMIM, and Orphanet to align entities with biomedical structure; Stage 2 aligns a GatorTron clinical-text encoder (with an optional SNOMED-CT bridge) to the Stage 1 entity space so that free-text and EHR notes can be matched against curated disease and gene concepts; and Stage 3 applies a Heterogeneous Graph Transformer (HGT) over the disease–phenotype–gene knowledge graph to refine embeddings with multi-hop relational context. The released checkpoints expose each configuration individually: Stage 1, HGT (to produce Stage 3 from Stage 1 + HGT), Stage 2, and the Stage 2+3 aligned (Stage 2 + HGT) model.
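The cosine-similarity retrieval described above can be sketched as follows. This is a minimal illustration, not the released inference code: the helper names `cosine` and `rank_by_cosine`, the candidate labels, and the 3-dimensional toy vectors are all hypothetical stand-ins for real 512-dimensional RD-Embed embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query, candidates):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins (assumption): real query and entity vectors would come
# from the model and have 512 dimensions.
query = [0.9, 0.1, 0.0]
candidates = {
    "disease_A": [1.0, 0.0, 0.0],
    "disease_B": [0.0, 1.0, 0.0],
    "disease_C": [0.5, 0.5, 0.0],
}
ranking = rank_by_cosine(query, candidates)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Because the query, disease, and gene vectors share one space, the same ranking function would apply whether the candidates are diseases (diagnosis retrieval) or genes (gene prioritization).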

Across seven evaluations spanning eleven clinical corpora, RD-Embed substantially outperforms general biomedical language models. Full details on the evaluation protocol and results are provided in the paper. The model stages are complementary: Stage 1+3 is best for graph-based reasoning (diagnosis, gene identification, zero-shot transfer), while Stage 1's linear space is better for incremental phenotype aggregation, so model choice is task-dependent.
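One way to picture incremental phenotype aggregation is a running mean over phenotype embeddings, updated as each new HPO term is observed. This is only a sketch under that assumption: the aggregation actually used is described in the paper and may differ, and `update_mean` and the 2-dimensional toy vectors are hypothetical.

```python
def update_mean(mean, count, new_vec):
    """Fold one more vector into a running mean; returns (mean, count)."""
    count += 1
    if mean is None:
        # First observation: the mean is the vector itself.
        return list(new_vec), count
    # Standard incremental mean update, applied per dimension.
    return [m + (x - m) / count for m, x in zip(mean, new_vec)], count

# Toy phenotype vectors (assumption): real ones would be 512-dimensional
# Stage 1 embeddings of HPO terms.
observed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

profile, n = None, 0
for vec in observed:
    profile, n = update_mean(profile, n, vec)
```

The patient profile can be re-ranked against disease or gene embeddings after every update, without recomputing the aggregate from scratch.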

Model Details

Model Description

Language(s) (NLP): English
License: apache-2.0
Finetuned from model: abhinand/MedEmbed-large-v0.1 and UFNLP/gatortron-base

Model Sources

Paper: https://www.medrxiv.org/content/10.64898/2026.04.02.26350083v1

Uses

The model is intended for use in medical and clinical contexts to improve information retrieval, question answering, and semantic search tasks. It can be integrated into healthcare systems, research tools, and medical literature databases to enhance search capabilities and information access.

Direct Use

Ranking rare diseases and genes from free-text input, HPO terms, or SNOMED-CT codes.

Bias, Risks, and Limitations

Users should be aware of potential biases in medical data and the ethical implications of AI in healthcare. This model should be used as a tool to assist, not replace, human expertise in medical decision-making.

Training Details

Training data and regime are provided in the publication (https://www.medrxiv.org/content/10.64898/2026.04.02.26350083v1).

Evaluation

Complete evaluation protocol and results across 9 corpora and several tasks are provided in the paper and its Supplementary Materials (https://www.medrxiv.org/content/10.64898/2026.04.02.26350083v1).

Citation

BibTeX:

@article{groza2026rdembed,
  title={RD-Embed: Unified representations of rare-disease knowledge from clinical records},
  author={Tudor Groza and Freddie Tan and Noah Tian Run Lim and Maheshwaran Windersalam Shanmugasundar and Jhanvi Kappaganthu and Jane Andrea Lieviant and Neerja Karnani and Haichao Chen and Tien Y Wong and Saumya Shekhar Jamuar},
  journal={medRxiv},
  pages={2026.04.02.26350083},
  year={2026},
  doi={10.64898/2026.04.02.26350083}
}
