DeepSpotM is released for non-commercial academic research only, under CC-BY-NC-SA-4.0. Requests with vague or insufficient descriptions of intended use will be declined.

DeepSpot-M

DeepSpot-M: a multimodal foundation model for transcriptome-wide virtual spatial transcriptomics from histology.

DeepSpot-M is a multimodal foundation model that maps a histology image tile to spatial gene expression. It tokenises a 224x224 H&E tile with a LoRA-adapted pathology foundation backbone (Midnight) and lets each gene query attend to the patch tokens through a cross-attention gene decoder. A gene router hypernetwork generates gene-specific output projections from frozen biological embeddings drawn from DNA, RNA, protein, single-cell and text foundation models (Evo 2, Orthrus, ProtT5, scGPT, Apertus). Because genes are represented as queryable embeddings rather than fixed outputs, one model predicts transcriptome-wide expression and genes it never saw during training.

Code is available on GitHub at github.com/ratschlab/DeepSpotM.

Fig. DeepSpot-M predicts transcriptome-wide spatial gene expression from histology. A 224x224 H&E tile is tokenised into spatial patch embeddings by a LoRA-adapted pathology foundation model. A cross-attention gene decoder lets each gene query independently attend to patch tokens via multi-head attention, and a gene router hypernetwork generates gene-specific output projections from frozen biological embeddings drawn from DNA, RNA, protein, single-cell and text foundation models. This design enables zero-shot prediction, at inference time, of genes not seen during training.

⚠️ Research use only. Not for clinical or diagnostic use.

Model description

DeepSpot-M adapts the Midnight pathology backbone with LoRA and feeds its patch tokens to a cross-attention gene decoder conditioned on biological gene embeddings. It takes 224x224 H&E tiles as input and outputs expression over the ~19k-gene panel listed in tokens.csv. Five embedding sources are available (evo2, orthrus, prott5, scgpt and apertus); choose one with the source= argument of DeepSpotM.from_pretrained.
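As a rough illustration of the decoding idea described above, the toy sketch below runs single-head cross-attention for one gene query in plain Python. It is not the DeepSpot-M implementation: the function decode_gene, the matrices w_query and w_router, and all dimensions are hypothetical, and the real model uses multi-head attention over Midnight patch tokens with learned weights.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_gene(gene_embedding, patch_tokens, w_query, w_router):
    """Toy single-head cross-attention for one gene query (illustrative only).

    gene_embedding: frozen embedding of the gene, length d_gene
    patch_tokens:   patch tokens from a vision backbone, n_patches x d_model
    w_query:        hypothetical projection from gene embedding to query
    w_router:       hypothetical router weights producing the output projection
    """
    d_model = len(patch_tokens[0])
    query = matvec(w_query, gene_embedding)
    # Scaled dot-product score between the gene query and every patch token.
    scores = [
        sum(q * t for q, t in zip(query, token)) / math.sqrt(d_model)
        for token in patch_tokens
    ]
    weights = softmax(scores)
    # Attention-weighted summary of the tile for this gene.
    context = [
        sum(w * token[j] for w, token in zip(weights, patch_tokens))
        for j in range(d_model)
    ]
    # The router maps the gene embedding to a gene-specific output projection.
    out_proj = matvec(w_router, gene_embedding)
    expression = sum(p * c for p, c in zip(out_proj, context))
    return expression, weights


rng = random.Random(0)
d_gene, d_model, n_patches = 4, 3, 5
patch_tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_patches)]
w_query = [[rng.gauss(0, 1) for _ in range(d_gene)] for _ in range(d_model)]
w_router = [[rng.gauss(0, 1) for _ in range(d_gene)] for _ in range(d_model)]

# Any gene with an embedding can be queried, including one absent from training.
unseen_gene = [rng.gauss(0, 1) for _ in range(d_gene)]
value, attention = decode_gene(unseen_gene, patch_tokens, w_query, w_router)
```

Because the gene enters only through its embedding, the same function accepts an embedding for any gene, which is the property that allows zero-shot prediction.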

Usage

from deepspotm import DeepSpotM   # pip install git+https://github.com/ratschlab/DeepSpotM.git

model, image_processor = DeepSpotM.from_pretrained(
    "ratschlab/DeepSpotM",
    source="scgpt",   # one of evo2, orthrus, prott5, scgpt, apertus
)

import torch
tile = image_processor(my_pil_tile).unsqueeze(0)   # 224x224 H&E tile
with torch.no_grad():
    expression, _, _ = model(tile)                 # (1, 19338)

# Output column i corresponds to model.gene_names[i].
preds = dict(zip(model.gene_names, expression.squeeze(0).tolist()))
print(preds["EPCAM"])

The predicted vector follows the order of model.gene_names (the genes listed in tokens.csv), so model.gene_names[i] is the symbol for output column i.
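Since symbols and output columns share one index, ranking genes for a tile needs only a sort over the paired lists. The helper below is a minimal sketch; top_k_genes is a hypothetical name, not part of the deepspotm package, and the toy lists stand in for model.gene_names and expression.squeeze(0).tolist().

```python
def top_k_genes(gene_names, values, k=5):
    """Return the k (symbol, value) pairs with the highest predicted value."""
    if len(gene_names) != len(values):
        raise ValueError("gene_names and values must have the same length")
    ranked = sorted(zip(gene_names, values), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy stand-ins; with a loaded model these would be model.gene_names and
# expression.squeeze(0).tolist() from the usage snippet.
names = ["EPCAM", "CD3D", "PTPRC", "COL1A1"]
vals = [0.8, 0.1, 0.3, 1.2]
top2 = top_k_genes(names, vals, k=2)
```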

Predict only specific genes (faster)

You don't have to predict all ~19k genes. Pass a gene or a list and only those are computed, because the cross-attention runs over just the requested gene queries.

vals = model.predict_genes(tile, ["EPCAM", "CD3D", "PTPRC"])   # (1, 3)
vals = model.predict_genes(tile, "EPCAM")                       # (1, 1)

Output columns follow the requested order. Unknown symbols raise KeyError.
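To avoid a KeyError on a mixed list, one option is to check the requested symbols against the panel before calling predict_genes. This is a sketch under the assumption that model.gene_names holds every valid symbol; split_known_genes is a hypothetical helper, not part of the package.

```python
def split_known_genes(requested, gene_names):
    """Partition requested symbols into (known, unknown), preserving order."""
    available = set(gene_names)
    known = [g for g in requested if g in available]
    unknown = [g for g in requested if g not in available]
    return known, unknown


panel = ["EPCAM", "CD3D", "PTPRC"]  # stand-in for model.gene_names
known, unknown = split_known_genes(["EPCAM", "NOT_A_GENE", "PTPRC"], panel)
# With a loaded model: vals = model.predict_genes(tile, known)
```

The known list keeps the requested order, so the output columns of predict_genes still line up with it.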

The vision backbone is built offline from a bundled config and its weights are baked into model.safetensors, so loading needs no network access to the upstream backbone repo.

Tutorial

examples/predict_tcga_skcm.ipynb runs DeepSpot-M end to end on a whole-slide TCGA-SKCM H&E image. It tiles the slide, predicts BRAF, CD37 and COL1A1, and overlays the predictions on the tissue.
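The tiling step can be pictured as a regular grid of 224x224 windows over the slide. The sketch below only computes tile origins; it is an assumption about the general approach, and the notebook may differ in stride, magnification handling and tissue filtering. tile_origins is a hypothetical name.

```python
def tile_origins(width, height, tile=224, stride=224):
    """Yield (x, y) top-left corners of tiles that fit fully inside the image."""
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            yield x, y


origins = list(tile_origins(width=1000, height=500))
# For a PIL image `img`, img.crop((x, y, x + 224, y + 224)) would extract
# the tile at each origin before passing it to image_processor.
```

Tiles that would extend past the right or bottom edge are skipped here rather than padded.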

Resources

Code: github.com/ratschlab/DeepSpotM
TCGA virtual spatial transcriptomics atlas of 28,664 slides across 32 cancers: ratschlab/TCGA_virtual_spatial_transcriptomics_atlas
HEST-1K virtual single-cell Xenium profiles for 59 samples: ratschlab/HEST_Xenium_virtual_spatial_transcriptomics

Limitations and biases

Trained on a finite set of cancer indications. Performance on unseen tissue types, stains, scanners or resolutions may degrade.
Predicts relative expression rather than absolute counts. Under-sequenced genes are predicted less reliably.
Trained on oncology cohorts, so it is not representative of healthy tissue or non-oncology contexts. Not for clinical or diagnostic use.
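Because the outputs are relative rather than absolute, one reasonable way to read them is to compare a single gene across tiles of the same slide instead of interpreting raw values. The sketch below standardises one gene's predictions across tiles; zscore_across_tiles is a hypothetical helper, and this normalisation is an illustrative choice, not a step prescribed by the authors.

```python
from statistics import mean, pstdev


def zscore_across_tiles(values):
    """Standardise one gene's predictions across tiles (mean 0, unit variance)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant gene carries no spatial contrast; map every tile to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scaled = zscore_across_tiles([1.0, 2.0, 3.0])
```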

License

Weights: CC-BY-NC-SA-4.0 (non-commercial, ShareAlike, with attribution).
Code: github.com/ratschlab/DeepSpotM, under PolyForm Noncommercial 1.0.0.

See WEIGHTS_LICENSE.md and THIRD_PARTY_LICENSES.md.

Citation

Paper: DeepSpot-M: a multimodal foundation model for transcriptome-wide virtual spatial transcriptomics from histology (medRxiv, 2026).

@article{nonchev2026deepspotm,
  title   = {DeepSpot-M: a multimodal foundation model for transcriptome-wide virtual spatial transcriptomics from histology},
  author  = {Nonchev, Kalin and Dawo, Sebastian and Silina, Karina and Koelzer, Viktor H. and Raetsch, Gunnar},
  journal = {medRxiv},
  year    = {2026},
  doi     = {10.64898/2026.06.19.26356060},
  url     = {https://www.medrxiv.org/content/10.64898/2026.06.19.26356060v1}
}

Model tree for ratschlab/DeepSpotM

Base model: facebook/dinov2-giant
Finetuned: kaiko-ai/midnight
Adapter: this model