SpatialWhisperer

SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions into a shared 2048-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.

This repository hosts the main checkpoint (seed=0) from the ICML 2026 paper "Trimodal Learning Enhances Zero-Shot Histopathology Annotation", in which the model appears under an anonymized name.

Model architecture

Three encoders project into a shared embedding space:

Modality	Encoder	Freezing
Image (H&E)	UNI2	locked
Transcriptome	Geneformer (12L)	locked
Text	BioBERT v1.1	unfrozen

Following the LiT convention, the freezing pattern is LUL (image locked, text unlocked, transcriptome locked). Only the text tower and the three projection heads are trained. The projection dimension is 2048.
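The freezing pattern can be written down as a small configuration sketch. The `Tower` dataclass and the helper names below are illustrative assumptions, not identifiers from the model code; only the encoder names, lock states, and projection dimension come from this card.

```python
from dataclasses import dataclass

PROJECTION_DIM = 2048  # shared embedding dimension stated in this card


@dataclass(frozen=True)
class Tower:
    """Hypothetical description of one modality tower (not the repo's API)."""
    modality: str
    encoder: str
    locked: bool  # True: encoder weights stay frozen during training


# Order follows the pattern named in this card: image, text, transcriptome.
TOWERS = (
    Tower("image", "UNI2", locked=True),
    Tower("text", "BioBERT v1.1", locked=False),
    Tower("transcriptome", "Geneformer (12L)", locked=True),
)


def freezing_pattern(towers):
    """Return one letter per tower: 'L' for locked, 'U' for unlocked."""
    return "".join("L" if t.locked else "U" for t in towers)


def trainable_components(towers):
    """List the parts that receive gradients: unlocked encoders plus every projection head."""
    encoders = [f"{t.modality} encoder" for t in towers if not t.locked]
    heads = [f"{t.modality} projection head" for t in towers]
    return encoders + heads


print(freezing_pattern(TOWERS))       # prints LUL
print(trainable_components(TOWERS))
```

In an actual PyTorch setup, the locked towers would typically have gradients disabled on their parameters, while each projection head maps its encoder's output to `PROJECTION_DIM`.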

Training data

Three paired datasets cover the three modality pairs:

HEST-1K — H&E ↔ spatial gene expression (Visium-style spots)
cellxgene_census — gene expression ↔ free-text cell/sample metadata
ARCHS4/GEO — gene expression ↔ free-text sample descriptions

Training ran for 4 epochs on a single H100 GPU, using AdamW with a learning rate of 1e-5, a cosine schedule with 3% warmup, and a batch size of 512. This checkpoint reflects epoch 3, global step 14624.
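To make the schedule concrete, here is a minimal sketch of a cosine schedule with 3% warmup. It assumes linear warmup and decay to zero, neither of which this card specifies, and `TOTAL_STEPS` is an illustrative value rather than the actual run length.

```python
import math

BASE_LR = 1e-5          # from this card
WARMUP_FRACTION = 0.03  # from this card


def lr_at_step(step, total_steps, base_lr=BASE_LR, warmup_fraction=WARMUP_FRACTION):
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


TOTAL_STEPS = 10_000  # illustrative only, not the real number of training steps
for step in (0, 299, 300, 5_150, 10_000):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```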

Evaluation

Reported AUROC on cell-type benchmarks (mean across cell types):

Benchmark	SpatialWhisperer	Best published baseline	Δ rel.
PathoCell	0.630	0.554	+13.7%
Lizard	(see paper)	—	+15.9%
PanNuke	(see paper)	—	+13.7%
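As a sanity check on the table, the sketch below shows how a macro-averaged AUROC and the relative improvement could be computed. It assumes that "mean across cell types" means an unweighted mean of one-vs-rest AUROCs; the paper's exact protocol may differ, and the function names are illustrative.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(scores_by_class, true_classes):
    """Unweighted mean of one-vs-rest AUROCs (assumed reading of 'mean across cell types')."""
    per_class = []
    for name, scores in scores_by_class.items():
        labels = [1 if c == name else 0 for c in true_classes]
        per_class.append(auroc(scores, labels))
    return sum(per_class) / len(per_class)


def relative_improvement(model, baseline):
    """Relative change of model over baseline, in percent."""
    return 100.0 * (model - baseline) / baseline


# PathoCell row: (0.630 - 0.554) / 0.554
print(f"{relative_improvement(0.630, 0.554):+.1f}%")  # prints +13.7%
```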

Modality-pair benchmarks (Tabula Sapiens, HEST-1K, Skin Conditions) confirm the trimodal model retains per-pair performance under low-n subsampling. See the paper for full numbers.

How to use

The checkpoint is a stripped Lightning state dict (~505 MB, 236 tensors covering the trained BioBERT text tower and the three 2048-d projection heads) plus its hyper_parameters block. Foundation model weights are NOT included: the locked UNI2 image encoder and the locked Geneformer transcriptome encoder are re-instantiated at load time from their original providers and remain under their respective licenses. The checkpoint's hyper_parameters.model_config.use_cache = True flag triggers the FrozenCachedModel wrapping, which excludes the locked towers from state_dict during load.
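To verify what a downloaded checkpoint contains, a helper along these lines could be used. It is a sketch that relies only on the standard Lightning checkpoint keys `state_dict` and `hyper_parameters`; the tensor names in the toy example are made up, and the real prefixes in this checkpoint may differ.

```python
from collections import Counter


def summarize_checkpoint(ckpt):
    """Count state-dict entries per top-level prefix and report the use_cache flag."""
    state_dict = ckpt["state_dict"]
    prefixes = Counter(name.split(".", 1)[0] for name in state_dict)
    model_config = ckpt.get("hyper_parameters", {}).get("model_config", {})
    # model_config may be a dict or an object with attributes; handle both.
    if isinstance(model_config, dict):
        use_cache = model_config.get("use_cache")
    else:
        use_cache = getattr(model_config, "use_cache", None)
    return {
        "n_tensors": len(state_dict),
        "by_prefix": dict(prefixes),
        "use_cache": use_cache,
    }


# Toy stand-in for a loaded checkpoint; the key names are hypothetical.
toy_ckpt = {
    "state_dict": {"text_model.layer0.weight": None, "text_projection.weight": None},
    "hyper_parameters": {"model_config": {"use_cache": True}},
}
print(summarize_checkpoint(toy_ckpt))
```

With PyTorch installed, the real file could be loaded with `torch.load("spatialwhisperer.ckpt", map_location="cpu", weights_only=False)` and passed to the helper; if the numbers in this card hold, `n_tensors` should come out as 236. Because `weights_only=False` unpickles arbitrary objects, use it only on checkpoints you trust.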

Loading requires the cellwhisperer code from the model code repository at https://github.com/moritzschaefer/spatialwhisperer, plus the foundation models (UNI2, Geneformer, BioBERT v1.1), which the cellwhisperer setup scripts download.

from cellwhisperer.utils.model_io import load_cellwhisperer_model

model, tokenizer, transcriptome_proc, image_proc = load_cellwhisperer_model(
    model_path="hf://moritzschaefer/spatialwhisperer"
)
# model is a TranscriptomeTextDualEncoderLightning in eval mode

While the repo is private, export a token first:

export HUGGINGFACE_TOKEN=$(pass api_keys/huggingface_write)  # or any read token with access
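If you only want the raw checkpoint file, without going through the cellwhisperer loader, a sketch using `huggingface_hub` could look like this. It assumes that the repository id matches the `hf://` path above and that the file is named as listed under File contents; the helper names are illustrative.

```python
import os


def resolve_token(env_var="HUGGINGFACE_TOKEN"):
    """Return the token from the environment, or None if it is unset or empty."""
    return os.environ.get(env_var) or None


def download_checkpoint(repo_id="moritzschaefer/spatialwhisperer",
                        filename="spatialwhisperer.ckpt"):
    """Fetch the checkpoint into the local Hugging Face cache and return its path."""
    # Imported lazily so resolve_token works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename, token=resolve_token())
```

When `token` is None, `hf_hub_download` falls back to any credentials already stored by the Hugging Face CLI login.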

To compute image–text similarities for zero-shot cell-type annotation, encode patches and class-name strings, then take cosine similarity in the shared 2048-d space. See examples/zero_shot_celltype.py in the model code repository.
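The similarity step itself is simple. The sketch below is a pure-Python illustration in which toy 3-d vectors stand in for real 2048-d embeddings; it is not the code from examples/zero_shot_celltype.py, and the function names are assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two unit-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_annotate(patch_embeddings, class_embeddings, class_names):
    """For each patch, return (best class name, cosine similarity to it)."""
    results = []
    for patch in patch_embeddings:
        sims = [cosine_similarity(patch, c) for c in class_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((class_names[best], sims[best]))
    return results


# Toy vectors; in practice these come from the image and text towers.
class_names = ["lymphocyte", "epithelial cell"]
class_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patch_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
print(zero_shot_annotate(patch_embeddings, class_embeddings, class_names))
```

For large batches, the same computation is usually done as one matrix product between normalized patch and class embedding matrices.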

Intended use & limitations

Intended. Research on multimodal histopathology, cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.

Not intended. Clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.

Known limitations.

Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
The image tower (UNI2) is locked, so performance is likely to be lower on tissue types that are poorly represented in UNI2's pretraining data.

File contents

spatialwhisperer.ckpt — Lightning checkpoint (state_dict + hyper_parameters; optimizer/scheduler state stripped).
README.md — this card.

Citation

@inproceedings{schaefer2026spatialwhisperer,
  title  = {Trimodal Learning Enhances Zero-Shot Histopathology Annotation},
  author = {Schaefer, Moritz and others},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year   = {2026},
}

License

CC BY-NC 4.0 (research use). Foundation model weights (UNI2, Geneformer, BioBERT) carry their own licenses; please consult upstream repositories.
