SpatialWhisperer
SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions into a shared 2048-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.
This repository hosts the main checkpoint (seed=0) from the ICML 2026 paper "Trimodal Learning Enhances Zero-Shot Histopathology Annotation", where the model appears under the anonymized placeholder name \ourmethod.
Model architecture
Three encoders project into a shared embedding space:
| Modality | Encoder | Freezing |
|---|---|---|
| Image (H&E) | UNI2 | locked |
| Transcriptome | Geneformer (12L) | locked |
| Text | BioBERT v1.1 | unlocked |
Following LiT convention, the freezing pattern is LUL (image locked, text unlocked, transcriptome locked). Only the text tower and the three projection heads are trained. Projection dimension is 2048.
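As a reading aid, the LUL pattern can be decoded into per-tower trainable flags. This is a minimal sketch, not code from the cellwhisperer repository: the function name and the `TOWER_ORDER` constant are hypothetical, and the tower order (image, text, transcriptome) is taken from the sentence above. The projection heads are trained regardless and are not part of the pattern.

```python
# Tower order as described in this card: image, text, transcriptome.
TOWER_ORDER = ("image", "text", "transcriptome")


def decode_freezing_pattern(pattern: str) -> dict:
    """Map a LiT-style pattern such as "LUL" to {tower_name: is_trainable}.

    "L" means locked (frozen), "U" means unlocked (trained).
    """
    if len(pattern) != len(TOWER_ORDER):
        raise ValueError(f"expected {len(TOWER_ORDER)} flags, got {len(pattern)}")
    trainable = {}
    for tower, flag in zip(TOWER_ORDER, pattern.upper()):
        if flag not in ("L", "U"):
            raise ValueError(f"unknown flag {flag!r}; use 'L' or 'U'")
        trainable[tower] = flag == "U"
    return trainable


flags = decode_freezing_pattern("LUL")
print(flags)  # {'image': False, 'text': True, 'transcriptome': False}
```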
Training data
Three paired datasets cover the three modality pairs:
- HEST-1K — H&E ↔ spatial gene expression (Visium-style spots)
- cellxgene_census — gene expression ↔ free-text cell/sample metadata
- ARCHS4/GEO — gene expression ↔ free-text sample descriptions
Training ran for 4 epochs with AdamW (learning rate 1e-5, cosine schedule with 3% warmup, batch size 512) on a single H100 GPU. This checkpoint was saved at epoch 3, global step 14624.
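For orientation, a cosine schedule with 3% warmup could look like the sketch below. Several details are assumptions not stated in this card: that warmup is linear from zero, that the decay ends at zero, and that the 3% refers to total optimizer steps. The `total_steps` value in the example is purely illustrative and is not the actual training length.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.03):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to peak_lr over the first
    warmup_frac of training, then a half-cosine decay down to 0.
    """
    warmup_steps = max(1, int(round(total_steps * warmup_frac)))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Illustrative run length only (hypothetical value).
total_steps = 20_000
for step in (0, 300, 600, 10_300, 20_000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The peak of 1e-5 is reached at the end of warmup, and the rate then falls monotonically for the rest of training.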
Evaluation
Reported AUROC on cell-type benchmarks (mean across cell types):
| Benchmark | SpatialWhisperer | Best published baseline | Δ rel. |
|---|---|---|---|
| PathoCell | 0.630 | 0.554 | +13.7% |
| Lizard | (see paper) | — | +15.9% |
| PanNuke | (see paper) | — | +13.7% |
Modality-pair benchmarks (Tabula Sapiens, HEST-1K, Skin Conditions) confirm the trimodal model retains per-pair performance under low-n subsampling. See the paper for full numbers.
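The Δ rel. column in the table above is consistent with a relative improvement over the baseline, (model − baseline) / baseline. A short check against the PathoCell row, the only row with both absolute values listed here (the helper name is illustrative):

```python
def relative_improvement(model_score, baseline_score):
    """Relative gain of the model over the baseline, as a fraction."""
    return (model_score - baseline_score) / baseline_score


# PathoCell AUROC values from the table above.
delta = relative_improvement(0.630, 0.554)
print(f"{delta:+.1%}")  # +13.7%
```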
How to use
The checkpoint is a stripped Lightning state-dict (~505 MB, 236 tensors covering the trained BioBERT text tower and the three 2048-d projection heads) plus its hyper_parameters block. Foundation model weights are NOT included: the locked UNI2 image encoder and the locked Geneformer transcriptome encoder are re-instantiated at load time from their original providers and remain under their respective licenses. The checkpoint sets hyper_parameters.model_config.use_cache = True, which triggers the FrozenCachedModel wrapping that excludes the locked towers from state_dict during load.
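To sanity-check a downloaded file against the description above, the checkpoint dictionary can be summarized by its top-level key prefixes. This is a sketch under assumptions: `summarize_checkpoint` is a hypothetical helper, and the key names in the toy dictionary are invented for illustration and are not the real tensor names. Only the top-level state_dict and hyper_parameters entries follow the standard Lightning checkpoint layout.

```python
from collections import Counter


def summarize_checkpoint(ckpt: dict) -> dict:
    """Count tensors and group state_dict keys by their first dotted prefix."""
    state_dict = ckpt["state_dict"]
    prefix_counts = Counter(key.split(".", 1)[0] for key in state_dict)
    return {
        "num_tensors": len(state_dict),
        "tensors_per_prefix": dict(prefix_counts),
        "hyper_parameter_keys": sorted(ckpt.get("hyper_parameters", {})),
    }


# Toy stand-in for a loaded checkpoint; key names are hypothetical.
toy_ckpt = {
    "state_dict": {
        "text_tower.layer0.weight": [0.0],
        "text_tower.layer0.bias": [0.0],
        "image_projection.weight": [0.0],
    },
    "hyper_parameters": {"model_config": {"use_cache": True}},
}
print(summarize_checkpoint(toy_ckpt))
```

With the real file, the dictionary would typically come from `torch.load("spatialwhisperer.ckpt", map_location="cpu")`. Recent PyTorch versions may need `weights_only=False` for a checkpoint that carries a hyper_parameters block, which should only be used for files from a trusted source. If the description above holds, `num_tensors` should be 236.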
Loading requires the cellwhisperer code at https://github.com/moritzschaefer/spatialwhisperer (model code) and the foundation models (UNI2, Geneformer, BioBERT v1.1), which are downloaded by the cellwhisperer setup scripts.
```python
from cellwhisperer.utils.model_io import load_cellwhisperer_model

model, tokenizer, transcriptome_proc, image_proc = load_cellwhisperer_model(
    model_path="hf://moritzschaefer/spatialwhisperer"
)
# model is a TranscriptomeTextDualEncoderLightning in eval mode
```
While the repo is private, export a token first:
```bash
export HUGGINGFACE_TOKEN=$(pass api_keys/huggingface_write)  # or any read token with access
```
To compute image–text similarities for zero-shot cell-type annotation, encode patches and class-name strings, then take cosine similarity in the shared 2048-d space. See examples/zero_shot_celltype.py in the model code repository.
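The scoring step described above can be sketched in plain Python. This is not the repository's `examples/zero_shot_celltype.py`: the function names are hypothetical, the class labels are illustrative, and toy 4-d vectors stand in for the real 2048-d embeddings that the image and text towers would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_annotate(patch_embedding, class_embeddings):
    """Return the best-matching class label and all per-class scores."""
    scores = {
        label: cosine_similarity(patch_embedding, embedding)
        for label, embedding in class_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (hypothetical values); real ones come from the model.
class_embeddings = {
    "lymphocyte": [1.0, 0.0, 0.0, 0.0],
    "epithelial cell": [0.0, 1.0, 0.0, 0.0],
    "fibroblast": [0.0, 0.0, 1.0, 0.0],
}
patch_embedding = [0.9, 0.1, 0.2, 0.0]

label, scores = zero_shot_annotate(patch_embedding, class_embeddings)
print(label)  # lymphocyte
```

Because cosine similarity normalizes both vectors, the ranking depends only on direction in the shared space, not on embedding magnitude.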
Intended use & limitations
Intended. Research on multimodal histopathology, cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.
Not intended. Clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.
Known limitations.
- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining will be lower.
File contents
- spatialwhisperer.ckpt: Lightning checkpoint (state_dict + hyper_parameters; optimizer/scheduler state stripped).
- README.md: this card.
Citation
```bibtex
@inproceedings{schaefer2026spatialwhisperer,
  title     = {Trimodal Learning Enhances Zero-Shot Histopathology Annotation},
  author    = {Schaefer, Moritz and others},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
}
```
License
CC BY-NC 4.0 (research use). Foundation model weights (UNI2, Geneformer, BioBERT) carry their own licenses; please consult upstream repositories.