Jolia: A 3D CT foundation model with anatomical representations
Jolia is a 3D CT foundation model that encodes images into vector representations. Given a whole 3D CT volume, it produces:
- a global embedding (embed_dim = 576), and
- per-organ embeddings: 102 named organ slots produced by organ-query cross-attention pooling, trained to align with per-organ report text.
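To illustrate what "organ-query cross-attention pooling" means in general, here is a minimal NumPy sketch of pooling a set of patch tokens with one query vector per organ slot. It is a simplified single-head version without key/value projections, and the random queries and token count are placeholders; Jolia's actual layer is not described here and may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_pool(tokens, queries):
    """Pool patch tokens (N, d) into one vector per query (Q, d)."""
    d = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)  # (Q, N) query-token affinities
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ tokens                   # (Q, d) weighted averages of tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 192))    # toy patch tokens (assumed count and width)
queries = rng.normal(size=(102, 192))  # one query per organ slot (random here)
pooled = query_pool(tokens, queries)
print(pooled.shape)  # (102, 192)
```

Each organ slot ends up with its own weighted average of the same tokens, which is what lets a single forward pass yield one embedding per organ.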
Installation
pip install torch transformers timm einops numpy safetensors
Quick start
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained("raidium/Jolia", trust_remote_code=True).eval()
# image: a preprocessed CT volume, shape (B, 11, 192, 192, 192); see Preprocessing
with torch.no_grad():
    cls = model(image).pooler_output  # (B, 576) global embedding
Preprocessing
Raw CT volumes must be brought to the Atlas input format, shape
(11, 192, 192, 192): 1.5 mm isotropic spacing, 192³ crop, 11 CT windowing channels.
Grab the bundled preprocessor from the repo:
from huggingface_hub import snapshot_download
import sys
repo = snapshot_download("raidium/Jolia")
sys.path.append(repo)
from preprocessing_jolia import JoliaPreprocessor
pre = JoliaPreprocessor()
# volume: (H, W, D) in Hounsfield units; resolution in mm (row, col, slice)
image = pre(volume, resolution=(0.7, 0.7, 1.0)).unsqueeze(0) # (1, 11, 192, 192, 192)
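For intuition about what the input format involves, the following NumPy sketch shows the three generic ingredients: the grid size after resampling to 1.5 mm, a center crop or pad, and intensity windowing into channels. The window centers and widths, the fill value, and the order of operations are assumptions for illustration only; the bundled JoliaPreprocessor is the reference and should be used for real inputs.

```python
import numpy as np

def apply_window(hu, center, width):
    """Clip Hounsfield units to a window and rescale to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

def resampled_shape(shape, resolution, target_mm=1.5):
    """Voxel grid size after resampling to isotropic target_mm spacing."""
    return tuple(int(round(n * r / target_mm)) for n, r in zip(shape, resolution))

def center_crop_or_pad(vol, size, fill):
    """Center-crop or pad each axis of a 3D volume to `size` voxels."""
    out = np.full((size, size, size), fill, dtype=vol.dtype)
    src, dst = [], []
    for n in vol.shape:
        k = min(n, size)                       # voxels copied along this axis
        s0, d0 = (n - k) // 2, (size - k) // 2
        src.append(slice(s0, s0 + k))
        dst.append(slice(d0, d0 + k))
    out[tuple(dst)] = vol[tuple(src)]
    return out

# Hypothetical (center, width) windows in HU; NOT Jolia's actual 11 windows.
windows = [(40, 400), (-600, 1500), (400, 1800)]

hu = np.full((20, 20, 30), -1000.0)  # toy volume filled with air
hu[5:15, 5:15, 10:20] = 40.0         # a block of soft tissue
cropped = center_crop_or_pad(hu, 16, fill=-1000.0)
channels = np.stack([apply_window(cropped, c, w) for c, w in windows])
print(channels.shape)                                     # (3, 16, 16, 16)
print(resampled_shape((512, 512, 300), (0.7, 0.7, 1.0)))  # (239, 239, 200)
```

With the real settings the same steps would yield 11 channels on a 192³ grid, matching the (11, 192, 192, 192) input shape.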
Working with organ queries (the easy way)
Per-organ embeddings are addressed by name:
# All 102 organs as {name: (B, 576)}
organs = model.encode_organs(image)
# A subset, L2-normalized (cosine-ready)
sub = model.encode_organs(image, organs=["liver", "spleen", "pancreas"], normalize=True)
print(model.organ_slot_names) # the 102 available organ names
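"Cosine-ready" means that, once embeddings are L2-normalized, a plain dot product is the cosine similarity. The sketch below uses random arrays as stand-ins for the liver embeddings of one query scan and five reference scans (the arrays and the planted match are illustrative, not model outputs) and retrieves the most similar reference.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

rng = np.random.default_rng(0)
# Stand-ins for encode_organs(..., organs=["liver"], normalize=True)["liver"]
query_liver = l2_normalize(rng.normal(size=(1, 576)))     # one query scan
database_liver = l2_normalize(rng.normal(size=(5, 576)))  # five reference scans
database_liver[3] = query_liver[0]                        # plant an exact match

similarity = query_liver @ database_liver.T  # (1, 5) cosine similarities
best = int(similarity.argmax(axis=1)[0])
print(best)  # 3
```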
For linear probing, the concatenated normalized feature is one call:
flat = model.extract_flat_feature(image) # (B, 576 * (1 + num_organs))
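The shape (B, 576 * (1 + num_organs)) follows from concatenating the global embedding with every organ embedding. A rough NumPy equivalent is sketched below; the per-block normalization and the ordering of organs are assumptions, and the placeholder organ names are made up, so prefer extract_flat_feature for features that must match the model's own layout.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def flat_feature(global_emb, organ_embs):
    """Concatenate normalized global and per-organ embeddings along the feature axis."""
    parts = [l2_normalize(global_emb)]
    parts += [l2_normalize(organ_embs[name]) for name in sorted(organ_embs)]
    return np.concatenate(parts, axis=-1)

rng = np.random.default_rng(0)
batch, dim, num_organs = 2, 576, 102
global_emb = rng.normal(size=(batch, dim))
# Placeholder organ names; the real names come from model.organ_slot_names.
organ_embs = {f"organ_{i:03d}": rng.normal(size=(batch, dim)) for i in range(num_organs)}

flat = flat_feature(global_emb, organ_embs)
print(flat.shape)  # (2, 59328)
```

With 102 organs this gives 576 × 103 = 59,328 features per scan, which is the input width a linear probe would see.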
Zero-shot classification
Jolia ships with the CLIP text-projection head it was trained with. Pair it
with the text encoder Jolia was trained against (Qwen/Qwen3-Embedding-8B)
to classify a CT against arbitrary text prompts with no fine-tuning.
The text encoder is the heavy piece (~18 GB), so loading it is opt-in.
Jolia bundles a small helper, JoliaTextEncoder, that handles tokenization
and the (attention-mask-aware) last-token pooling the model was trained with.
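As background, attention-mask-aware last-token pooling selects, for each sequence, the hidden state of the last real (non-padding) token, regardless of whether the batch is left- or right-padded. The NumPy sketch below shows the generic technique on toy arrays; it is not the JoliaTextEncoder source, whose internals are not reproduced here.

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token of each sequence.

    hidden: (B, T, D) array; attention_mask: (B, T) with 1 marking real tokens.
    """
    seq_len = attention_mask.shape[1]
    # Index of the last position where the mask is 1 (works for either padding side).
    last = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last]

hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # right-padded: last real token at index 2
                 [0, 1, 1, 1]])  # left-padded: last real token at index 3
pooled = last_token_pool(hidden, mask)
print(pooled.tolist())  # [[6.0, 7.0, 8.0], [21.0, 22.0, 23.0]]
```

Ignoring the mask and always taking position -1 would return a padding token's state for the right-padded row, which is why the mask matters.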
import sys, torch
from huggingface_hub import snapshot_download
from transformers import AutoModel
# 1) Vision: Jolia from the Hub (self-contained, ~89 MB).
jolia = AutoModel.from_pretrained("raidium/Jolia", trust_remote_code=True).eval()
# 2) Text: Qwen3-Embedding-8B + Jolia's bundled JoliaTextEncoder helper.
repo = snapshot_download("raidium/Jolia"); sys.path.append(repo)
from text_encoder_jolia import JoliaTextEncoder
text_encoder = JoliaTextEncoder.from_pretrained(
    "Qwen/Qwen3-Embedding-8B",
    dtype=torch.bfloat16,  # bf16 halves the memory footprint relative to fp32
    device_map="auto",     # or .to("cuda")
).eval()
# 3) Zero-shot classification on a preprocessed CT volume.
prompts = ["a CT showing a liver lesion", "a CT showing pneumonia", "a normal abdominal CT"]
with torch.no_grad():
    text_features = text_encoder(prompts)           # (N, 4096) last-token-pooled
    logits = jolia.zero_shot(image, text_features)  # (B, N), calibrated CLIP logits
    probs = torch.sigmoid(logits)                   # per-pair "is this a match?" probability
# Same output as `MultimodalCLSZeroShotCLIP.get_logits_per_image` in rarm.
# Pass `calibrated=False` if you want raw cosine in [-1, 1] (ranking-only):
cosine = jolia.zero_shot(image, text_features, calibrated=False)
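To make the difference between calibrated logits and raw cosine concrete, here is a NumPy sketch of one plausible computation: project the text features to 576-d, take the cosine with the image embedding, then apply a scalar temperature and bias before the sigmoid. The projection matrix, scale, and bias values are random or made-up placeholders, and the exact parameterization inside Jolia (for example whether the temperature is stored in log space) is an assumption.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def clip_logits(image_emb, text_feat, proj, scale, bias, calibrated=True):
    """Cosine between image embeddings and projected text, optionally calibrated."""
    text_emb = l2_normalize(text_feat @ proj)      # (N, 576)
    cosine = l2_normalize(image_emb) @ text_emb.T  # (B, N), values in [-1, 1]
    return scale * cosine + bias if calibrated else cosine

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(2, 576))                # stand-in for pooler_output
text_feat = rng.normal(size=(3, 4096))               # stand-in for text encoder output
proj = rng.normal(size=(4096, 576)) / np.sqrt(4096)  # placeholder 4096 -> 576 projection
scale, bias = 10.0, -5.0                             # hypothetical temperature and bias

logits = clip_logits(image_emb, text_feat, proj, scale, bias)
probs = sigmoid(logits)
cosine = clip_logits(image_emb, text_feat, proj, scale, bias, calibrated=False)
print(logits.shape, probs.shape)  # (2, 3) (2, 3)
```

Because a positive scale and a shared bias are a monotonic transform, the ranking of prompts is the same for cosine and calibrated logits; calibration only matters when the sigmoid output is read as a probability.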
Per-organ (query-routed) zero-shot
Jolia also ships the ParallelOrganCLIP text head, which was trained against the per-organ findings of each report. This routes a text prompt to one specific organ's query embedding, which is useful when you want to ask "is there a lesion in the liver?" rather than scoring against the whole-volume CLS.
text_features = text_encoder(["a lesion", "looks normal"]) # (N, 4096)
# Score N prompts against a single organ -> calibrated CLIP logits (B, N)
liver_logits = jolia.zero_shot_organ(image, text_features, organ="liver")
liver_probs = torch.sigmoid(liver_logits)
# Score N prompts against many organs at once -> {organ_name: (B, N)}
scores = jolia.zero_shot_organs(
    image, text_features, organs=["liver", "spleen", "kidneys", "pancreas"]
)
# Raw cosine if you only need ranking and don't want the bias offset:
cosine = jolia.zero_shot_organ(image, text_features, organ="liver", calibrated=False)
Each organ has its own trained temperature and bias (the
(200,)-shaped organ_logit_scale / organ_text_bias), automatically applied
when calibrated=True. jolia.organ_slot_names lists the 102 organs that can
be routed. The per-organ head uses a different text projection than the
global one (encode_text vs encode_organ_text), trained on per-organ
findings text.
A runnable, self-contained script is bundled as example_zero_shot.py.
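The per-organ calibration described above can be pictured as indexing two (200,)-shaped vectors by the organ's slot and applying that slot's temperature and bias to the cosine scores. The sketch below uses random placeholder parameters and a hypothetical slot index; the affine form is an assumption, not a copy of the model code.

```python
import numpy as np

NUM_SLOTS, NUM_NAMED = 200, 102  # 102 named organs, remaining slots are padding

def organ_logits(cosine, slot, organ_logit_scale, organ_text_bias):
    """Apply the temperature and bias of one organ slot to cosine scores (B, N)."""
    if not 0 <= slot < NUM_NAMED:
        raise ValueError(f"slot {slot} is not a named organ slot")
    return organ_logit_scale[slot] * cosine + organ_text_bias[slot]

rng = np.random.default_rng(0)
organ_logit_scale = rng.uniform(5.0, 15.0, size=NUM_SLOTS)  # placeholder values
organ_text_bias = rng.uniform(-8.0, -2.0, size=NUM_SLOTS)   # placeholder values
cosine = rng.uniform(-1.0, 1.0, size=(2, 3))                # (B, N) raw cosine scores

slot = 42  # hypothetical slot index for some organ
logits = organ_logits(cosine, slot, organ_logit_scale, organ_text_bias)
print(logits.shape)  # (2, 3)
```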
Model details
| Backbone | MultiModalAtlas: multi-scale 3D ViT, dim=192, heads 6, stages [2, 2, 8] |
| Patch embed | 6×6×6, 11 input channels (CT windowing), merge_ratio = 4³ |
| Global embedding | 576-d |
| Organ queries | 102 slots × 192-d × 3 scales → 576-d |
| Parameters | ~22 M (89 MB safetensors) |
| Input | (B, 11, 192, 192, 192) float32 |
| Training data | INSPECT, CT-RATE, Stanford-Abdominal-CT (chest + abdomen CT) |
| Objectives | Volume–report CLIP + per-organ ParallelOrganCLIP |
| Paired text encoder | Qwen/Qwen3-Embedding-8B (last-token pooling, context length 512) |
| Global text projection | Linear 4096 → 576 (+ scalar temperature + bias); global CLIP head |
| Per-organ text projection | Linear 4096 → 576 (+ per-organ temperature + bias, both (200,)); ParallelOrganCLIP head |
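A few of the numbers in the table can be cross-checked with simple arithmetic. The snippet below assumes float32 weights (4 bytes per parameter), that the 576-d organ embedding is the concatenation of three 192-d per-scale vectors, and that the patch count refers to the patch-embedding grid before any merging.

```python
params = 22_000_000   # "~22 M" from the table, so the result is approximate
bytes_per_param = 4   # float32
size_mb = params * bytes_per_param / 1e6
print(size_mb)  # 88.0, close to the 89 MB safetensors file

scale_dim, num_scales = 192, 3
organ_dim = scale_dim * num_scales
print(organ_dim)  # 576

input_size, patch = 192, 6
patches_per_axis = input_size // patch
print(patches_per_axis, patches_per_axis ** 3)  # 32 32768
```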
The 102 organ-slot names are the alphabetically sorted union of per-organ
report sections across the training datasets; slots 102–199 are unused
padding. Methods like encode_organs expose only the named slots.
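The slot layout described above can be sketched in a few lines: take the union of section names across datasets, sort it, assign indices in order, and pad to a fixed slot count. The dataset contents and the helper name build_slot_table are hypothetical and use a tiny table of 8 slots instead of 200.

```python
def build_slot_table(section_names_per_dataset, num_slots=200):
    """Map organ names to slot indices; unused trailing slots are padding (None)."""
    names = sorted(set().union(*section_names_per_dataset))
    if len(names) > num_slots:
        raise ValueError("more organ names than available slots")
    slot_of = {name: i for i, name in enumerate(names)}
    padded = names + [None] * (num_slots - len(names))
    return slot_of, padded

# Toy per-dataset section names (illustrative only).
datasets = [{"liver", "spleen", "aorta"}, {"liver", "pancreas"}]
slot_of, padded = build_slot_table(datasets, num_slots=8)
print(slot_of)      # {'aorta': 0, 'liver': 1, 'pancreas': 2, 'spleen': 3}
print(padded[4:])   # [None, None, None, None]
```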
Outputs
model(image) returns a JoliaOutput with:
- pooler_output: (B, 576) global embedding,
- organ_queries: (B, num_organs, 576), populated when called with output_organ_queries=True.
Intended use & limitations
⚠️ Research preview. Not a medical device; not for clinical use.
Jolia is a feature extractor for downstream radiology tasks (classification, retrieval, per-organ analysis) via linear probing or fine-tuning. It is trained on adult chest/abdominal CT and will not generalize to other modalities or unusual acquisition protocols. It does not produce diagnoses and must not be used for clinical decision-making.
Citation
@misc{raidium_jolia,
title = {Jolia: a 3D CT Atlas foundation model with per-organ queries},
author = {Raidium},
year = {2026},
howpublished = {\url{https://huggingface.co/raidium/Jolia}}
}