RadGrounder
RadGrounder is a PaliGemma‑2 (3B)–based multi‑task vision–language model for radiology that jointly performs report generation, visual question answering, and spatial grounding (bounding‑box detection or segmentation) on CT and MR 2D slices, in English and German. It is trained on RefRad2D, a large‑scale bilingual corpus derived from clinical routine with grounding labels generated automatically (TotalSegmentator + LLM‑based keyword extraction); the clinical dataset itself is not released.
This repository holds the released checkpoints.
- 🌐 Project page: https://radgrounder.github.io
- 💻 Code, training, and evaluation: https://github.com/lmb-freiburg/RadGrounder
Contents
| Subfolder | What it is | Grounding |
|---|---|---|
| `detection/` | RadGrounder, token‑based bounding‑box grounding | boxes generated as text tokens |
| `segmentation/` | RadGrounder + lightweight mask decoder | `<seg>` spans → binary masks |
| `siglip/` | the fine‑tuned SigLIP vision encoder (.ckpt) | n/a (only needed to train new models; the two checkpoints above already embed their encoder) |
Both checkpoints use a frozen fine‑tuned SigLIP vision encoder. Each model folder also
contains its training_config.json (the exact training recipe).
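To see how the two recipes differ, the shipped `training_config.json` files can be read and compared. The sketch below assumes each file is a JSON object at the top level (the schema is not documented here) and that the checkpoints were downloaded to `models/`; `load_training_config` and `diff_configs` are hypothetical helper names, not part of the RadGrounder code.

```python
import json
from pathlib import Path


def load_training_config(model_dir: str) -> dict:
    """Read training_config.json from a checkpoint folder such as models/detection."""
    path = Path(model_dir) / "training_config.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def diff_configs(a: dict, b: dict) -> dict:
    """Map each top-level key whose value differs to the pair (value in a, value in b).

    A key missing from one side shows up with None on that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Assumes the checkpoints were downloaded to ./models beforehand.
    det = load_training_config("models/detection")
    seg = load_training_config("models/segmentation")
    for key, (det_value, seg_value) in diff_configs(det, seg).items():
        print(f"{key}: detection={det_value!r} segmentation={seg_value!r}")
```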
Loading
hf download lmb-freiburg/radgrounder --local-dir models # or --include "detection/*"
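The same download can be scripted from Python. This is a minimal sketch using `snapshot_download` from `huggingface_hub`; the wrapper names `allow_patterns_for` and `download_checkpoints` are illustrative and not part of the RadGrounder code.

```python
from typing import List, Optional


def allow_patterns_for(subfolder: Optional[str]) -> Optional[List[str]]:
    """Return glob patterns limiting the download to one subfolder (None = whole repo)."""
    if subfolder is None:
        return None
    return [f"{subfolder.strip('/')}/*"]


def download_checkpoints(subfolder: Optional[str] = None, local_dir: str = "models") -> str:
    """Download the repository, or a single subfolder of it, into local_dir."""
    # Imported lazily so allow_patterns_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="lmb-freiburg/radgrounder",
        local_dir=local_dir,
        allow_patterns=allow_patterns_for(subfolder),
    )


if __name__ == "__main__":
    # Fetch only the detection checkpoint; pass None to fetch everything.
    print(download_checkpoints("detection"))
```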
The detection model is a stock PaliGemma‑2 checkpoint and loads with unmodified transformers. The
segmentation model uses a custom GroundedGemmaForConditionalGeneration class, so it must be
loaded through the RadGrounder code (run from
radgrounder/grounded_gemma/, as the repo's run_eval_*.sh scripts do):
# Detection: stock transformers
from transformers import PaliGemmaForConditionalGeneration
detection_model = PaliGemmaForConditionalGeneration.from_pretrained("models/detection")
# Segmentation: needs the RadGrounder code on the import path
from modeling_groundedgemma import GroundedGemmaForConditionalGeneration
segmentation_model = GroundedGemmaForConditionalGeneration.from_pretrained("models/segmentation")
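The detection checkpoint emits boxes as text tokens, so its decoded output has to be parsed before boxes can be drawn. The sketch below assumes RadGrounder keeps the stock PaliGemma convention: four `<locNNNN>` tokens per box in `y_min, x_min, y_max, x_max` order, each a bin index from 0 to 1023, followed by a label. This model card does not confirm that convention, and `Box` and `parse_boxes` are hypothetical names, so check the RadGrounder code for the exact output format.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Four consecutive location tokens, then free text up to the next token or ';'.
_LOC_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]*)")


def parse_boxes(text: str, width: int, height: int) -> List[Box]:
    """Convert PaliGemma-style location tokens into pixel-space boxes."""
    boxes = []
    for match in _LOC_BOX.finditer(text):
        # Token order is y_min, x_min, y_max, x_max; bins are scaled by 1/1024.
        y0, x0, y1, x1 = (int(v) / 1024 for v in match.groups()[:4])
        label = match.group(5).strip()
        boxes.append(Box(label, x0 * width, y0 * height, x1 * width, y1 * height))
    return boxes


if __name__ == "__main__":
    # Illustrative string, not real model output.
    decoded = "<loc0256><loc0128><loc0512><loc0768> liver"
    for box in parse_boxes(decoded, width=512, height=512):
        print(box)
```

Returning the boxes in `x_min, y_min, x_max, y_max` pixel order matches what most plotting and image libraries expect, which is why the sketch swaps the axes on the way out.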
Results (external benchmarks)
On Slake and VQA‑RAD, RadGrounder is competitive with specialized medical VLMs (e.g. Slake F1 87.7 / closed accuracy 90.3; highest VQA‑RAD open F1 among compared methods), while adding verifiable spatial grounding at no cost to VQA/report quality. See the paper for full tables and 95% bootstrap confidence intervals.
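For readers unfamiliar with bootstrap confidence intervals, the sketch below shows a generic percentile bootstrap over per-sample scores (for example 1.0 for a correct closed answer and 0.0 otherwise). It is a generic illustration, not the paper's evaluation code: the number of resamples, the percentile method, and the name `bootstrap_ci` are assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-sample scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Illustrative data: 90 correct and 10 incorrect answers.
    per_sample = [1.0] * 90 + [0.0] * 10
    print(bootstrap_ci(per_sample))
```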
License
Derived from Google's PaliGemma‑2 / Gemma; use is governed by the Gemma license. Released for research.
Citation
Ging, Salcan, Schirrmeister, Arnold, Kotter, Bozorgtabar, Brox. Scalable Training of Spatially Grounded 2D Vision–Language Models for Radiology.
(Equal contribution: S. Ging, Y. Salcan.)