RadGrounder

RadGrounder is a PaliGemma‑2 (3B)–based multi‑task vision–language model for radiology that jointly performs report generation, visual question answering, and spatial grounding (bounding‑box detection or segmentation) on CT and MR 2D slices, in English and German. It is trained on RefRad2D, a large‑scale bilingual corpus derived from clinical routine with grounding labels generated automatically (TotalSegmentator + LLM‑based keyword extraction); the clinical dataset itself is not released.

This repository holds the released checkpoints.

Contents

| Subfolder | What it is | Grounding |
| --- | --- | --- |
| detection/ | RadGrounder, token‑based bounding‑box grounding | boxes generated as text tokens |
| segmentation/ | RadGrounder + lightweight mask decoder | <seg> spans → binary masks |
| siglip/ | the fine‑tuned SigLIP vision encoder (.ckpt); only needed to train new models, since the two checkpoints above already embed their encoder | none |

Both checkpoints use a frozen fine‑tuned SigLIP vision encoder. Each model folder also contains its training_config.json (the exact training recipe).

Loading

hf download lmb-freiburg/radgrounder --local-dir models   # or --include "detection/*"

The detection model is a stock PaliGemma‑2 and loads with vanilla transformers. The segmentation model uses a custom GroundedGemmaForConditionalGeneration class, so it must be loaded with the RadGrounder code. Run the snippet from radgrounder/grounded_gemma/, as the repo's run_eval_*.sh scripts do:

# detection — stock transformers
from transformers import PaliGemmaForConditionalGeneration
m = PaliGemmaForConditionalGeneration.from_pretrained("models/detection")

# segmentation — needs the RadGrounder code on the path
from modeling_groundedgemma import GroundedGemmaForConditionalGeneration
m = GroundedGemmaForConditionalGeneration.from_pretrained("models/segmentation")
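
Because the detection checkpoint emits boxes as text tokens, generated text has to be decoded back into pixel coordinates. The sketch below assumes RadGrounder keeps the stock PaliGemma detection format, i.e. four `<locNNNN>` tokens per box in the order y_min, x_min, y_max, x_max on a 1024‑bin grid, followed by a label, with objects separated by `;`. That assumption is not confirmed by this card, so check it against training_config.json or the RadGrounder code. The helper name `parse_boxes` is hypothetical and not part of the repository.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# Assumption: stock PaliGemma layout "<locY0><locX0><locY1><locX1> label ; ...".
_BOX_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]*)"
)


def parse_boxes(text: str, width: int, height: int) -> List[Tuple[str, Box]]:
    """Convert location tokens in `text` to labelled pixel-space boxes."""
    boxes = []
    for y0, x0, y1, x1, label in _BOX_PATTERN.findall(text):
        # Each token value is a bin index in [0, 1023]; divide by 1024 to normalize.
        box = (
            int(x0) / 1024 * width,
            int(y0) / 1024 * height,
            int(x1) / 1024 * width,
            int(y1) / 1024 * height,
        )
        boxes.append((label.strip(), box))
    return boxes
```

For example, `parse_boxes("<loc0256><loc0128><loc0768><loc0512> liver", 512, 512)` yields a single box labelled `liver` spanning x 64 to 256 and y 128 to 384.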

Results (external benchmarks)

On Slake and VQA‑RAD, RadGrounder is competitive with specialized medical VLMs (e.g. Slake F1 87.7 / closed accuracy 90.3; highest VQA‑RAD open F1 among compared methods), while adding verifiable spatial grounding at no cost to VQA/report quality. See the paper for full tables and 95% bootstrap confidence intervals.
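
To show what a 95% bootstrap confidence interval means for a per‑sample metric such as closed‑question accuracy, here is a minimal percentile‑bootstrap sketch. This card does not describe the paper's exact resampling procedure, so the percentile method, the number of resamples, and the function name `bootstrap_ci` are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    scores: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over samples."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute the metric.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(scores) / n, means[lo_idx], means[hi_idx]
```

Here `scores` would hold one value per test question (1.0 for a correct answer, 0.0 otherwise), and the returned interval reflects sampling variability of the test set only.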

License

Derived from Google's PaliGemma‑2 / Gemma; use is governed by the Gemma license. Released for research.

Citation

Ging, Salcan, Schirrmeister, Arnold, Kotter, Bozorgtabar, Brox. Scalable Training of Spatially Grounded 2D Vision–Language Models for Radiology.

(Equal contribution: S. Ging, Y. Salcan.)
