SEER: Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation

📊 Model & Dataset Description

SEER is a vision-language reasoning model designed for robust free-text promptable 3D medical image segmentation. It grounds clinical language in image evidence, evolves reusable reasoning skills, and produces an executable target specification for a downstream segmentation backbone.

The model follows a structured SEER reasoning format:

<evidence>
Image-grounded observations about the visible anatomical or pathological target.
</evidence>

<rationale>
The selected reasoning skill, if a skill bank is provided, followed by the reasoning process that maps the raw clinical request and image evidence to a normalized target specification.
</rationale>

<answer>
The executable target specification for the downstream segmentation backbone.
</answer>
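
As a rough illustration of how downstream code might consume this format, the sketch below splits a response into its three tagged sections. Only the tag names come from the format above. The `SeerTrace` class, the `parse_seer_output` function, and the example response text are hypothetical and are not part of the SEER release.

```python
import re
from dataclasses import dataclass

# Tag names follow the SEER format; the rest of this module is illustrative.
_SECTION = re.compile(r"<(evidence|rationale|answer)>\s*(.*?)\s*</\1>", re.DOTALL)


@dataclass
class SeerTrace:
    evidence: str
    rationale: str
    answer: str


def parse_seer_output(text: str) -> SeerTrace:
    """Split a SEER-style response into its three tagged sections."""
    sections = {}
    for tag, body in _SECTION.findall(text):
        # Keep the first occurrence if a tag appears more than once.
        sections.setdefault(tag, body)
    missing = [t for t in ("evidence", "rationale", "answer") if t not in sections]
    if missing:
        raise ValueError(f"missing section(s): {', '.join(missing)}")
    return SeerTrace(**sections)


# Hypothetical response text, written only to exercise the parser.
example = """<evidence>
Ring-enhancing lesion in the left frontal lobe.
</evidence>
<rationale>
"Brain met" is shorthand for brain metastasis.
</rationale>
<answer>
brain metastasis
</answer>"""

trace = parse_seer_output(example)
print(trace.answer)
```

Raising on a missing section, rather than returning empty strings, keeps a malformed response from being passed silently to the segmentation backbone.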

SEER-Trace is the grounded reasoning dataset used to train the SEER model. It is curated from established 3D medical segmentation benchmarks and augments each case with clinician-like free-text requests and structured reasoning traces.

The released SEER-Trace split is intended for evaluation and covers two settings:

Dataset	Modality	Evaluation Role
BrainMetShare	MRI	Partial-OOD / domain-shift evaluation: brain anatomy is within the seen anatomical domain, while the institutional sources and target labels are outside the training coverage.
PENGWIN	CT	Strict OOD evaluation: both the pelvic anatomy and pelvic bone target labels are absent from SEER-Trace reasoning supervision and target coverage.

✨ Key Features

Free-text clinical request robustness: Handles linguistic variations such as synonyms, abbreviations, and high-level descriptions of clinical intent
Multi-modality support: Works across CT and MRI imaging modalities
Image-grounded reasoning: Identifies image-grounded evidence and uses it to resolve the clinical request
Evolving reasoning skills: Distills high-reward reasoning traces into reusable skills and continuously updates the skill bank according to each skill’s utility
Backbone-independent gains: Shows consistent robustness improvements across different downstream 3D segmentation backbones
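
The skill-evolution idea in the list above can be pictured with a toy skill bank. This is a minimal sketch under stated assumptions, not the method from the paper: the reward threshold, the exponential-moving-average utility update, the capacity-based pruning, and all names (`Skill`, `SkillBank`, the example skill names) are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Skill:
    name: str
    description: str
    utility: float = 0.0
    uses: int = 0


@dataclass
class SkillBank:
    capacity: int = 8      # hypothetical cap on stored skills
    momentum: float = 0.9  # hypothetical smoothing factor for utility
    skills: Dict[str, Skill] = field(default_factory=dict)

    def add(self, name: str, description: str, reward: float,
            min_reward: float = 0.8) -> bool:
        """Store a skill distilled from a trace only if its reward is high."""
        if reward < min_reward or name in self.skills:
            return False
        self.skills[name] = Skill(name, description, utility=reward)
        self._prune()
        return True

    def update(self, name: str, reward: float) -> None:
        """Blend the latest reward into the skill's running utility."""
        skill = self.skills[name]
        skill.utility = self.momentum * skill.utility + (1 - self.momentum) * reward
        skill.uses += 1

    def _prune(self) -> None:
        # Drop the lowest-utility skills until the bank fits its capacity.
        while len(self.skills) > self.capacity:
            worst = min(self.skills.values(), key=lambda s: s.utility)
            del self.skills[worst.name]

    def top(self, k: int = 3) -> List[Skill]:
        return sorted(self.skills.values(), key=lambda s: s.utility, reverse=True)[:k]


bank = SkillBank(capacity=2)
bank.add("expand_abbreviation", "Expand clinical shorthand before matching.", reward=0.90)
bank.add("resolve_laterality", "Use image evidence to fix left/right.", reward=0.85)
bank.update("resolve_laterality", reward=0.0)  # a poor outcome lowers its utility
bank.add("map_intent_to_target", "Map a clinical goal to a structure.", reward=0.95)
print([s.name for s in bank.top()])
```

In this toy run the third skill pushes the bank over capacity, so the skill whose utility dropped after a poor outcome is the one removed.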

🧩 Versions

We release multiple SEER versions (continuously updated) to enable both reproducible research and high-performance downstream applications.

SEER v1.1 (Recommended)

Info: Recommended default version
Contents: SEER-Trace v1.1 and corresponding model weights (LoRA weights)
Training Scale: Trained on all datasets from the paper and additional sources (~33,714 traces in total)
Fine-tuning: LoRA fine-tuning, enabling efficient adaptation while preserving the general capabilities of the Qwen3-VL backbone
Use Case: Recommended for general inference and downstream integration. This version maximizes supervision and concept coverage for stronger general-purpose performance

SEER v1.0 (Deprecated)

Info: This version was used for the experiments in the paper but contains known issues that have been fixed in v1.1. It is not recommended for general use.
Contents: SEER-Trace v1.0 and corresponding model weights (full weights)
Training Scale: Trained on original datasets (~22,330 traces in total)
Fine-tuning: Full-parameter fine-tuning
Use Case: Reproducibility of the results reported in the paper

⚠️ Usage Instructions

This release contains only the VLM reasoning weights. 3D segmentation backbones, such as VoxTell or MedSAM3, should be integrated separately.
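
To make the separation concrete, the sketch below wires a reasoning step to a segmentation step through plain callables. It is an assumption-laden outline, not the official integration: `segment_from_request`, `stub_reasoner`, and `stub_backbone` are hypothetical names, and the stubs stand in for the SEER model and for a real backbone, whose actual interfaces are documented in the GitHub repository.

```python
import re
from typing import Any, Callable, Tuple


def segment_from_request(
    volume: Any,
    request: str,
    reasoner: Callable[[Any, str], str],
    backbone: Callable[[Any, str], Any],
) -> Tuple[str, Any]:
    """Run the reasoner, extract its <answer>, and pass it to the backbone."""
    response = reasoner(volume, request)
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", response, re.DOTALL)
    if match is None:
        raise ValueError("reasoner response has no <answer> section")
    target = match.group(1)
    return target, backbone(volume, target)


def stub_reasoner(volume: Any, request: str) -> str:
    # Placeholder for the VLM: returns a fixed, made-up response.
    return ("<evidence>...</evidence><rationale>...</rationale>"
            "<answer>pelvic bone</answer>")


def stub_backbone(volume: Any, target: str) -> Any:
    # Placeholder for a 3D backbone: returns an all-zero mask of the same shape.
    return [[[0 for _ in row] for row in plane] for plane in volume]


volume = [[[0.1, 0.2], [0.3, 0.4]]]  # tiny nested list standing in for a 3D scan
target, mask = segment_from_request(volume, "segment the pelvis",
                                    stub_reasoner, stub_backbone)
print(target, mask)
```

Passing the reasoner and backbone as callables keeps the reasoning weights independent of any particular segmentation model, which matches how this release is packaged.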

Please refer to our official GitHub repository for detailed instructions on environment setup, weight loading, and inference.

GitHub Repository: SEER on GitHub
Paper: arXiv

🩺 Ethical Considerations

Medical image models can produce plausible but incorrect explanations. Users should treat outputs as research results, not clinical conclusions. Do not use this model to replace professional medical judgment.

📚 Citation

@InProceedings{zhang2026seer,
      author    = { Zhang, Tongrui and Wang, Chenhui and Li, Yongming and Chen, Zhihao and Zhan, Xufeng and Shan, Hongming},
      title     = { Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation },
      booktitle = { Medical Image Computing and Computer Assisted Intervention },
      year      = { 2026 }
}

Model tree for Ztrura/SEER

Base model: Qwen/Qwen3-VL-4B-Instruct (this model is a fine-tune of it)

Paper for Ztrura/SEER

Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation (Paper 2603.08215, published Mar 9)