ClinSeek-35B-A3B

ClinSeek-35B-A3B is the open-source model released with ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning. We trained it by supervised fine-tuning (SFT) of Qwen/Qwen3.5-35B-A3B on ClinSeekAgent trajectories generated by Claude Opus 4.6.

ClinSeekAgent studies a clinical reasoning setting in which evidence is not handed to the model in a pre-curated prompt. Instead, an agent must actively retrieve patient-specific evidence from raw EHR tables, consult external medical knowledge when needed, and synthesize the acquired evidence into a final decision. ClinSeek-35B-A3B is trained to imitate this long-horizon evidence-seeking behavior in native tool-call format.

Figure: ClinSeek-35B-A3B performance on AgentEHR-Bench.

Release Information

| Item | Value |
| --- | --- |
| Model | ClinSeek-35B-A3B |
| Base model | Qwen/Qwen3.5-35B-A3B |
| Training method | Supervised fine-tuning |
| Teacher model | Claude Opus 4.6 |
| Training signal | ClinSeekAgent evidence-seeking trajectories |
| Primary target setting | Agentic EHR evidence seeking |
| Technical report | https://arxiv.org/abs/2605.20176 |
| Code | https://github.com/UCSC-VLAA/ClinSeekAgent |
| Benchmark metadata | https://huggingface.co/datasets/UCSC-VLAA/ClinSeek-Bench |
| Project page | https://ucsc-vlaa.github.io/ClinSeekAgent/ |

Training Data And Objective

ClinSeek-35B-A3B validates ClinSeekAgent as a training-time pipeline. Claude Opus 4.6 is used as the teacher model to generate ClinSeekAgent trajectories from the training split of the text-based benchmark. The student model is then fine-tuned with supervised learning on the resulting trajectories.

The trajectories are rendered in native tool-call format with <tool_call> / <tool_response> turns, so the student learns how to search the EHR rather than only imitating final answers.
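
As a rough illustration of how such turns look to downstream code, the sketch below extracts tool calls from one rendered assistant turn. It assumes each `<tool_call>` block wraps a single JSON object with `name` and `arguments` keys. That payload layout, the `run_sql` tool name, and the example query are assumptions made for illustration and are not confirmed by this card; the chat template shipped with the tokenizer defines the actual format.

```python
import json
import re

# Captures the body of each <tool_call> ... </tool_call> block, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(assistant_text: str) -> list[dict]:
    """Return the JSON payloads found inside <tool_call> blocks.

    Assumption: each block holds one JSON object. Blocks that do not
    parse as a JSON object are skipped instead of raising.
    """
    calls = []
    for body in TOOL_CALL_RE.findall(assistant_text):
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict):
            calls.append(payload)
    return calls


# Hypothetical assistant turn; the tool name and SQL are made up for illustration.
example_turn = (
    "I need the patient's recent lab results.\n"
    "<tool_call>\n"
    '{"name": "run_sql", "arguments": {"query": "SELECT * FROM labevents LIMIT 5"}}\n'
    "</tool_call>"
)

calls = extract_tool_calls(example_turn)
print(calls[0]["name"], calls[0]["arguments"]["query"])
```

A parser like this is only useful for inspecting saved trajectories; during evaluation the serving backend and the ClinSeekAgent drivers handle tool-call parsing.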

Training configuration:

| Component | Configuration |
| --- | --- |
| Base model | Qwen3.5-35B-A3B |
| Training objective | SFT on ClinSeekAgent trajectories |
| Training / validation size | 7,204 / 147 examples |
| Maximum sequence length | 52,000 tokens |
| Training epochs | 3 |
| Global batch size | 32 |
| Micro batch size | 1 per GPU |
| Optimizer | Megatron optimizer with CPU offload |
| Learning rate | 2e-5 |
| Minimum learning rate | 2e-6 |
| Learning rate schedule | Cosine decay with 10 warmup steps |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 |
| Backend | Megatron + mbridge |
| Hardware | 8 H200 GPUs |
| Tensor / expert / pipeline parallelism | TP=2, EP=8, PP=1 |
| Random seed | 42 |
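
To make the schedule concrete, the sketch below derives an approximate step count from the listed dataset size, batch size, and epochs, and evaluates a cosine schedule with warmup at a few steps. Two things are assumptions rather than facts from this card: that the last partial batch of each epoch is dropped, and that warmup is linear. The exact Megatron implementation may differ in both respects.

```python
import math

# Values copied from the training configuration.
TRAIN_EXAMPLES = 7204
GLOBAL_BATCH = 32
EPOCHS = 3
PEAK_LR = 2e-5
MIN_LR = 2e-6
WARMUP_STEPS = 10

# Assumption: the last partial batch of each epoch is dropped.
steps_per_epoch = TRAIN_EXAMPLES // GLOBAL_BATCH
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (assumed form)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


print(steps_per_epoch, total_steps)
for step in (0, WARMUP_STEPS, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run has 225 optimizer steps per epoch and 675 in total, so the 10 warmup steps cover only a small fraction of training.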

This release contains the model weights and tokenizer files. It does not redistribute protected clinical source data, patient-level databases, private trajectories, experiment logs, or raw MIMIC-derived records.

Evaluation

We evaluate ClinSeek-35B-A3B on the five-task AgentEHR-Bench setting. SFT raises the average F1 of the Qwen3.5-35B-A3B base model from 22.1 to 34.0 (+11.9 points), with gains on four of the five tasks and a small drop on Transfers (18.1 to 16.7). ClinSeek-35B-A3B achieves the strongest performance among the open-source models evaluated.

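| Model | Diagnoses | Labs | Microbiology | Procedures | Transfers | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-35B-A3B (base) | 36.6 | 17.7 | 16.2 | 21.9 | 18.1 | 22.1 |
| ClinSeek-35B-A3B | 55.4 | 38.5 | 27.6 | 31.7 | 16.7 | 34.0 |
| Delta | +18.8 | +20.8 | +11.4 | +9.8 | -1.4 | +11.9 |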
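
The Avg. and Delta entries can be reproduced from the per-task scores. The short check below assumes that Avg. is the unweighted mean of the five task F1 scores, which is consistent with the reported numbers but is not stated explicitly here.

```python
TASKS = ["Diagnoses", "Labs", "Microbiology", "Procedures", "Transfers"]

# Per-task F1 scores copied from the results table.
base = [36.6, 17.7, 16.2, 21.9, 18.1]
clinseek = [55.4, 38.5, 27.6, 31.7, 16.7]


def macro_avg(scores: list[float]) -> float:
    """Unweighted mean over tasks, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


# Per-task change after SFT, rounded to one decimal.
deltas = {task: round(c - b, 1) for task, b, c in zip(TASKS, base, clinseek)}
avg_delta = round(macro_avg(clinseek) - macro_avg(base), 1)

print(macro_avg(base), macro_avg(clinseek), avg_delta)
print(deltas)
```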

Our analysis shows that the distilled model learns a different tool-use policy, not just a different final-answer prior. On the same 500 AgentEHR-Bench questions, free-form SQL calls rise from 649 for the base model to 3,932 after SFT (roughly a sixfold increase), suggesting that ClinSeekAgent trajectories teach the student to treat the EHR as a programmable database.
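
To illustrate what a free-form SQL call means in practice, here is a toy sketch using an in-memory SQLite database. The table name, columns, rows, and the `run_sql` helper are all invented for this example; they are not the benchmark schema or the ClinSeekAgent tool interface.

```python
import sqlite3

# Toy in-memory database; schema and rows are synthetic and purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE labevents ("
    "subject_id INTEGER, label TEXT, charttime TEXT, valuenum REAL)"
)
conn.executemany(
    "INSERT INTO labevents VALUES (?, ?, ?, ?)",
    [
        (1, "creatinine", "2100-01-01 08:00", 1.1),
        (1, "creatinine", "2100-01-02 08:00", 1.9),
        (1, "potassium", "2100-01-02 08:00", 4.2),
        (2, "creatinine", "2100-01-01 09:00", 0.8),
    ],
)


def run_sql(query: str, params: tuple = ()) -> list[tuple]:
    """Hypothetical free-form SQL tool: run a query and return all rows."""
    return conn.execute(query, params).fetchall()


# One query answers "which labs does patient 1 have, how many, and the peak value"
# instead of paging through raw rows with fixed lookup tools.
rows = run_sql(
    "SELECT label, COUNT(*), MAX(valuenum) FROM labevents "
    "WHERE subject_id = ? GROUP BY label ORDER BY label",
    (1,),
)
print(rows)
```

The point of the sketch is the access pattern: a single composed query filters, groups, and aggregates, which is the kind of behavior the increase in free-form SQL calls suggests the student has picked up.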

For full evaluation scripts and benchmark reconstruction instructions, see: https://github.com/UCSC-VLAA/ClinSeekAgent.

Usage

Use the checkpoint with a recent transformers release that supports Qwen3.5-MoE models. For the evaluation setting used in this work, serve the model with an OpenAI-compatible backend such as vLLM and run the ClinSeekAgent evaluation drivers.

Basic loading example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCSC-VLAA/ClinSeek-35B-A3B"

# Load the tokenizer and the bfloat16 weights, placing layers on available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": "You are a clinical evidence-seeking assistant.",
    },
    {
        "role": "user",
        "content": "Answer the clinical question using the available evidence.",
    },
]

# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True))

For tool-using evaluation, use the ClinSeekAgent repository rather than a single-turn text generation script. The repository provides the EHR MCP server, tool schemas, prompts, and scoring code expected by this model.
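
For orientation only, the sketch below shows the general shape of one chat request with a tool definition sent to an OpenAI-compatible `/v1/chat/completions` endpoint, such as the one vLLM exposes. The `run_sql` tool schema, the example question, and the local base URL are assumptions for illustration; the real tool schemas and prompts come from the ClinSeekAgent repository, and the server must be started with tool calling enabled as described in the vLLM documentation.

```python
import json
import urllib.request

MODEL_ID = "UCSC-VLAA/ClinSeek-35B-A3B"

# Hypothetical tool schema; the real schemas live in the ClinSeekAgent repository.
RUN_SQL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a read-only SQL query against the EHR database.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}


def build_chat_request(question: str, tools: list[dict]) -> dict:
    """Assemble the JSON body for one chat-completions request with tools."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a clinical evidence-seeking assistant."},
            {"role": "user", "content": question},
        ],
        "tools": tools,
    }


def post_chat_request(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to an OpenAI-compatible server (assumed to be running locally)."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build the request body; sending it requires a running server, so it is not called here.
payload = build_chat_request("Which diagnoses apply to this admission?", [RUN_SQL_TOOL])
print(json.dumps(payload, indent=2))
```

A full agent loop would append each returned tool call and its tool response to `messages` and repeat until the model produces a final answer; the repository's evaluation drivers implement that loop together with scoring.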

Citation

Please cite our ClinSeekAgent technical report if you use this model:

@article{clinseekagent2026,
  title = {ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning},
  year = {2026},
  url = {https://arxiv.org/abs/2605.20176}
}

Also cite the upstream datasets, benchmarks, and base models used in your experiments, including MIMIC, AgentEHR-Bench, and Qwen3.5-35B-A3B where applicable.
