---
license: cc-by-nc-nd-4.0
language:
  - en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - multimodal
  - Pathology
  - arxiv:2505.11404
extra_gated_prompt: >-
  The Patho-R1-7B model and its associated materials are released under the
  CC-BY-NC-ND 4.0 license.  Access is restricted to non-commercial, academic
  research purposes only, with proper citation required.  Any commercial usage,
  redistribution, or derivative work (including training models based on this
  model or generating datasets from its outputs)  is strictly prohibited without
  prior written approval. 

  Users must register with an official institutional email address (generic
  domains such as @gmail, @qq, @hotmail, etc. will not be accepted).  By
  requesting access, you confirm that your information is accurate and current,
  and that you agree to comply with all terms listed herein.  If other members
  of your organization wish to use the model, they must register independently
  and agree to the same terms.
extra_gated_fields:
  Full name (first and last): text
  Institutional affiliation (no abbreviations): text
  Role/Position:
    type: select
    options:
      - Faculty/Principal Investigator
      - PhD Student
      - Postdoctoral Researcher
      - Research Staff
      - Other
  Official institutional email (**must match your Hugging Face primary email; generic domains will be denied**): text
  Intended research use (be specific): text
  I agree to use this model only for non-commercial academic purposes: checkbox
  I agree not to redistribute this model or share it outside of my individual usage: checkbox
  I confirm that all submitted information is accurate and up to date: checkbox
---

# Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner

[arXiv] | [GitHub Repo] | [Cite]

## Introduction📝

While vision-language models have shown impressive progress in general medical domains, pathology remains a challenging subfield due to its high-resolution image requirements and complex diagnostic reasoning.

To address this gap, we introduce Patho-R1-7B, a multimodal pathology reasoner designed to enhance diagnostic understanding through structured reasoning. Patho-R1-7B is trained using a three-stage pipeline:

  1. Continued pretraining on 3.5M pathology figure-caption pairs for domain knowledge acquisition
  2. Supervised fine-tuning on 500k expert-annotated Chain-of-Thought samples to encourage reasoning
  3. Reinforcement learning with Group Relative Policy Optimization (GRPO) to refine response quality (a minimal sketch of the group-relative advantage follows this list)
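
For context, the core of GRPO is a group-relative advantage: several completions are sampled per prompt, and each completion's scalar reward is normalized by the group's mean and standard deviation, removing the need for a learned value function. The snippet below is an illustrative sketch of that computation only; the function name and reward values are ours, not from the training code, and the full recipe is described in the paper:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for G sampled completions of one prompt.

    Each completion's scalar reward is normalized against the group's
    mean and standard deviation, so no value network is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g., 4 sampled answers scored for format compliance and correctness
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(grpo_advantages(rewards))  # above-average answers get positive advantage
```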

Experimental results show that Patho-R1-7B achieves strong performance across key pathology tasks, including multiple-choice questions and visual question answering, highlighting its potential for real-world pathology AI applications.

*Figure: Patho-R1 overview.*

## Quickstart🏃

Here is a code snippet showing how to chat with the model using transformers and qwen_vl_utils:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "WenchuanZhang/Patho-R1-7B",
    torch_dtype="auto", device_map="auto"
)
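
# Optional: enable FlashAttention 2 for faster inference and lower memory use,
# mirroring standard Qwen2.5-VL usage (assumes flash-attn is installed;
# also add `import torch` for the bfloat16 dtype):
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "WenchuanZhang/Patho-R1-7B",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )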
processor = AutoProcessor.from_pretrained("WenchuanZhang/Patho-R1-7B")

# Example question from the PathMMU test set (ground truth: D).
# Two reasoning-style system prompts are available; choose one:
# - Chain-of-Draft (CoD), a concise reasoning strategy:
#   "You are a pathology expert, your task is to think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>"
# - Chain-of-Thought (CoT), used in the system message below:
messages = [
    {
        "role": "system",
        "content": "You are a pathology expert, your task is to answer question step by step. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./images/example.jpg",
            },
            {"type": "text", "text": "What feature in the provided micrograph is indicative of chronic inflammation? /n A. Granuloma formation /n B. Multinucleated giant cells /n C. Neutrophilic infiltration /n D. Plasma cells with eccentrically placed nuclei"},       
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
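
Since the system prompt asks the model to wrap its output in `<think>`/`<answer>` tags, it is often convenient to separate the reasoning from the final answer after decoding. Below is a minimal sketch that continues from the snippet above, assuming the model followed the requested format (it falls back to the raw text otherwise):

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split '<think>...</think><answer>...</answer>' into (reasoning, answer)."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if answer is None:
        # Model did not follow the format; return the raw text as the answer.
        return "", response
    return (think.group(1).strip() if think else ""), answer.group(1).strip()

reasoning, answer = split_reasoning(output_text[0])
print("Answer:", answer)
```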

## Acknowledgements🎖

We gratefully acknowledge the contributions of the open-source community, particularly the following projects, which laid the foundation for various components of this work:

- Qwen for providing powerful vision-language models that significantly advanced our multimodal understanding and generation capabilities.
- DocLayout-YOLO for document layout detection.
- PaddleOCR for comprehensive optical character recognition.
- ModelScope Swift for efficient model training and deployment tools.
- LLaMA-Factory for robust LLM training and fine-tuning pipelines.
- VERL for its efficient reinforcement learning training framework.
- DeepSeek for high-quality models and infrastructure supporting text understanding.

We thank the authors and contributors of these repositories for their dedication and impactful work, which made the development of Patho-R1 possible.

## Citation❤️

If you find our work helpful, a citation would be greatly appreciated:

```bibtex
@article{zhang2025patho,
  title={Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner},
  author={Zhang, Wenchuan and Zhang, Penghao and Guo, Jingru and Cheng, Tao and Chen, Jie and Zhang, Shuwan and Zhang, Zhang and Yi, Yuhao and Bu, Hong},
  journal={arXiv preprint arXiv:2505.11404},
  year={2025}
}
```