toroe committed on
Commit e10d740 · verified · 1 Parent(s): 341c168

Update README.md

Files changed (1)
  1. README.md +163 -150
README.md CHANGED
@@ -2,242 +2,255 @@
  language:
  - en
  license: other
- library_name: transformers
  pipeline_tag: text-generation
  tags:
  - clinical-nlp
- - medical-reasoning
  - icd-10-cm
  - reinforcement-learning
  - grpo
- - qwen2.5
- - diagnosis-prediction
- - chain-of-thought
- - research
  base_model:
  - Qwen/Qwen2.5-32B-Instruct
- model-index:
- - name: DeepICD-R1-zero-32B
-   results: []
  ---

  # DeepICD-R1-zero-32B

- DeepICD-R1-zero-32B is a clinical reasoning model for **ICD-10-CM diagnosis outcome prediction from admission notes**, obtained by applying **Group Relative Policy Optimization (GRPO)** to **Qwen2.5-32B-Instruct**.

- This model corresponds to the **GRPO-only large model** described in the DeepICD-R1 paper, where it serves as the first-stage reasoning model before the creation of the distilled supervised fine-tuning dataset used for smaller downstream models. In the paper, this model is referred to as **DeepICD-R1-zero-32B**. It is part of the broader **DeepICD-R1** framework for hierarchical medical reasoning with verifiable rewards.

- ## Relation to the paper

- This repository contains the model corresponding to the **“zero” GRPO-trained large model** in:

- **DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation**

- In the paper, the overall framework has two stages:

- 1. A large instruction-tuned model is optimized with **GRPO** using structured clinical rewards. This yields **DeepICD-R1-zero-32B**.
- 2. That model is then used to generate reasoning traces, which are distilled into a large supervised dataset for smaller models such as DeepICD-R1-7B.

- The paper describes this role as follows: the large base LLM is trained with GRPO and dedicated reward functions to produce **DeepICD-R1-zero-32B**, which is then used in dataset construction and later fine-tuning stages.

- ## Model description

- - **Model name:** DeepICD-R1-zero-32B
- - **Base model:** `Qwen/Qwen2.5-32B-Instruct`
- - **Training method:** Reinforcement learning with **GRPO**
- - **Domain:** Clinical NLP
- - **Task:** Predicting the first annotated **ICD-10-CM diagnosis code** from admission notes
- - **Input:** Admission note text
- - **Output:** Structured reasoning plus a predicted ICD-10-CM code

- The model is trained to generate outputs in the following structure:

- ```xml
- <think>
- ...
- </think>
- <diagnosis>
- ...
- </diagnosis>
- ```

- The paper uses this structured format together with **hierarchical ICD-aware rewards** and an **LLM-based reasoning reward**.

  ---

  # Intended Use

- This model is intended for:

- - research on clinical reasoning with language models
- - ICD-10-CM outcome prediction from admission notes
- - studying reinforcement learning with verifiable hierarchical rewards
- - generating reasoning traces for analysis or data distillation
- - reproducing or extending the DeepICD-R1 framework

  ---

- # Out-of-Scope Use

- This model is **not intended for**:

- - real-world diagnosis
- - clinical decision support in production
- - autonomous medical coding in care settings
- - unsupervised deployment on patient data
- - use without human oversight

- As emphasized in the paper, this is a **research prototype** and must not be used for real-world diagnosis or clinical decision-making. Generated reasoning may appear plausible while still being clinically incorrect.

  ---

  # Training Data

- The model was trained on **MIMIC-IV admission notes** for single-label prospective ICD-10-CM outcome prediction.
- According to the paper, the task is formulated as predicting the **first annotated diagnosis code from admission-time information**, using MIMIC-IV admission notes and excluding leakage-prone diagnostic and treatment sections.
- A PhysioNet link will be added soon.

- ---

- # Training Procedure
-
- This model was trained with the **verl PPO trainer** using **GRPO** as the advantage estimator.
-
- ## Core Setup
-
- - **Trainer:** `verl.trainer.main_ppo`
- - **Advantage estimator:** `grpo`
- - **Base model:** `Qwen/Qwen2.5-32B-Instruct`
- - **Epochs:** 1
- - **Effective train batch size:** 64
- - **Rollouts per prompt:** 8
- - **Max prompt length:** 2048
- - **Max response length:** 1024
- - **Sampling temperature:** 0.9
- - **Learning rate:** 1e-6
- - **Warmup steps:** 80
- - **Entropy coefficient:** 0.001
- - **KL loss:** disabled
- - **Actor torch compile:** enabled
- - **Gradient checkpointing:** enabled
- - **Rollout engine:** vLLM
- - **Rollout dtype:** bfloat16
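The group-relative advantage at the heart of GRPO can be sketched in a few lines. This is an illustrative simplification, not the verl implementation: the rewards of the rollouts sampled for one prompt are normalized by their group mean and standard deviation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each rollout's reward is normalized by the
    mean and standard deviation of its prompt's rollout group (simplified)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# e.g. 8 rollouts for one prompt, matching the config above
advantages = grpo_advantages([15, 0, 15, 1, 0, 0, 15, 1])
```

Rollouts that beat their group mean receive a positive advantage, so the policy update needs no learned value function.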

  ---

- # Hardware

- - **GPUs:** 8
- - **Nodes:** 1
- - **GPU type:** not explicitly specified in the config
- - **Memory limit:** 512 GiB

- ---

- # Reward Setup

- Training used a **custom batched reward function** with the following active components:

- - Outcome reward: enabled
- - Format reward: enabled
- - LLM-as-a-judge reward: enabled
- - Judge RAG: enabled
- - Guidelines file: ICD-10-CM chapter guidelines JSON
- - Judge model: `meta-llama/Llama-3.1-8B-Instruct`

- ## Selected Reward Environment Settings

- ACTIVATE_OUTCOME_REWARD=True
- ACTIVATE_FORMAT_REWARD=True
- JUDGE_RAG_ENABLED=True
- NO_MATCH_MALUS=-1
- THINK_TRACE_REWARD=1
- MATCH_REWARD=15
- LLM_REWARD_SCALING=0.8

- This aligns with the paper’s training design, which combines:

- - a **format reward** for `<think>` and `<diagnosis>` structure
- - a **hierarchical ICD outcome reward**
- - an **LLM-as-a-judge reward** to improve reasoning clarity and consistency
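As an illustration of how such settings could combine into a scalar reward, here is a hypothetical sketch. The function `combined_reward` and its argument names are invented for this example; the repository's actual reward code is not reproduced here.

```python
def combined_reward(format_ok, outcome_match, judge_score,
                    match_reward=15, no_match_malus=-1,
                    think_trace_reward=1, llm_scaling=0.8):
    """Hypothetical combination of the settings listed above into one
    scalar reward (illustrative only, not the released reward function)."""
    reward = 0.0
    if format_ok:                      # format reward for <think>/<diagnosis>
        reward += think_trace_reward   # THINK_TRACE_REWARD=1
    # hierarchical outcome reward: MATCH_REWARD on a match, NO_MATCH_MALUS otherwise
    reward += match_reward if outcome_match else no_match_malus
    reward += llm_scaling * judge_score  # LLM_REWARD_SCALING=0.8
    return reward

combined_reward(True, True, 1.0)  # well-formed, correct code, judge score 1.0
```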

  ---

- # Prompt / Output Format

- The model expects an **admission note** and is trained to return:

- 1. a reasoning trace inside `<think>...</think>`
- 2. a predicted ICD-10-CM code inside `<diagnosis>...</diagnosis>`

- ## Example Schema

- ```xml
- <think>
- Reasoning over presenting symptoms, history, and admission note evidence.
- </think>
-
- <diagnosis>
- M5116
- </diagnosis>
- ```

- Users should validate that generated outputs conform to this format before downstream evaluation.

  ---

- # Evaluation

- In the paper, evaluation is performed on **hierarchical ICD-10-CM prediction tasks** at three levels:

- - **Chapter**
- - **Category**
- - **Full diagnosis code**

- The paper reports that the **GRPO-only 32B model improves over the instruction-tuned baseline**, but remains weaker than models trained with **both supervised fine-tuning and GRPO**, especially for **fine-grained full-code prediction**.

- This repository does **not claim any additional benchmark results** beyond those reported in the paper unless explicitly added later.

- ---

- # Limitations

- Important limitations discussed in the paper include:

- - reasoning traces may be coherent but **not clinically correct**
- - the model can exhibit **premature diagnostic closure**
- - performance drops on **fine-grained and rare ICD codes**
- - the underlying data reflects **institutional and demographic bias**
- - the model may fail to capture the **severity or clinical significance of diagnoses**
- - reinforcement signals based on **automatic rewards and LLM judging are only proxies for expert review**

- The paper also notes that clinicians often preferred **concise reasoning**, and that plausible-looking outputs may still omit important differential diagnoses.

  ---

- # Ethical Considerations

- - Trained on **de-identified MIMIC-IV data** under the applicable data-use framework
- - **Research-only release**
- - Not suitable for **patient-facing or clinician-facing decision support** without substantial additional validation
- - May propagate **dataset bias and disease-frequency imbalance**
- - Outputs should **not be interpreted as medical advice**

- Please read the paper’s **Ethical Considerations** and **Limitations** sections before using this model.

  ---

  ---

- # Citation

- If you use this model, please cite the paper:

  ```bibtex
- @article{roehr2026deepicdr1,
    title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
- author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and Figueroa, Alexei and Grundmann, Paul and Troeger, Peter and Gers, Felix and L{\"o}ser, Alexander},
- year={2026},
- journal={Proceedings of LREC-COLING 2026}
  }
  ```

  language:
  - en
  license: other
  pipeline_tag: text-generation
  tags:
  - clinical-nlp
+ - medical-coding
+ - icd10
  - icd-10-cm
+ - reasoning
  - reinforcement-learning
  - grpo
+ - healthcare
  base_model:
  - Qwen/Qwen2.5-32B-Instruct
  ---

  # DeepICD-R1-zero-32B

+ ## Model Summary

+ **DeepICD-R1-zero-32B** is a clinical reasoning model designed for **ICD-10-CM diagnosis outcome prediction from admission notes**.
+ It follows the **DeepICD-R1 framework**, which treats diagnosis prediction as a reasoning task optimized with reinforcement learning and structured reward signals.

+ This checkpoint corresponds to an **“R1-Zero”-style model**, meaning it was trained primarily through **reinforcement learning without a supervised fine-tuning (SFT) initialization**, allowing reasoning behaviors to emerge directly from reward optimization.

+ The approach is inspired by reasoning-focused training pipelines in which reinforcement learning alone can induce structured reasoning behaviors and self-verification in large language models.

+ ---

+ # Model Details

+ - **Model name:** DeepICD-R1-zero-32B
+ - **Organization:** DATEXIS
+ - **Model size:** ~32B parameters
+ - **Task:** Single ICD-10-CM diagnosis prediction from clinical text
+ - **Training paradigm:** Reinforcement learning (GRPO-style)
+ - **Framework:** VERL reinforcement learning trainer
+ - **Domain:** Clinical NLP / medical reasoning

+ ### Related Research

+ This model follows the **DeepICD-R1** framework introduced in:

+ > *DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation*

+ The paper proposes a system for diagnosis prediction that combines:

+ - structured reasoning traces
+ - hierarchical reward signals aligned with ICD code structure
+ - reinforcement learning for reasoning optimization

  ---

  # Intended Use

+ This model is intended for **research purposes**, including:
+
+ - clinical reasoning experiments
+ - ICD-10-CM code prediction research
+ - reinforcement learning for language models
+ - reasoning trace generation
+ - structured prediction from clinical notes

+ ### Out-of-Scope Use
+
+ This model **must not** be used for:
+
+ - medical diagnosis
+ - clinical decision making
+ - patient triage
+ - automated medical coding without expert supervision
+ - billing or compliance workflows

  ---

+ # Training Methodology
+
+ ## R1-Zero Training Paradigm

+ The model follows a **Zero-stage reasoning training approach**, in which reinforcement learning is applied directly to a base language model without prior supervised instruction tuning.

+ This method encourages the model to discover reasoning strategies autonomously during training, allowing behaviors such as:

+ - chain-of-thought reasoning
+ - self-verification
+ - iterative reasoning refinement
+
+ to emerge naturally from the reward signal.
+
+ However, purely RL-trained models may also exhibit issues such as:
+
+ - repetitive reasoning patterns
+ - readability problems
+ - mixed-language outputs

  ---

  # Training Data

+ The training task uses **clinical admission notes paired with ICD-10-CM diagnoses**, derived from de-identified electronic health record datasets such as **MIMIC-IV**.

+ Task formulation:
+
+ - **Input:** admission note describing a patient case
+ - **Output:** reasoning trace and predicted ICD-10-CM code

+ The model learns to infer diagnostic outcomes from the textual description of the patient presentation.

  ---

+ # Output Format

+ The model is trained to produce structured outputs separating reasoning from the final diagnosis.

+ ### Example

+ ```text
+ <think>
+ The patient presents with ...
+ Symptoms and history suggest ...
+ ...
+ </think>
+
+ <diagnosis>
+ M5116
+ </diagnosis>
+ ```

+ The reasoning trace allows the model to explain how the diagnosis is derived from the clinical note.
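For downstream evaluation, the two tagged spans can be pulled out with a simple parser. A minimal sketch, not part of the released code:

```python
import re

def parse_output(text):
    """Extract the reasoning trace and the predicted ICD-10-CM code
    from a <think>/<diagnosis> structured response (illustrative sketch)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    diag = re.search(r"<diagnosis>(.*?)</diagnosis>", text, re.DOTALL)
    if not (think and diag):
        return None  # malformed output: discard or retry
    return think.group(1).strip(), diag.group(1).strip()

parse_output("<think>back pain</think>\n<diagnosis>M5116</diagnosis>")
# → ("back pain", "M5116")
```

Returning `None` on malformed outputs mirrors the format reward used in training: responses that break the tag structure are rejected rather than partially scored.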

+ ---

+ ## Evaluation

+ Evaluation follows the methodology described in the **DeepICD-R1 paper**.

+ Performance is typically measured using **macro-averaged F1 scores** at multiple levels of the ICD hierarchy.

+ | Level | Description |
+ |-------|-------------|
+ | Chapter | Broad ICD chapter grouping |
+ | Category | First three characters of the code |
+ | Full code | Complete ICD-10-CM code |

+ Hierarchical evaluation allows partial credit when the model predicts the correct high-level diagnostic category even if the full code is incorrect.
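The partial-credit idea can be sketched as a prefix comparison. This is illustrative only; the chapter level is omitted here because real ICD-10-CM chapters are defined by official code ranges, not by a simple prefix:

```python
def hierarchical_match(pred, gold):
    """Report which levels of the ICD hierarchy agree (illustrative sketch;
    a real evaluator would also map codes to chapters via official ranges)."""
    return {
        "category": pred[:3] == gold[:3],  # first three characters, e.g. "M51"
        "full_code": pred == gold,
    }

hierarchical_match("M5116", "M5126")  # category matches, full code does not
```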
 
 

  ---

+ ## Limitations

+ Models following the DeepICD-R1 framework share several limitations.

+ ### Dataset limitations

+ - Training data consists primarily of **English clinical notes**
+ - The distribution reflects **hospital-specific patient populations**
+ - ICD labels are **highly imbalanced**, affecting rare diagnoses

+ ### Model limitations

+ - Reasoning traces may appear convincing while being incorrect
+ - Predictions may fail for rare or long-tail diagnoses
+ - Models may demonstrate **premature diagnostic closure**
+ - Reinforcement learning signals are only proxies for expert feedback


  ---

+ ## Ethical Considerations

+ This model is trained on **de-identified clinical data** and is intended strictly for research.

+ Potential risks include:

+ - propagation of dataset biases
+ - overconfidence in generated reasoning
+ - misuse in clinical decision making

+ Appropriate safeguards include:

+ - expert oversight
+ - dataset bias evaluation
+ - fairness audits
+ - controlled deployment environments


+ ---

+ ## Hardware and Training Setup

+ Typical training configuration for models in this family includes:

+ - **GPUs:** multi-GPU training (4–8 GPUs)
+ - **Precision:** bfloat16
+ - **Rollout engine:** vLLM
+ - **Training framework:** VERL PPO/GRPO trainer
+ - **Sampling:** multiple rollouts per prompt

  ---

+ ## Usage
+
+ ### Transformers Example
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "DATEXIS/DeepICD-R1-zero-32B"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype="auto",
+ )
+
+ prompt = """
+ You are a clinical reasoning model.
+
+ Given the following admission note,
+ produce reasoning in <think> tags
+ and a final ICD-10 diagnosis in <diagnosis> tags.
+
+ [ADMISSION NOTE]
+ """
+
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=512,
+ )
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Recommended Inference Practices
+
+ - Use prompts consistent with the training format.
+ - Validate predicted ICD-10 codes against official code formats.
+ - Always review predictions with medical experts.
+ - Avoid exposing reasoning traces in safety-critical settings without verification.
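The code-format validation suggested above can be approximated with a regular expression. This is an illustrative sketch that checks syntactic shape only; it does not confirm that a code actually exists in the official ICD-10-CM code set.

```python
import re

# Rough syntactic shape of an ICD-10-CM code: a letter, a digit, one more
# alphanumeric character completing the category, then an optional dot and
# up to four alphanumeric extension characters. Shape check only; it does
# not verify membership in the official code set.
ICD10CM_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.?[0-9A-Z]{1,4})?$")

def looks_like_icd10cm(code):
    return bool(ICD10CM_SHAPE.match(code.strip().upper()))

looks_like_icd10cm("M5116")   # True: M51.16 written without the dot
looks_like_icd10cm("hello")   # False
```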

  ---

+ ## Citation

+ If you use this model, please cite:

  ```bibtex
+ @inproceedings{roehr2026deepicdr1,
    title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
+   author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and others},
+   booktitle={Proceedings of LREC-COLING},
+   year={2026}
  }
  ```