toroe committed on
Commit e10d740 · verified · 1 Parent(s): 341c168

Update README.md

Files changed (1)
  1. README.md +163 -150
README.md CHANGED
@@ -2,242 +2,255 @@
  language:
  - en
  license: other
- library_name: transformers
  pipeline_tag: text-generation
  tags:
  - clinical-nlp
- - medical-reasoning
  - icd-10-cm
  - reinforcement-learning
  - grpo
- - qwen2.5
- - diagnosis-prediction
- - chain-of-thought
- - research
  base_model:
  - Qwen/Qwen2.5-32B-Instruct
- model-index:
- - name: DeepICD-R1-zero-32B
-   results: []
  ---

  # DeepICD-R1-zero-32B

- DeepICD-R1-zero-32B is a clinical reasoning model for **ICD-10-CM diagnosis outcome prediction from admission notes**, obtained by applying **Group Relative Policy Optimization (GRPO)** to **Qwen2.5-32B-Instruct**.

- This model corresponds to the **GRPO-only large model** described in the DeepICD-R1 paper, where it serves as the first-stage reasoning model before the creation of the distilled supervised fine-tuning dataset used for smaller downstream models. In the paper, this model is referred to as **DeepICD-R1-zero-32B**. It is part of the broader **DeepICD-R1** framework for hierarchical medical reasoning with verifiable rewards.

- ## Relation to the paper

- This repository contains the model corresponding to the **“zero” GRPO-trained large model** in:

- **DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation**

- In the paper, the overall framework has two stages:

- 1. A large instruction-tuned model is optimized with **GRPO** using structured clinical rewards. This yields **DeepICD-R1-zero-32B**.
- 2. That model is then used to generate reasoning traces, which are distilled into a large supervised dataset for smaller models such as DeepICD-R1-7B.

- The paper describes this role as follows: the large base LLM is trained with GRPO and dedicated reward functions to produce **DeepICD-R1-zero-32B**, which is then used in dataset construction and later fine-tuning stages.

- ## Model description

- - **Model name:** DeepICD-R1-zero-32B
- - **Base model:** `Qwen/Qwen2.5-32B-Instruct`
- - **Training method:** Reinforcement learning with **GRPO**
- - **Domain:** Clinical NLP
- - **Task:** Predicting the first annotated **ICD-10-CM diagnosis code** from admission notes
- - **Input:** Admission note text
- - **Output:** Structured reasoning plus a predicted ICD-10-CM code

- The model is trained to generate outputs in the following structure:

- ```xml
- <think>
- ...
- </think>
- <diagnosis>
- ...
- </diagnosis>
- ```

- The paper uses this structured format together with **hierarchical ICD-aware rewards** and an **LLM-based reasoning reward**.

  ---

  # Intended Use

- This model is intended for:

- - research on clinical reasoning with language models
- - ICD-10-CM outcome prediction from admission notes
- - studying reinforcement learning with verifiable hierarchical rewards
- - generating reasoning traces for analysis or data distillation
- - reproducing or extending the DeepICD-R1 framework

  ---

- # Out-of-Scope Use

- This model is **not intended for**:

- - real-world diagnosis
- - clinical decision support in production
- - autonomous medical coding in care settings
- - unsupervised deployment on patient data
- - use without human oversight

- As emphasized in the paper, this is a **research prototype** and must not be used for real-world diagnosis or clinical decision-making. Generated reasoning may appear plausible while still being clinically incorrect.

  ---

  # Training Data

- The model was trained on **MIMIC-IV admission notes** for single-label prospective ICD-10-CM outcome prediction.
- According to the paper, the task is formulated as predicting the **first annotated diagnosis code from admission-time information**, using MIMIC-IV admission notes and excluding leakage-prone diagnostic and treatment sections.
- A PhysioNet link will be added soon.

- ---

- # Training Procedure
-
- This model was trained with the **verl PPO trainer** using **GRPO** as the advantage estimator.
-
- ## Core Setup
-
- - **Trainer:** `verl.trainer.main_ppo`
- - **Advantage estimator:** `grpo`
- - **Base model:** `Qwen/Qwen2.5-32B-Instruct`
- - **Epochs:** 1
- - **Effective train batch size:** 64
- - **Rollouts per prompt:** 8
- - **Max prompt length:** 2048
- - **Max response length:** 1024
- - **Sampling temperature:** 0.9
- - **Learning rate:** 1e-6
- - **Warmup steps:** 80
- - **Entropy coefficient:** 0.001
- - **KL loss:** disabled
- - **Actor torch compile:** enabled
- - **Gradient checkpointing:** enabled
- - **Rollout engine:** vLLM
- - **Rollout dtype:** bfloat16
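The group-relative advantage at the heart of GRPO can be sketched in a few lines. This is an illustrative simplification, not the verl implementation: the rewards of the rollouts sampled for one prompt are normalized by their group mean and standard deviation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each rollout's reward is normalized by the
    mean and standard deviation of its prompt's rollout group (simplified)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# e.g. 8 rollouts for one prompt, matching the config above
advantages = grpo_advantages([15, 0, 15, 1, 0, 0, 15, 1])
```

Rollouts that beat their group mean receive a positive advantage, so the policy update needs no learned value function.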

  ---

- # Hardware

- - **GPUs:** 8
- - **Nodes:** 1
- - **GPU type:** not explicitly specified in the config
- - **Memory limit:** 512 GiB

- ---

- # Reward Setup

- Training used a **custom batched reward function** with the following active components:

- - Outcome reward: enabled
- - Format reward: enabled
- - LLM-as-a-judge reward: enabled
- - Judge RAG: enabled
- - Guidelines file: ICD-10-CM chapter guidelines JSON
- - Judge model: `meta-llama/Llama-3.1-8B-Instruct`

- ## Selected Reward Environment Settings

- ACTIVATE_OUTCOME_REWARD=True
- ACTIVATE_FORMAT_REWARD=True
- JUDGE_RAG_ENABLED=True
- NO_MATCH_MALUS=-1
- THINK_TRACE_REWARD=1
- MATCH_REWARD=15
- LLM_REWARD_SCALING=0.8

- This aligns with the paper’s training design, which combines:

- - a **format reward** for `<think>` and `<diagnosis>` structure
- - a **hierarchical ICD outcome reward**
- - an **LLM-as-a-judge reward** to improve reasoning clarity and consistency
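As an illustration of how such settings could combine into a scalar reward, here is a hypothetical sketch. The function `combined_reward` and its argument names are invented for this example; the repository's actual reward code is not reproduced here.

```python
def combined_reward(format_ok, outcome_match, judge_score,
                    match_reward=15, no_match_malus=-1,
                    think_trace_reward=1, llm_scaling=0.8):
    """Hypothetical combination of the settings listed above into one
    scalar reward (illustrative only, not the released reward function)."""
    reward = 0.0
    if format_ok:                      # format reward for <think>/<diagnosis>
        reward += think_trace_reward   # THINK_TRACE_REWARD=1
    # hierarchical outcome reward: MATCH_REWARD on a match, NO_MATCH_MALUS otherwise
    reward += match_reward if outcome_match else no_match_malus
    reward += llm_scaling * judge_score  # LLM_REWARD_SCALING=0.8
    return reward

combined_reward(True, True, 1.0)  # well-formed, correct code, judge score 1.0
```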

  ---

- # Prompt / Output Format

- The model expects an **admission note** and is trained to return:

- 1. a reasoning trace inside `<think>...</think>`
- 2. a predicted ICD-10-CM code inside `<diagnosis>...</diagnosis>`

- ## Example Schema

- ```xml
- <think>
- Reasoning over presenting symptoms, history, and admission note evidence.
- </think>
-
- <diagnosis>
- M5116
- </diagnosis>
- ```

- Users should validate that generated outputs conform to this format before downstream evaluation.

  ---

- # Evaluation

- In the paper, evaluation is performed on **hierarchical ICD-10-CM prediction tasks** at three levels:

- - **Chapter**
- - **Category**
- - **Full diagnosis code**

- The paper reports that the **GRPO-only 32B model improves over the instruction-tuned baseline**, but remains weaker than models trained with **both supervised fine-tuning and GRPO**, especially for **fine-grained full-code prediction**.

- This repository does **not claim any additional benchmark results** beyond those reported in the paper unless explicitly added later.

- ---

- # Limitations

- Important limitations discussed in the paper include:

- - reasoning traces may be coherent but **not clinically correct**
- - the model can exhibit **premature diagnostic closure**
- - performance drops on **fine-grained and rare ICD codes**
- - the underlying data reflects **institutional and demographic bias**
- - the model may fail to capture the **severity or clinical significance of diagnoses**
- - reinforcement signals based on **automatic rewards and LLM judging are only proxies for expert review**

- The paper also notes that clinicians often preferred **concise reasoning**, and that plausible-looking outputs may still omit important differential diagnoses.

  ---

- # Ethical Considerations

- - Trained on **de-identified MIMIC-IV data** under the applicable data-use framework
- - **Research-only release**
- - Not suitable for **patient-facing or clinician-facing decision support** without substantial additional validation
- - May propagate **dataset bias and disease-frequency imbalance**
- - Outputs should **not be interpreted as medical advice**

- Please read the paper’s **Ethical Considerations** and **Limitations** sections before using this model.

  ---

  ---

- # Citation

- If you use this model, please cite the paper:

  ```bibtex
- @article{roehr2026deepicdr1,
    title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
- author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and Figueroa, Alexei and Grundmann, Paul and Troeger, Peter and Gers, Felix and L{\"o}ser, Alexander},
- year={2026},
- journal={Proceedings of LREC-COLING 2026}
  }
  ```

  language:
  - en
  license: other
  pipeline_tag: text-generation
  tags:
  - clinical-nlp
+ - medical-coding
+ - icd10
  - icd-10-cm
+ - reasoning
  - reinforcement-learning
  - grpo
+ - healthcare
  base_model:
  - Qwen/Qwen2.5-32B-Instruct
  ---

  # DeepICD-R1-zero-32B

+ ## Model Summary

+ **DeepICD-R1-zero-32B** is a clinical reasoning model designed for **ICD-10-CM diagnosis outcome prediction from admission notes**.
+ It follows the **DeepICD-R1 framework**, which treats diagnosis prediction as a reasoning task optimized with reinforcement learning and structured reward signals.

+ This checkpoint corresponds to an **“R1-Zero”-style model**, meaning it was trained primarily through **reinforcement learning without a supervised fine-tuning (SFT) initialization**, allowing reasoning behaviors to emerge directly from reward optimization.

+ The approach is inspired by reasoning-focused training pipelines in which reinforcement learning alone can induce structured reasoning behaviors and self-verification in large language models.

+ ---

+ # Model Details

+ - **Model name:** DeepICD-R1-zero-32B
+ - **Organization:** DATEXIS
+ - **Model size:** ~32B parameters
+ - **Task:** Single ICD-10-CM diagnosis prediction from clinical text
+ - **Training paradigm:** Reinforcement learning (GRPO-style)
+ - **Framework:** VERL reinforcement learning trainer
+ - **Domain:** Clinical NLP / medical reasoning

+ ### Related Research

+ This model follows the **DeepICD-R1** framework introduced in:

+ > *DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation*

+ The paper proposes a system for diagnosis prediction that combines:

+ - structured reasoning traces
+ - hierarchical reward signals aligned with ICD code structure
+ - reinforcement learning for reasoning optimization

  ---

  # Intended Use

+ This model is intended for **research purposes**, including:
+
+ - clinical reasoning experiments
+ - ICD-10-CM code prediction research
+ - reinforcement learning for language models
+ - reasoning trace generation
+ - structured prediction from clinical notes

+ ### Out-of-Scope Use
+
+ This model **must not** be used for:
+
+ - medical diagnosis
+ - clinical decision making
+ - patient triage
+ - automated medical coding without expert supervision
+ - billing or compliance workflows

  ---

+ # Training Methodology
+
+ ## R1-Zero Training Paradigm

+ The model follows a **Zero-stage reasoning training approach**, in which reinforcement learning is applied directly to a base language model without prior supervised instruction tuning.

+ This method encourages the model to discover reasoning strategies autonomously during training, allowing behaviors such as:

+ - chain-of-thought reasoning
+ - self-verification
+ - iterative reasoning refinement
+
+ to emerge naturally from the reward signal.
+
+ However, purely RL-trained models may also exhibit issues such as:
+
+ - repetitive reasoning patterns
+ - readability problems
+ - mixed-language outputs

  ---

  # Training Data

+ The training task uses **clinical admission notes paired with ICD-10-CM diagnoses**, derived from de-identified electronic health record datasets such as **MIMIC-IV**.

+ Task formulation:
+
+ - **Input:** admission note describing a patient case
+ - **Output:** reasoning trace and predicted ICD-10-CM code

+ The model learns to infer diagnostic outcomes from the textual description of the patient presentation.

  ---

+ # Output Format

+ The model is trained to produce structured outputs separating reasoning from the final diagnosis.

+ ### Example

+ ```text
+ <think>
+ The patient presents with ...
+ Symptoms and history suggest ...
+ ...
+ </think>
+
+ <diagnosis>
+ M5116
+ </diagnosis>
+ ```

+ The reasoning trace allows the model to explain how the diagnosis is derived from the clinical note.
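For downstream evaluation, the two tagged spans can be pulled out with a simple parser. A minimal sketch, not part of the released code:

```python
import re

def parse_output(text):
    """Extract the reasoning trace and the predicted ICD-10-CM code
    from a <think>/<diagnosis> structured response (illustrative sketch)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    diag = re.search(r"<diagnosis>(.*?)</diagnosis>", text, re.DOTALL)
    if not (think and diag):
        return None  # malformed output: discard or retry
    return think.group(1).strip(), diag.group(1).strip()

parse_output("<think>back pain</think>\n<diagnosis>M5116</diagnosis>")
# → ("back pain", "M5116")
```

Returning `None` on malformed outputs mirrors the format reward used in training: responses that break the tag structure are rejected rather than partially scored.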

+ ---

+ ## Evaluation

+ Evaluation follows the methodology described in the **DeepICD-R1 paper**.

+ Performance is typically measured using **macro-averaged F1 scores** at multiple levels of the ICD hierarchy.

+ | Level | Description |
+ |-------|-------------|
+ | Chapter | Broad ICD chapter grouping |
+ | Category | First three characters of the code |
+ | Full code | Complete ICD-10-CM code |

+ Hierarchical evaluation allows partial credit when the model predicts the correct high-level diagnostic category even if the full code is incorrect.
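The partial-credit idea can be sketched as a prefix comparison. This is illustrative only; the chapter level is omitted here because real ICD-10-CM chapters are defined by official code ranges, not by a simple prefix:

```python
def hierarchical_match(pred, gold):
    """Report which levels of the ICD hierarchy agree (illustrative sketch;
    a real evaluator would also map codes to chapters via official ranges)."""
    return {
        "category": pred[:3] == gold[:3],  # first three characters, e.g. "M51"
        "full_code": pred == gold,
    }

hierarchical_match("M5116", "M5126")  # category matches, full code does not
```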
 
 

  ---

+ ## Limitations

+ Models following the DeepICD-R1 framework share several limitations.

+ ### Dataset limitations

+ - Training data consists primarily of **English clinical notes**
+ - The distribution reflects **hospital-specific patient populations**
+ - ICD labels are **highly imbalanced**, affecting rare diagnoses

+ ### Model limitations

+ - Reasoning traces may appear convincing while being incorrect
+ - Predictions may fail for rare or long-tail diagnoses
+ - Models may demonstrate **premature diagnostic closure**
+ - Reinforcement learning signals are only proxies for expert feedback


  ---

+ ## Ethical Considerations

+ This model is trained on **de-identified clinical data** and is intended strictly for research.

+ Potential risks include:

+ - propagation of dataset biases
+ - overconfidence in generated reasoning
+ - misuse in clinical decision making

+ Appropriate safeguards include:

+ - expert oversight
+ - dataset bias evaluation
+ - fairness audits
+ - controlled deployment environments


+ ---

+ ## Hardware and Training Setup

+ Typical training configuration for models in this family includes:

+ - **GPUs:** multi-GPU training (4–8 GPUs)
+ - **Precision:** bfloat16
+ - **Rollout engine:** vLLM
+ - **Training framework:** VERL PPO/GRPO trainer
+ - **Sampling:** multiple rollouts per prompt

  ---

+ ## Usage
+
+ ### Transformers Example
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "DATEXIS/DeepICD-R1-zero-32B"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype="auto",
+ )
+
+ prompt = """
+ You are a clinical reasoning model.
+
+ Given the following admission note,
+ produce reasoning in <think> tags
+ and a final ICD-10 diagnosis in <diagnosis> tags.
+
+ [ADMISSION NOTE]
+ """
+
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=512,
+ )
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Recommended Inference Practices
+
+ - Use prompts consistent with the training format.
+ - Validate predicted ICD-10 codes against official code formats.
+ - Always review predictions with medical experts.
+ - Avoid exposing reasoning traces in safety-critical settings without verification.
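The code-format validation suggested above can be approximated with a regular expression. This is an illustrative sketch that checks syntactic shape only; it does not confirm that a code actually exists in the official ICD-10-CM code set.

```python
import re

# Rough syntactic shape of an ICD-10-CM code: a letter, a digit, one more
# alphanumeric character completing the category, then an optional dot and
# up to four alphanumeric extension characters. Shape check only; it does
# not verify membership in the official code set.
ICD10CM_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.?[0-9A-Z]{1,4})?$")

def looks_like_icd10cm(code):
    return bool(ICD10CM_SHAPE.match(code.strip().upper()))

looks_like_icd10cm("M5116")   # True: M51.16 written without the dot
looks_like_icd10cm("hello")   # False
```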

  ---

+ ## Citation

+ If you use this model, please cite:

  ```bibtex
+ @inproceedings{roehr2026deepicdr1,
    title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
+   author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and others},
+   booktitle={Proceedings of LREC-COLING},
+   year={2026}
  }
  ```