💬 Discharge Summary QA

Qwen2.5-3B fine-tuned for discharge summary QA

Part of the Medical AI Fine-tuned Model Suite — 16 specialist models, one per task

TL;DR

Answers specific questions about a patient's hospitalization using only the information in their discharge summary.

INPUT:  DISCHARGE SUMMARY: [72M, CHF admission, discharged on furosemide 80mg, carvedilol 12.5mg BD, sacubitril/valsartan]\n\nQUESTION: What medications was the patient discharged on?
OUTPUT: The patient was discharged on three medications: 1) Furosemide 80mg once daily, 2) Carvedilol 12.5mg twice daily, 3) Sacubitril/Valsartan 24/26mg twice daily.


Base model	unsloth/Qwen2.5-3B-Instruct
Method	QLoRA, 4-bit NF4, rank 16
Training data	discharge-qa-sft — 30,000 real-world rows
Training compute	NVIDIA A6000 (48GB), ~1.5h
License	Apache 2.0

Architecture

                  +-------------------------+
  user prompt --> |  Qwen2.5-3B-Instruct    | --> base weights (frozen, 4-bit NF4)
                  |  + LoRA adapter (r=16)  | --> discharge-qa-qwen25-3b
                  +-------------------------+
                              |
                              v
                      free-text answer
                  (grounded in the summary)

This repo contains only the LoRA adapter (~60MB), not the full merged weights. Load it on top of the base model as shown below — this keeps the download small and lets you swap adapters on one base model in memory.

Intended use

Let care teams or patients ask precise questions about a discharge summary instead of reading the entire document.

Direct use

Provide a discharge summary plus a question, get back an answer grounded in that document.
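
The input layout used throughout this card (a `DISCHARGE SUMMARY:` block, a blank line, then `QUESTION:`) can be wrapped in a small helper. This is a minimal sketch: `build_messages` and `SYSTEM_PROMPT` are hypothetical names, not something shipped with the adapter, and the system prompt text is copied from the Quickstart examples.

```python
SYSTEM_PROMPT = (
    "You are a clinical QA assistant. Answer the question based on the "
    "discharge summary provided. Be specific and cite relevant details."
)


def build_messages(summary: str, question: str) -> list[dict[str, str]]:
    """Return chat messages in the DISCHARGE SUMMARY / QUESTION layout."""
    if not summary.strip():
        # The model answers only from the document, so an empty summary is a caller error.
        raise ValueError("a discharge summary is required")
    user = f"DISCHARGE SUMMARY: {summary.strip()}\n\nQUESTION: {question.strip()}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```

The returned list can be passed directly to `tokenizer.apply_chat_template` or to an OpenAI-compatible `messages` field.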

Downstream use

Power a patient-facing portal Q&A widget, or a care-transition checklist generator for receiving facilities.

Out of scope

Answering questions about information not contained in the provided summary — the model is not a general medical knowledge base and should not be asked open clinical questions unrelated to the document.

This model is not a substitute for a certified medical professional's judgment. Output should be reviewed by a qualified person before being used in a clinical or billing decision.

Quickstart

Option A — Transformers + PEFT

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = "unsloth/Qwen2.5-3B-Instruct"
adapter    = "AmareshHebbar/discharge-qa-qwen25-3b"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)

messages = [
    {"role": "system", "content": "You are a clinical QA assistant. Answer the question based on the discharge summary provided. Be specific and cite relevant details."},
    {"role": "user", "content": "DISCHARGE SUMMARY: [72M, CHF admission, discharged on furosemide 80mg, carvedilol 12.5mg BD, sacubitril/valsartan]\n\nQUESTION: What medications was the patient discharged on?"},
]
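# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))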

Expected output (wording may vary slightly because sampling is enabled):

The patient was discharged on three medications: 1) Furosemide 80mg once daily, 2) Carvedilol 12.5mg twice daily, 3) Sacubitril/Valsartan 24/26mg twice daily.

Option B — Unsloth (2x faster load + inference)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="AmareshHebbar/discharge-qa-qwen25-3b",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": "You are a clinical QA assistant. Answer the question based on the discharge summary provided. Be specific and cite relevant details."},
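    {"role": "user", "content": "DISCHARGE SUMMARY: [72M, CHF admission, discharged on furosemide 80mg, carvedilol 12.5mg BD, sacubitril/valsartan]\n\nQUESTION: What was the primary admitting diagnosis?"},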
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
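
Option B caps `max_seq_length` at 512, the same length used in training, and real discharge summaries can be far longer than the one-line example. The sketch below is a rough pre-flight check, assuming a Hugging Face-style tokenizer that exposes `apply_chat_template` and `encode`. The function names and the 128-token reply budget are illustrative choices, not part of this repo.

```python
def prompt_token_count(tokenizer, messages) -> int:
    """Count tokens in the fully templated prompt."""
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # The chat template already inserts its special tokens, so do not add them again.
    return len(tokenizer.encode(text, add_special_tokens=False))


def fits_context(tokenizer, messages, max_seq_length: int = 512,
                 max_new_tokens: int = 128) -> bool:
    """True if the prompt plus the reply budget stays within max_seq_length."""
    return prompt_token_count(tokenizer, messages) + max_new_tokens <= max_seq_length
```

If `fits_context` returns False, one option is to pass only the sections of the summary relevant to the question rather than letting the input be truncated silently.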

Option C — vLLM (production serving, OpenAI-compatible)

vllm serve unsloth/Qwen2.5-3B-Instruct \
    --enable-lora \
    --lora-modules discharge-qa-qwen25-3b=AmareshHebbar/discharge-qa-qwen25-3b \
    --host 0.0.0.0 --port 8000 --dtype bfloat16

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="discharge-qa-qwen25-3b",
    messages=[
        {"role": "system", "content": "You are a clinical QA assistant. Answer the question based on the discharge summary provided. Be specific and cite relevant details."},
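        {"role": "user", "content": "DISCHARGE SUMMARY: [72M, CHF admission, discharged on furosemide 80mg, carvedilol 12.5mg BD, sacubitril/valsartan]\n\nQUESTION: When is the follow-up appointment scheduled?"},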
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)

Option D — GGUF / llama.cpp (CPU / edge inference)

This repo ships LoRA adapter weights, not a pre-merged GGUF. To run on llama.cpp, merge first:

pip install unsloth
python -c "
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained('AmareshHebbar/discharge-qa-qwen25-3b', load_in_4bit=False)
model.save_pretrained_gguf('discharge-qa-qwen25-3b-gguf', tokenizer, quantization_method='q4_k_m')
"

Training details

Data

Trained on 30,000 QA examples extracted from 30k discharge summaries with structured QA pairs (AGBonnet/augmented-clinical-notes). No synthetic or LLM-generated training data was used: every example pairs a real-world input with its authoritative output.

Split	Rows
Train	24,000
Validation	3,000
Test	3,000
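
The 24,000 / 3,000 / 3,000 split is an 80/10/10 partition of the 30,000 rows. The card does not say how rows were assigned to each split, so the following is only an illustrative sketch of a seeded shuffle split. The seed and the function name `split_rows` are assumptions.

```python
import random


def split_rows(n_rows: int = 30_000, seed: int = 42):
    """Shuffle row indices and cut them 80/10/10 (illustrative, not the author's pipeline)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = n_rows * 8 // 10
    n_val = n_rows // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_rows()
print(len(train), len(val), len(test))  # 24000 3000 3000
```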

Full extraction pipeline documented on the dataset card.

Hyperparameters

Parameter	Value
LoRA rank (r)	16
LoRA alpha	32
LoRA dropout	0
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Quantization	4-bit NF4 (QLoRA)
Max sequence length	512
Optimizer	paged_adamw_8bit
Learning rate, schedule	2e-4, cosine
Gradient checkpointing	Unsloth (smart offload)
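
As a sanity check on the adapter size, the number of trainable LoRA parameters follows from the rank and target modules in the table: each adapted linear layer of shape in x out adds r * (in + out) parameters. The layer dimensions below are assumed from the published Qwen2.5-3B configuration (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128). Verify them against the base model's `config.json` before relying on the result.

```python
R = 16                 # LoRA rank, from the hyperparameter table
HIDDEN = 2048          # assumed: Qwen2.5-3B hidden_size
INTERMEDIATE = 11008   # assumed: Qwen2.5-3B intermediate_size
LAYERS = 36            # assumed: num_hidden_layers
KV_DIM = 2 * 128       # assumed: num_key_value_heads * head_dim

# (in_features, out_features) for each targeted projection in one decoder layer
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters
per_layer = sum(R * (n_in + n_out) for n_in, n_out in modules.values())
total = per_layer * LAYERS

print(total)                       # 29933568
print(round(total * 2 / 1e6, 1))   # 59.9 (MB at 2 bytes per parameter)
```

Under these assumptions the adapter holds about 29.9M parameters, roughly 60 MB at 16-bit precision and about 1% of a 3B-parameter base.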

Training compute


GPU	NVIDIA A6000 (48GB)
Cloud provider	RunPod
Training time	~1.5h (incl. eval + hub push)
Tracking	W&B run
CO2 estimate	self-reported, not measured with a carbon tracker — treat as approximate

Fine-tuned with Unsloth for 2x faster training and reduced VRAM, using TRL's SFTTrainer. Full project: wandb.ai/amareshhebbar-/axiomapper.

Bias, risks & limitations

Data recency. Training data reflects a specific snapshot in time (the dataset's publish date). Drug names, dosing conventions, and documentation practices it reflects may become outdated, so always cross-check answers against the original discharge summary and current authoritative guidance before high-stakes use.

Failure mode. Like any LLM, this model can produce a plausible-sounding but incorrect output, especially on rare, ambiguous, or highly compound real-world cases that fall outside the training distribution. It does not know when it's wrong.

Language. This model expects English-language input only. (The suite's Hindi-medical model is the one exception: it uses Hindi system prompts, though its underlying clinical reasoning data is largely English-sourced.)

Not a regulated medical device. This model has not been validated, cleared, or approved by any regulatory body (FDA, CDSCO, or equivalent) as a medical device or clinical decision support tool. It is a research/engineering artifact.

Misapplication risk. Do not use this model as the sole basis for a clinical, billing, or compliance decision affecting a real patient or claim. Do not deploy in an emergency triage context without a human-in-the-loop and clear escalation paths.
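
Because the model can state details that are not in the document (see "Failure mode" above), a cheap automated check can help route answers to human review. The sketch below is a heuristic only, not a validated safety mechanism: it flags dose strings in the answer that never appear in the summary. The regular expression, the unit list, and the name `unsupported_doses` are all assumptions made for illustration.

```python
import re

# A number (optionally like 12.5 or 24/26) followed by a common dose unit.
DOSE_RE = re.compile(r"\b\d+(?:[./]\d+)*\s?(?:mg|mcg|ml|g|units?)\b", re.IGNORECASE)


def _doses(text: str) -> set[str]:
    # Normalise case and drop whitespace so "80 mg" and "80mg" compare equal.
    return {"".join(m.group(0).lower().split()) for m in DOSE_RE.finditer(text)}


def unsupported_doses(summary: str, answer: str) -> list[str]:
    """Doses mentioned in the answer that never appear in the summary."""
    return sorted(_doses(answer) - _doses(summary))


summary = ("72M, CHF admission, discharged on furosemide 80mg, "
           "carvedilol 12.5mg BD, sacubitril/valsartan")
answer = ("Furosemide 80mg once daily, Carvedilol 12.5mg twice daily, "
          "Sacubitril/Valsartan 24/26mg twice daily.")
print(unsupported_doses(summary, answer))  # ['24/26mg']
```

A non-empty result does not prove the answer is wrong (the abbreviated summary here simply omits the strength), but it is a reasonable trigger for a reviewer to check the source document.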

FAQ

Q: Can I merge the adapter into the base model for faster inference? Yes — use model.merge_and_unload() after loading with PEFT, or use Unsloth's save_pretrained_merged() method.

Q: Why QLoRA instead of full fine-tuning? The base model already has strong language and medical knowledge from pretraining. QLoRA adapts only ~0.5-1% of parameters, which is enough to specialize the output format and domain without the cost or overfitting risk of full fine-tuning.

Q: Can I fine-tune this further on my own data? Yes, this adapter can be used as a starting checkpoint for continued fine-tuning. Note this may require merging first depending on your training framework.

Q: Why is the output format so strict? Each task was trained on a fixed system prompt and consistent output structure. Following the documented system prompt closely (see Quickstart above) gives the most reliable results — deviating from it may produce inconsistent formatting.

Q: Does this model store or transmit my input data? No. Like any open-weight model, all inference happens locally on your own infrastructure (or wherever you deploy it) — nothing is sent back to the model author.

Troubleshooting

Symptom	Likely cause	Fix
`ValueError: padding_token not set`	Base tokenizer has no pad token	Set `tokenizer.pad_token = tokenizer.eos_token` before inference
Garbled / repeated output	Wrong chat template applied	Make sure you use `tokenizer.apply_chat_template`, not a raw string prompt
CUDA OOM on load	Insufficient VRAM	Use `load_in_4bit=True` (as in Option B; Option A loads the base model in bfloat16) or reduce `max_seq_length`
Adapter loads but ignores fine-tuning	Base model mismatch	Confirm you loaded the exact base listed above — adapters are not portable across different base models or quantizations

Related models in this suite

Model	Task	Size
icd10-coder-qwen25-7b	ICD-10-CM medical coding	7B
snomed-mapper-qwen25-7b	Clinical concept mapping	7B
icd10-to-drg-qwen25-1b	ICD-10 to DRG reimbursement	1.5B
pmjay-classifier-qwen25-3b	India PM-JAY classification	3B

Full suite overview: AmareshHebbar/medical-ai-model-suite

Changelog

Version	Date	Notes
v1.0	2026	Initial release — QLoRA fine-tune on 30,000 real-world rows

Citation

@misc{medicalai2026,
  author    = {Hebbar, Amaresh},
  title     = {Medical AI Fine-tuning Suite},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/AmareshHebbar}
}