NYXMed V17: Radiology Medical Coding LLM
Llama-3-70B fine-tune for autonomous CPT, ICD-10, and modifier coding from radiology reports.
V16 (base) · V17 (this model) · V17 Epoch-1 (frozen checkpoint)
TL;DR
V17 is a LoRA adapter trained on top of vineetdaniels/NYXMed-V16-Model (a Llama-3-70B fine-tune). It was trained on 113,032 coder-reviewed radiology cases with a focus on raising ICD-10 accuracy without regressing CPT or modifier performance.
| Metric | V16 (base) | V17 (this) | Δ |
|---|---|---|---|
| CPT exact match | ~85% | 90.6% | +5.6 pts |
| Modifier exact match | ~95% | 97.0% | +2.0 pts |
| Mean ICD recall | ~65% | 83.4% | +18.4 pts |
| Final eval_loss | ~0.25 | 0.0824 | −67% |
| Train examples | ~67K | 113,032 | +69% |
| Adds Exam Description + Reason | No | Yes | n/a |
V17 was trained to push ICD recall above 80% without regressing CPT; both goals were achieved. The full metric breakdown is in Evaluation below.
What V17 is for
V17 takes a radiology report and outputs the billing codes a human coder would assign:
```
Input:  Exam description, reason for exam, full report text, and
        (optionally) retrieval-augmented examples + candidate codes.
Output: CPT[, CPT2], MOD, ICD1, ICD2, ICD3, ...
        e.g. 93970, 26, M79.89, I83.93
```
It is designed to be the LLM core inside an autonomous coding pipeline with retrieval (RAG), post-processing rules, and audit feedback loops, not a standalone end-user model.
Targets the model predicts
- CPT-4 procedure codes (supports multi-code outputs)
- Modifiers: 26 / TC / LT / RT / 50 / 59 / …
- ICD-10-CM diagnosis codes (multi-label, ordered by clinical priority)
Evaluation
Internal evaluation is performed against the live coder-reviewed Supabase dataset using a held-out validation split of 5,950 records.
Training-time eval_loss curve (held-out 250-sample slice)
| Epoch | Step | eval_loss |
|---|---|---|
| 0.03 | 100 | (baseline) |
| 0.42 | 1,500 | 0.144 |
| 0.85 | 3,000 | 0.103 |
| 0.99 | 3,500 | 0.0875 |
| 1.02 | 3,600 (best) | 0.0824 |
| 1.10 | 3,900 (stopped) | 0.0841 |
Early stopping triggered at step 3,900 (1.1 epochs); load_best_model_at_end=True reverted to the step-3,600 checkpoint.
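The stop-and-revert behaviour described above can be sketched in plain Python. This is a minimal illustration, not the training code: the `early_stop` helper and the `patience=1` value are assumptions (the card does not state the patience or the evaluation interval), and only the rows listed in the table are used as the history.

```python
def early_stop(history, patience):
    """Return (best_step, stop_step) for a list of (step, eval_loss) pairs.

    stop_step is None if the patience budget is never exhausted.
    """
    best_step, best_loss = None, float("inf")
    bad_evals = 0
    for step, loss in history:
        if loss < best_loss:
            # New best checkpoint: remember it and reset the patience counter.
            best_step, best_loss = step, loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, None


# Numeric rows from the eval_loss table (the step-100 row has no value listed).
history = [(1500, 0.144), (3000, 0.103), (3500, 0.0875), (3600, 0.0824), (3900, 0.0841)]

# patience=1 is a hypothetical setting chosen only to fit the listed rows.
best_step, stop_step = early_stop(history, patience=1)
print(best_step, stop_step)  # prints: 3600 3900
```

Returning the best step separately from the stop step mirrors what load_best_model_at_end does: training halts at the later step, but the weights that are kept come from the earlier, lower-loss checkpoint.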
Domain-specific accuracy
Measured on n = 500 randomly sampled held-out radiology reports (greedy decoding, batch=4, 4×H200):
| Metric | V17 |
|---|---|
| CPT exact match | 90.60% |
| Primary CPT match | 91.40% |
| Modifier exact match | 97.00% |
| ICD-10 exact match (full set) | 69.60% |
| ICD-10 any-overlap | 90.40% |
| ICD-10 root-overlap (A99.x-level) | 92.20% |
| Mean ICD recall | 83.37% |
| Mean ICD precision | 85.05% |
| All-three exact (CPT + MOD + full ICD set) | 64.00% |
V17's primary training objective, raising ICD recall above 80%, was met (83.37%), while CPT (90.6%) and Modifier (97.0%) far exceeded the no-regression floor. The code-set-overlap metrics show V17 identifying the correct family of ICD codes 92% of the time, with most remaining errors being specificity refinements (e.g. predicting M25.5 instead of M25.511) rather than wrong-diagnosis errors.
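To make the ICD metrics concrete, here is a minimal per-report sketch. The exact definitions used in the evaluation are not given in this card, so the following are assumptions: codes are compared as unordered sets, and the "root" of a code is the category part before the dot. The function names are illustrative only.

```python
def icd_root(code):
    """Category-level root of an ICD-10-CM code (the part before the dot)."""
    return code.split(".")[0]


def icd_metrics(predicted, gold):
    """Per-report ICD metrics, treating both code lists as unordered sets."""
    pred, ref = set(predicted), set(gold)
    hits = pred & ref
    pred_roots = {icd_root(c) for c in pred}
    ref_roots = {icd_root(c) for c in ref}
    return {
        "exact": pred == ref,                       # full-set exact match
        "any_overlap": bool(hits),                  # at least one shared code
        "root_overlap": bool(pred_roots & ref_roots),
        "recall": len(hits) / len(ref) if ref else 0.0,
        "precision": len(hits) / len(pred) if pred else 0.0,
    }


# A specificity miss (M25.5 instead of M25.511) next to one correct code.
m = icd_metrics(["M25.5", "I83.93"], ["M25.511", "I83.93"])
print(m["exact"], m["root_overlap"], m["recall"], m["precision"])
# prints: False True 0.5 0.5
```

Averaging `recall` and `precision` over all reports would give the "mean" variants reported in the table; a report that only misses on specificity still counts toward root-overlap.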
How to use
V17 is published as a LoRA adapter. You need the V16 base model alongside it.
Option A: Transformers + PEFT
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "vineetdaniels/NYXMed-V16-Model"
ADAPTER = "vineetdaniels/NYXMed-V17-Model"

# The tokenizer is loaded from the adapter repo; fall back to EOS for padding.
tokenizer = AutoTokenizer.from_pretrained(ADAPTER, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the V16 base in bfloat16, sharded across the available GPUs.
base = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)

# Attach the V17 LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": "You are an expert radiology coder specializing in ICD-10 and CPT coding for radiology reports.\n\nFollow the coding rules provided in each request carefully."},
    {"role": "user", "content": "<your prompt with few-shot examples + report>"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The Llama-3 chat template already emits the BOS token, so do not add it a second time.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Option B: Merge & deploy with vLLM
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base, apply the adapter, and fold the LoRA weights into the base.
base = AutoModelForCausalLM.from_pretrained("vineetdaniels/NYXMed-V16-Model", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "vineetdaniels/NYXMed-V17-Model").merge_and_unload()

# Save the merged weights and the tokenizer into the same directory for serving.
merged.save_pretrained("./nyxmed-v17-merged")
AutoTokenizer.from_pretrained("vineetdaniels/NYXMed-V17-Model").save_pretrained("./nyxmed-v17-merged")
```
Then serve with vLLM:
```shell
vllm serve ./nyxmed-v17-merged \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 4096
```
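Once the server is up, it can be queried over vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and rests on assumptions not stated in this card: that the server listens on the default `http://localhost:8000`, and that the served model name is the path passed to `vllm serve` (no `--served-model-name` override). `build_payload` and `request_codes` are hypothetical helper names.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are an expert radiology coder specializing in ICD-10 and CPT coding "
    "for radiology reports.\n\nFollow the coding rules provided in each request carefully."
)


def build_payload(user_prompt, model="./nyxmed-v17-merged"):
    """Chat-completions request body using the recommended greedy settings."""
    return {
        "model": model,  # assumed to match the path given to `vllm serve`
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": 64,
        "temperature": 0.0,  # greedy decoding
    }


def request_codes(user_prompt, url="http://localhost:8000/v1/chat/completions"):
    """POST the prompt to a running server and return the generated code line."""
    body = json.dumps(build_payload(user_prompt)).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"].strip()
```

Calling `request_codes("<your prompt with few-shot examples + report>")` requires the server from the previous step to be running; `build_payload` can be inspected on its own without a server.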
Generation settings (recommended)
| Param | Value |
|---|---|
| do_sample | False (greedy) |
| max_new_tokens | 64 |
| temperature | n/a |
| top_p | n/a |
Greedy decoding gives the most reproducible coding output. The model is robust enough that sampling rarely helps.
Prompt format
V17 expects the Llama-3 chat template. The user message should contain (in order):
- Few-shot examples retrieved by RAG (BM25 + FAISS + reranker)
- CPT candidate list (top-K from RAG, ordered)
- ICD-10 candidate list (top-K from RAG, ordered)
- Coding rules (project-specific guardrails)
- The actual report, in one of two formats (V17 was trained on both, ~70/30 split):
  - Explicit: separate Exam Description: and Reason for Exam: lines, then the body.
  - Embedded: report text only, with description/indication inline as in the source.
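The ordering above could be assembled along these lines. This is a sketch under assumptions: the section titles ("Examples", "CPT candidates", and so on) and the `build_user_message` helper are hypothetical, since the card specifies the order of the parts but not the literal headers used in training. Only the `Exam Description:` and `Reason for Exam:` labels come from the card.

```python
def build_user_message(few_shot, cpt_candidates, icd_candidates, rules,
                       report_text, exam_description=None, reason_for_exam=None):
    """Assemble the user message in the documented order.

    If both exam_description and reason_for_exam are given, the explicit
    report format is used; otherwise the report text is passed through as-is
    (the embedded format).
    """
    if exam_description is not None and reason_for_exam is not None:
        report = (
            f"Exam Description: {exam_description}\n"
            f"Reason for Exam: {reason_for_exam}\n"
            f"{report_text}"
        )
    else:
        report = report_text

    # Section titles are placeholders, not the headers used during training.
    sections = [
        ("Examples", "\n\n".join(few_shot)),
        ("CPT candidates", ", ".join(cpt_candidates)),
        ("ICD-10 candidates", ", ".join(icd_candidates)),
        ("Coding rules", "\n".join(rules)),
        ("Report", report),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)
```

The returned string is what would go into the `"content"` field of the user message before the chat template is applied.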
The expected assistant output is a single line:
```
<CPT>[ <CPT2> ...], <MOD>, <ICD1>, <ICD2>, ...
```
An empty modifier slot is allowed (e.g. 74176, , R10.84).
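A downstream pipeline needs to split this line back into fields. The following is a minimal positional parser, assuming the layout shown above: space-separated CPT codes in the first comma-separated field, the modifier in the second, and ICD-10 codes after that. `CodedReport` and `parse_output` are illustrative names, not part of this repo.

```python
from dataclasses import dataclass


@dataclass
class CodedReport:
    cpt: list       # one or more CPT codes
    modifier: str   # "" when the modifier slot is empty
    icd: list       # ICD-10-CM codes in the order the model emitted them


def parse_output(line):
    """Parse '<CPT>[ <CPT2> ...], <MOD>, <ICD1>, <ICD2>, ...' into fields."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < 2:
        raise ValueError(f"expected at least CPT and modifier fields: {line!r}")
    cpt = fields[0].split()
    if not cpt:
        raise ValueError(f"missing CPT code: {line!r}")
    modifier = fields[1]
    icd = [f for f in fields[2:] if f]  # drop empty trailing fields
    return CodedReport(cpt=cpt, modifier=modifier, icd=icd)


print(parse_output("93970, 26, M79.89, I83.93"))
print(parse_output("74176, , R10.84"))
```

The parser only checks shape, not whether the codes exist; a malformed line raises `ValueError` so the caller can route the report to manual review.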
Training details
| Setting | Value |
|---|---|
| Base model | vineetdaniels/NYXMed-V16-Model (Llama-3-70B-Instruct fine-tune) |
| Method | LoRA with DeepSpeed ZeRO-3 |
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | ~417 M of ~70.9 B (0.59%) |
| Train examples | 113,032 |
| Validation examples | 250 (sampled from 5,950-record held-out pool) |
| Sequence length | 2,560 tokens |
| Effective batch size | 32 (per-device 1 × grad accum 8 × 4 GPUs) |
| Optimizer | AdamW + DeepSpeed ZeRO-3 |
| Learning rate | 1e-5 (cosine schedule, 3% warmup) |
| Epochs | 2 (early stopped at 1.10) |
| Total steps | 3,900 |
| Best step | 3,600 (loaded back via load_best_model_at_end) |
| Attention impl. | sdpa (PyTorch scaled_dot_product_attention, which can dispatch to a FlashAttention kernel) |
| Precision | bfloat16 |
| Hardware | 4 × NVIDIA H200 SXM 80GB |
| Wall-clock runtime | 16.95 hours |
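The batch-size and step figures in this table can be cross-checked against the epoch column of the eval_loss table with a few lines of arithmetic. The sketch assumes one optimizer step consumes exactly one effective batch and that no examples are dropped or packed, which the card does not state.

```python
per_device_batch = 1
grad_accum = 8
num_gpus = 4
train_examples = 113_032

# Effective batch size: examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus

# Optimizer steps needed to see every training example once.
steps_per_epoch = train_examples / effective_batch


def epoch_at(step):
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch


print(effective_batch, steps_per_epoch)  # prints: 32 3532.25
for step in (1500, 3000, 3500, 3600, 3900):
    print(step, round(epoch_at(step), 2))
```

Under these assumptions the best checkpoint (step 3,600) falls at about 1.02 epochs and the stopping point (step 3,900) at about 1.10, matching the figures reported above.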
Data composition
| Source | Count | Notes |
|---|---|---|
| Supabase coder-reviewed cases | ~46,000 | Includes 30K+ new records collected after V16 |
| Specificity-correction pairs | ~5,000 | Unspecified → specific ICD upgrades, 3× weighted |
| Hard-case audit set | ~3,000 | Multi-code or modifier-heavy reports |
| V16-era retained set | ~59,000 | Filtered to exclude records V16 already trained on |
A 3-layer self-leakage defense (content hash + cosine similarity + metadata fingerprint) prevented any training record from retrieving itself as a few-shot example during prompt assembly. 108K candidate retrievals were blocked by this filter during training-data preparation.
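A three-layer filter of this kind might look like the sketch below. Everything specific here is an assumption made for illustration: the record fields (`text`, `embedding`, `fingerprint`), the whitespace and case normalization, the SHA-256 hash, and the 0.98 similarity threshold are not taken from the actual pipeline.

```python
import hashlib
import math


def content_hash(text):
    """Hash of the report text after collapsing whitespace and case."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_self_leak(query, candidate, sim_threshold=0.98):
    """True if the candidate looks like the query record itself."""
    if content_hash(query["text"]) == content_hash(candidate["text"]):
        return True  # layer 1: identical content
    if cosine(query["embedding"], candidate["embedding"]) >= sim_threshold:
        return True  # layer 2: near-duplicate embedding
    if query["fingerprint"] == candidate["fingerprint"]:
        return True  # layer 3: same metadata fingerprint
    return False


def filter_examples(query, candidates):
    """Keep only retrieved examples that pass all three leakage checks."""
    return [c for c in candidates if not is_self_leak(query, c)]
```

Applying `filter_examples` to each retrieval result before prompt assembly keeps a record from appearing as its own few-shot example, even when the text was lightly reformatted or re-ingested under different metadata.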
Intended use & limitations
Intended use
- Augmenting human radiology coders in a review-then-accept workflow.
- Pre-coding reports for a downstream audit / verification pipeline.
- Research on LLM-based medical coding.
Out of scope
- Direct billing without human review.
- Non-radiology specialties (cardiology, pathology, etc.). The training data is radiology-only.
- ICD-10 codes outside the radiology-relevant subset are under-represented.
Known limitations
- Long reports (> 2,560 tokens) are truncated during inference; performance on extreme outliers may degrade.
- Rare CPT/ICD combinations appear infrequently in training and remain harder cases.
- The model is English-only.
- Outputs are deterministic with greedy decoding, but the model can still produce hallucinated codes; production deployments must include code-validity checks against the official CMS code sets.
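A code-validity check of the kind called for in the last item can be sketched as a format test followed by a lookup. The regular expressions below are loose shape checks chosen for illustration, and `valid_cpt` / `valid_icd` stand in for code sets loaded from an official release; none of this is the pipeline's actual validator.

```python
import re

# Loose shape checks: five characters for CPT (four digits plus a digit or
# letter), and letter + digit + alphanumeric with an optional dotted
# extension for ICD-10-CM.
CPT_PATTERN = re.compile(r"^\d{4}[0-9A-Z]$")
ICD_PATTERN = re.compile(r"^[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")


def validate_codes(cpt, icd, valid_cpt, valid_icd):
    """Return a list of (code, problem) pairs; an empty list means all codes pass."""
    problems = []
    for code in cpt:
        if not CPT_PATTERN.match(code):
            problems.append((code, "malformed CPT"))
        elif code not in valid_cpt:
            problems.append((code, "unknown CPT"))
    for code in icd:
        if not ICD_PATTERN.match(code):
            problems.append((code, "malformed ICD-10-CM"))
        elif code not in valid_icd:
            problems.append((code, "unknown ICD-10-CM"))
    return problems


# Tiny placeholder code sets; a real deployment would load the full lists.
valid_cpt = {"74176", "93970"}
valid_icd = {"R10.84", "M79.89", "I83.93"}
print(validate_codes(["74176"], ["R10.84"], valid_cpt, valid_icd))  # prints: []
```

Any non-empty result is a signal to route the report to a human coder rather than accept the model output.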
Bias & safety
This is a clinical decision-support model. It must not be used to make autonomous billing or treatment decisions without review by a credentialed coder or clinician. The training data is sourced from a single organization's coder-reviewed dataset and may carry institutional coding preferences.
Recovery / Checkpoints
If a deployment ever needs to roll back, the following snapshots are available on the Hub:
| Checkpoint | Where | Notes |
|---|---|---|
| V17 final (best step 3,600) | this repo | eval_loss = 0.0824 |
| V17 Epoch-1 (step 3,500) | vineetdaniels/NYXMed-V17-Epoch1 | eval_loss = 0.0875, frozen for safety |
| V16 (base) | vineetdaniels/NYXMed-V16-Model | Required to load this adapter |
Adapter weights (adapter_model.safetensors) are 1.66 GB. Full training history is available in training_metrics.json and TensorBoard logs in this repo under logs/.
Acknowledgements
Built on Meta's Llama-3 via Hugging Face's transformers, peft, accelerate, and deepspeed libraries.