chest2err — Sentence-grounded Error Score for Chest CT Reports

chest2err is a sentence-grounded autoregressive evaluator that, given a (reference, candidate) chest CT report pair, outputs a single chest2err-score ∈ (0, 1], where higher is better. The score maps directly to an error count: 1.00 means no errors were detected, 0.72 means one error, 0.37 means three errors, and 0.19 or lower means five or more.

The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's (category, anatomy) and points back at the specific reference sentence and candidate sentence that triggered it, so the score comes with built-in explanations.

Built on the chest2vec/chest2vec_0.6b backbone with LoRA fine-tuning + a 4-layer Transformer decoder. All backbone and decoder weights are bundled in this repository — no further downloads are required at inference time.

Evaluation benchmark: chest2vec/chest2error-bench (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).

The chest2err-score

chest2err_score = exp(−K_total / τ)        # τ = 3.0 (default)

where K_total is the total number of error tuples emitted by the decoder and τ is a display temperature (score_temperature in chest2err_config.json).

| chest2err-score | K_total | interpretation |
|---|---|---|
| 1.00 | 0 | perfect (no errors detected) |
| 0.72 | 1 | one error |
| 0.51 | 2 | two errors |
| 0.37 | 3 | substantial errors |
| ≤ 0.19 | ≥ 5 | severely degraded |

Higher = better. Drop-in replacement for GREEN-score / RadCliQ / BERTScore as a single-number quality signal in (0, 1].

The temperature τ only rescales the displayed number for human readability — a single error no longer collapses the score. Set τ=1.0 to recover the original exp(−K_total) scale (1 → 0.37, 2 → 0.14). Because exp(−K_total/τ) is a strictly monotone function of K_total for any τ>0, the score is rank-equivalent to −K_total, so all Kendall τ_b benchmarks transfer unchanged from the count form regardless of τ.
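The mapping from error count to score, and the claim that τ does not change rankings, can be checked with a few lines of Python. This is a minimal sketch: the function names `chest2err_score_from_count` and `rank_by_score` are hypothetical helpers for illustration, not part of the shipped `chest2err` module.

```python
import math

def chest2err_score_from_count(k_total: int, tau: float = 3.0) -> float:
    """Map an error count to a score in (0, 1] via exp(-K_total / tau)."""
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return math.exp(-k_total / tau)

def rank_by_score(counts, tau):
    """Indices of `counts`, ordered from best (highest score) to worst."""
    return sorted(range(len(counts)),
                  key=lambda i: -chest2err_score_from_count(counts[i], tau))

counts = [0, 1, 2, 3, 5]
default_scale = [round(chest2err_score_from_count(k), 2) for k in counts]
original_scale = [round(chest2err_score_from_count(k, tau=1.0), 2) for k in counts]
print(default_scale)   # [1.0, 0.72, 0.51, 0.37, 0.19]
print(original_scale)  # [1.0, 0.37, 0.14, 0.05, 0.01]

# The ordering of candidates is identical for any tau > 0.
print(rank_by_score([3, 0, 5, 1], tau=3.0) == rank_by_score([3, 0, 5, 1], tau=1.0))  # True
```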

Headline metrics

Evaluated on the 400-pair chest2error-bench gold set:

| metric | value |
|---|---|
| Kendall τ_b vs total errors | +0.665 |
| Kendall τ_b vs Critical errors (radiologist labels) | +0.763 |
| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
| Pairwise within-anchor accuracy | 0.958 (n=1020) |
| Critical-error AUROC | 0.963 |
| MAE of K_total | 1.12 |
| chest2err-score on GT-S ↔ GT-U equivalence pairs | 1.00 ± 0.00 (perfect content-equivalence recognition) |

The τ_b numbers against Critical / severity-weighted errors use the radiologist's severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted K_total correlates strongly with the human Critical-error count even without an explicit severity head.
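For readers who want to reproduce this kind of correlation on their own data, the sketch below implements the standard Kendall τ_b statistic in plain Python. The function name `kendall_tau_b` and the toy counts are illustrative assumptions; this is not the benchmark's evaluation script, and the numbers are not drawn from chest2error-bench.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C/D are concordant/discordant pairs; Tx/Ty are pairs tied only in x / only in y.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Toy data (made up): predicted K_total vs. gold error counts for five pairs.
predicted_k = [0, 1, 1, 3, 4]
gold_k = [0, 1, 2, 2, 5]
tau_counts = kendall_tau_b(predicted_k, gold_k)

# Same statistic when the score exp(-K/3) is compared against negated gold counts.
scores = [math.exp(-k / 3.0) for k in predicted_k]
tau_scores = kendall_tau_b(scores, [-g for g in gold_k])
print(round(tau_counts, 3), round(tau_scores, 3))  # 0.889 0.889
```

Because the score is a strictly decreasing function of K_total, correlating scores with negated gold counts gives the same τ_b as correlating the raw counts.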

For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by ≥ +0.23 τ_b.

CXR/CT generalization

| corpus | τ_b vs Critical |
|---|---|
| ReXVal (CXR, n=200) | +0.682 |
| Chest CT (this benchmark, n=400) | +0.763 |

Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that gains on CT — because it was trained on CT.

Architecture

| component | spec |
|---|---|
| Backbone | chest2vec/chest2vec_0.6b (596 M params, bf16), fully merged into this repo |
| chest2err LoRA | rank 32, α 64, dropout 0.05; merged into the backbone weights shipped here |
| Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
| Output tuple | (cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx) |
| Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |

The decoder is cross-attended over the concatenated reference + candidate sentence-pool memory M. At each step it predicts a tuple where cat = 0 is the EOS token. Counts emerge as len(seq) − 1.
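The stopping rule can be illustrated with a small decoding loop. This is a sketch under stated assumptions: `step_fn` stands in for one decoder step (the real model call is not shown), greedy decoding is assumed, the cap is applied to the number of error tuples, and the concept ids in the scripted example are made up.

```python
from typing import Callable, Dict, List

EOS_CAT = 0            # cat == 0 marks end of sequence
MAX_DECODE_STEPS = 24  # hard cap from the architecture table

ErrorTuple = Dict[str, int]

def decode_errors(step_fn: Callable[[List[ErrorTuple]], ErrorTuple],
                  max_steps: int = MAX_DECODE_STEPS) -> List[ErrorTuple]:
    """Collect error tuples until the decoder emits EOS or the cap is hit."""
    errors: List[ErrorTuple] = []
    for _ in range(max_steps):
        tup = step_fn(errors)      # hypothetical: predict next tuple given the prefix
        if tup["cat"] == EOS_CAT:
            break
        errors.append(tup)
    return errors

# Scripted stand-in for the decoder: two errors, then EOS.
_SCRIPT = [
    {"cat": 1, "anat": 0, "concept": 7, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 2, "anat": 1, "concept": 3, "ref_seg_idx": 1, "cand_seg_idx": -1},
    {"cat": 0, "anat": 0, "concept": 0, "ref_seg_idx": -1, "cand_seg_idx": -1},
]

def scripted_step(prefix: List[ErrorTuple]) -> ErrorTuple:
    return _SCRIPT[len(prefix)]

errors = decode_errors(scripted_step)
k_total = len(errors)  # equals len(full sequence including EOS) - 1
print(k_total)  # 2
```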

Mean-pooling sentences before the decoder makes the encoder paraphrase-robust (inherits chest2vec's contrastive properties) and the decoder permutation-invariant with respect to sentence order.
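As a rough illustration of how the sentence-pool memory M could be assembled, the sketch below mean-pools token vectors per sentence and prepends a NULL vector on each side. It uses plain Python lists for clarity; the function names are hypothetical, and the real model operates on backbone hidden states with learned NULL_REF and NULL_CAND embeddings.

```python
from typing import List, Sequence

Vector = List[float]

def mean_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Average the token vectors of one sentence into a single vector."""
    n = len(token_vectors)
    if n == 0:
        raise ValueError("a sentence needs at least one token vector")
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]

def build_side_memory(sentences: Sequence[Sequence[Vector]],
                      null_vector: Vector) -> List[Vector]:
    """One side of the memory: [NULL, sentence_0, sentence_1, ...]."""
    return [list(null_vector)] + [mean_pool(s) for s in sentences]

def build_memory(ref_sentences, cand_sentences, null_ref, null_cand) -> List[Vector]:
    """Concatenate the reference side and the candidate side."""
    return (build_side_memory(ref_sentences, null_ref)
            + build_side_memory(cand_sentences, null_cand))

# Toy 2-dimensional example: two reference sentences, one candidate sentence.
ref = [[[1.0, 3.0], [3.0, 5.0]], [[0.0, 2.0]]]
cand = [[[4.0, 4.0]]]
memory = build_memory(ref, cand, null_ref=[0.0, 0.0], null_cand=[9.0, 9.0])
print(len(memory))  # 5 = (1 NULL + 2 sentences) + (1 NULL + 1 sentence)
```

With this layout, a pointer value of -1 naturally refers to the NULL slot and a value of i refers to slot i + 1 on that side.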

Files

| file | size | purpose |
|---|---|---|
| model.safetensors | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
| config.json | <1 KB | backbone architecture config |
| decoder.safetensors | ~207 MB | decoder + null embeddings + heads |
| chest2err_modeling.py | 14 KB | decoder architecture (the CADAD class) |
| chest2err.py | 6 KB | self-contained loader (chest2err_score, chest2err_detail) |
| chest2err_config.json | <1 KB | chest2err model meta-config |
| tokenizer.json, vocab.json, etc. | ~14 MB | tokenizer files |

Total: ~1.36 GB. Everything required to run chest2err is in this repository.

Quick start

from chest2err import chest2err_score, chest2err_detail

ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
cand = "[Lungs] Several pulmonary nodules in the left upper lobe."

score = chest2err_score(ref, cand)
# 0.37 — substantial errors (K_total = 3, τ = 3.0)

detail = chest2err_detail(ref, cand)
# detail["score"]           — chest2err-score in (0, 1]
# detail["K_total"]         — integer total error count
# detail["tuples"]          — list of {cat, anat, ref_seg_idx, cand_seg_idx, …}
# detail["category_counts"] — per-category breakdown
# detail["anatomy_counts"]  — per-anatomy breakdown

The loader picks up the bundled weights automatically; no extra setup beyond pip install transformers torch peft safetensors is needed.
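To score a whole test set, one option is to average the per-pair scores. The helper below is a hypothetical sketch, not part of the `chest2err` module. It takes the scoring function as an argument so it can be tried without loading the model, and the choice of a plain mean as the corpus-level aggregate is an assumption.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple

def corpus_score(pairs: Iterable[Tuple[str, str]],
                 scorer: Callable[[str, str], float]) -> float:
    """Mean per-pair score over (reference, candidate) pairs."""
    scores = [scorer(ref, cand) for ref, cand in pairs]
    if not scores:
        raise ValueError("no pairs to score")
    return mean(scores)

# With the real model (requires the bundled weights):
#   from chest2err import chest2err_score
#   corpus_score(my_pairs, chest2err_score)

# Stand-in scorer for a dry run: exact match scores 1.0, anything else 0.5.
def toy_scorer(ref: str, cand: str) -> float:
    return 1.0 if ref == cand else 0.5

toy_pairs = [("No effusion.", "No effusion."), ("No effusion.", "Small effusion.")]
print(corpus_score(toy_pairs, toy_scorer))  # 0.75
```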

Output schema

The primary output is the chest2err-score ∈ (0, 1] (computed from exp(−K_total / τ) with τ = 3.0 as above). The score is backed by a sequence of structured error tuples:

{
    "cat":          int,  # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
    "anat":         int,  # 0..8 (Lungs & Airways, Pleura, ... Others)
    "concept":      int,  # leaf concept id (clinical finding vocabulary)
    "ref_seg_idx":  int,  # -1 = NULL_REF, otherwise sentence index in reference report
    "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
}

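cat == 0 is the EOS marker; the model stops when it emits it. K_total is the number of error tuples emitted before EOS, i.e. len(seq) − 1 when the decoded sequence seq includes the EOS step, and chest2err_score = exp(−K_total / τ) with τ = 3.0.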
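The pointer fields make it straightforward to turn tuples into readable explanations. The sketch below assumes that the caller already has each report split into sentences, that the tuple list contains only error tuples (no EOS entry), and that category ids follow the order listed in the schema comment. The helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Assumption: ids 1..5 follow the order given in the schema comment.
CATEGORY_NAMES = {
    1: "false_prediction",
    2: "omission",
    3: "location",
    4: "severity",
    5: "comparison",
}

def resolve_sentence(sentences: List[str], idx: int, null_label: str) -> str:
    """Map a pointer to its sentence; -1 means the NULL slot."""
    return null_label if idx == -1 else sentences[idx]

def explain(tuples: List[Dict[str, int]],
            ref_sentences: List[str],
            cand_sentences: List[str]) -> List[str]:
    """One human-readable line per error tuple."""
    lines = []
    for t in tuples:
        name = CATEGORY_NAMES.get(t["cat"], f"cat_{t['cat']}")
        ref = resolve_sentence(ref_sentences, t["ref_seg_idx"], "<NULL_REF>")
        cand = resolve_sentence(cand_sentences, t["cand_seg_idx"], "<NULL_CAND>")
        lines.append(f"{name}: ref={ref!r} cand={cand!r}")
    return lines

def category_counts(tuples: List[Dict[str, int]]) -> Dict[str, int]:
    """Per-category breakdown of the emitted errors."""
    return dict(Counter(CATEGORY_NAMES.get(t["cat"], f"cat_{t['cat']}") for t in tuples))

# Illustrative tuples (hand-written, not model output).
ref_sents = ["No pulmonary nodules.", "No effusion."]
cand_sents = ["Several pulmonary nodules in the left upper lobe."]
example_tuples = [
    {"cat": 1, "anat": 0, "concept": 0, "ref_seg_idx": 0, "cand_seg_idx": 0},
    {"cat": 2, "anat": 1, "concept": 0, "ref_seg_idx": 1, "cand_seg_idx": -1},
]
for line in explain(example_tuples, ref_sents, cand_sents):
    print(line)
```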

Training data

Trained on chest2vec/chest2err-train (in preparation): 53,881 (reference, candidate, labeled_errors) triples spanning 4 candidate styles (V1-V4) + a V5 high-error supplement. Validation: the 200-variant slice of chest2vec/chest2error-bench (radiologist gold).

Variant generation (LLM-injected errors)

Reference reports are sourced from the CT-RATE chest CT corpus. For each reference report we prompted GPT-4o-mini to produce four candidate variants that deliberately insert a controlled number of errors drawn from the ReXVal 6-category error taxonomy. The LLM was instructed to also output, for every inserted error, a structured label:

  • error category (1–6, ReXVal taxonomy: false_prediction, omission, location, severity, spurious_comparison, omitted_comparison)
  • anatomy section (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
  • target finding concept (leaf finding from the chest CT vocabulary)

Each training example is therefore a (reference, candidate, [per-error (category, anatomy, concept) triples]) record. The model is supervised to reproduce this structured error trace given only the (reference, candidate) input.
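One way to picture such a record is as a small data class. The sketch below is an assumption about layout, not the actual chest2err-train schema. In particular, the mapping from the 6-way generation taxonomy to the 5-way output taxonomy (folding both comparison categories into one) is inferred from the category lists in this card.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed merge: spurious_comparison (5) and omitted_comparison (6) -> comparison (5).
SIX_TO_FIVE = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5}

def merge_category(six_way: int) -> int:
    """Map a 6-category label to the 5-category output space."""
    return SIX_TO_FIVE[six_way]

@dataclass(frozen=True)
class ErrorLabel:
    category: int  # 1..5 after merging
    anatomy: int   # 0..8
    concept: int   # leaf concept id

@dataclass
class TrainingExample:
    reference: str
    candidate: str
    errors: List[ErrorLabel] = field(default_factory=list)

    @property
    def k_total(self) -> int:
        """Number of labeled errors in this example."""
        return len(self.errors)

# Hand-written example; the concept id is a placeholder.
example = TrainingExample(
    reference="[Pleura] No effusion.",
    candidate="[Pleura] Effusion unchanged from prior.",
    errors=[ErrorLabel(category=merge_category(6), anatomy=1, concept=0)],
)
print(example.k_total, example.errors[0].category)  # 1 5
```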

Training objective

Supervised teacher-forced training on the LLM-labeled error sequences:

  • Per-step token losses on (category, anatomy, concept) heads at each decoder step
  • Pointer losses on ref_seg_idx and cand_seg_idx (which sentence each error refers to)

Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.
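The objective above can be sketched as a sum of cross-entropy terms, one per head, at every decoder step. The following is a plain-Python illustration under several assumptions that the card does not specify: equal weighting of the five heads, averaging over steps, and pointer targets encoded as idx + 1 so that the NULL slot (-1) sits at position 0. The head names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

HEADS = ("cat", "anat", "concept", "ref_ptr", "cand_ptr")

def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def pointer_target(seg_idx: int) -> int:
    """Assumed encoding: -1 (NULL) -> 0, sentence i -> i + 1."""
    return seg_idx + 1

def step_loss(logits: Dict[str, Sequence[float]], targets: Dict[str, int]) -> float:
    """Sum of the token losses and pointer losses for one decoder step."""
    return sum(cross_entropy(logits[h], targets[h]) for h in HEADS)

def sequence_loss(all_logits: List[Dict[str, Sequence[float]]],
                  all_targets: List[Dict[str, int]]) -> float:
    """Teacher-forced loss averaged over decoder steps."""
    if not all_targets:
        raise ValueError("need at least one step")
    return sum(step_loss(l, t) for l, t in zip(all_logits, all_targets)) / len(all_targets)

# Uniform logits give log(num_classes) per head, a handy sanity check.
uniform = {"cat": [0.0] * 6, "anat": [0.0] * 9, "concept": [0.0] * 4,
           "ref_ptr": [0.0] * 3, "cand_ptr": [0.0] * 2}
target = {"cat": 2, "anat": 1, "concept": 0,
          "ref_ptr": pointer_target(1), "cand_ptr": pointer_target(-1)}
loss = sequence_loss([uniform], [target])
expected = sum(math.log(n) for n in (6, 9, 4, 3, 2))
print(abs(loss - expected) < 1e-9)  # True
```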

Why this works

  • GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us noiseless K at training time.
  • The radiologist gold benchmark (chest2error-bench) shows that learning on LLM-injected errors transfers to human-labeled errors at deployment with τ_b vs Critical = +0.763.
  • Sentence-grounded pointer supervision (which ref and cand sentences are responsible for each error) is what makes the model interpretable — every emitted error tuple cites its source sentences.

Limitations

  • No severity output in v0.1. The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical chest2err_score = exp(−K_total / τ) (τ = 3.0) treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
  • Reference dependence. chest2err is a paired metric. It cannot evaluate a candidate against no reference.
  • English only. Trained on English chest CT reports from CT-RATE.
  • Chest CT only. Cross-domain performance (e.g. abdominal CT) is not validated.
  • 24-error hard cap. Reports with > 24 errors are clipped (rare; max observed in gold = 17).
  • Single-radiologist gold. Inter-rater calibration is in progress.

Citations

If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:

@misc{rexval2023,
  title     = {{ReXVal}: Radiologist-Verified Evaluation of Automated Radiology Report Metrics},
  author    = {Yu, F. and Endo, M. and Krishnan, R. and others},
  year      = {2023},
  publisher = {PhysioNet},
  url       = {https://physionet.org/content/rexval-dataset/1.0.0/}
}

@misc{hamamci2024ctrate,
  title         = {A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities},
  author        = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
  year          = {2024},
  eprint        = {2403.17834},
  archivePrefix = {arXiv},
  url           = {https://huggingface.co/datasets/ibrahimhamamci/CT-RATE}
}

@misc{chest2err2026,
  title  = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
  author = {chest2vec contributors},
  year   = {2026},
  url    = {https://huggingface.co/chest2vec/chest2err}
}

Related

  • Backbone: chest2vec/chest2vec_0.6b — the chest2vec encoder this model is built on
  • Eval benchmark: chest2vec/chest2error-bench — radiologist-labeled 400-pair gold set
  • CXR analogue (taxonomy basis): ReXVal — Radiologist-Verified Evaluation, chest X-ray (n=200)
  • Source of reference reports: CT-RATE — chest CT volumes + radiology reports corpus

License

CC-BY-NC-4.0. Released for research use.
