Model Card for Pneumonia Chest X-Ray Classifier

A set of deep-learning models that classify pediatric chest X-rays for pneumonia. Two tasks are covered:

  • Binary: NORMAL vs PNEUMONIA.
  • Three-class: NORMAL vs BACTERIAL vs VIRAL pneumonia.

Models are ImageNet-pretrained ResNet18 and DenseNet121 backbones fine-tuned on the Kermany chest X-ray dataset, plus a from-scratch custom CNN baseline. The repository also includes a hierarchical (two-stage) bacteria-vs-virus classifier. DenseNet121 is the recommended single model; a ResNet18 + DenseNet121 ensemble is the strongest predictor.

Model Details

Model Description

The models take a single chest X-ray (anterior–posterior view) and output class probabilities: one pneumonia probability for the binary task, or three class probabilities for the three-class task. Input images are resized to 224×224, converted to 3 channels, and normalised with ImageNet statistics. Transfer-learning backbones are trained in two phases (head-only, then full fine-tuning with early stopping); each model is trained with three seeds (0, 1, 2) so that results can be reported as mean ± std across runs.

  • Developed by: Luana Carolina Reis and Jakub Błaszczyk (University of Aveiro, Portugal; Lodz University of Technology, Poland)
  • Funded by: Not funded (academic course project)
  • Shared by: luanacarolina
  • Model type: Image classification (CNN); binary and 3-class chest-X-ray pneumonia detection
  • Language(s) (NLP): N/A (image model)
  • License: CC BY 4.0 (weights); dataset under CC BY 4.0
  • Finetuned from model: torchvision ResNet18 and DenseNet121, ImageNet-1k pretrained weights (the custom CNN is trained from scratch)

Uses

Direct Use

Research and educational use: classifying pediatric chest X-rays as NORMAL vs PNEUMONIA (binary) or NORMAL/BACTERIA/VIRUS (three-class), and reproducing the reported results. The accompanying Grad-CAM tooling can be used to visualise which lung regions drive each prediction.

Downstream Use

The checkpoints can serve as a starting point for further fine-tuning on other (e.g. adult or multi-institution) chest X-ray datasets, or as a baseline in pneumonia-detection benchmarks.

Out-of-Scope Use

Not for clinical use. This is a course project, not a medical device, and must not be used for real diagnosis, triage, or treatment decisions. It is trained on single-institution pediatric data and is not validated for adults, other scanners/populations, or imaging modalities other than AP chest X-ray. The three-class viral/bacterial distinction in particular is unreliable (see Limitations).

Bias, Risks, and Limitations

  • Single-institution, pediatric data (Guangzhou Women and Children's Medical Center, ages 1–5). Generalisation to adults or other hospitals/scanners is untested.
  • Class imbalance: the training set has roughly twice as many bacterial images as viral or normal images; we mitigate this with inverse-frequency class weights, but the imbalance still biases model behaviour.
  • Viral pneumonia is the bottleneck: its diffuse interstitial pattern overlaps with both normal lungs and early bacterial infiltrates, so VIRUS has the lowest F1 (0.67 to 0.69 for single models, 0.72 for the ensemble) and drives the dominant NORMAL→VIRUS confusion. Class weighting, ensembling, and the hierarchical design all failed to close this gap.
  • Backbone interpretability differs: with similar metrics, DenseNet121's Grad-CAM heatmaps are anatomically focused, whereas ResNet18 often attends to non-parenchymal structures ("right answer, wrong reason"), a generalisation risk.
  • Modest specificity on the binary task (~0.62 for the ensemble): high pneumonia recall comes at the cost of false alarms on normal lungs.
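
The inverse-frequency class weighting mentioned in the class-imbalance item can be sketched as follows. This is a minimal illustration, not the repository's code: the normalisation w_c = N / (K · n_c) is one common convention and is an assumption here, and the counts are made-up round numbers that only mirror the stated 2:1 ratio.

```python
def inverse_frequency_weights(counts):
    """Return per-class weights w_c = N / (K * n_c).

    N is the total number of images, K the number of classes and n_c the
    number of images in class c, so rarer classes get larger weights.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


# Illustrative counts only (assumption): bacterial is about 2x the other classes.
example_counts = {"NORMAL": 1000, "BACTERIA": 2000, "VIRUS": 1000}
weights = inverse_frequency_weights(example_counts)
print(weights)
```

Weights of this kind would typically be passed, in class-index order, as the weight tensor of torch.nn.CrossEntropyLoss.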

Recommendations

Users (both direct and downstream) should treat outputs as research signals only, prefer DenseNet121 when an auditable explanation is needed and the ensemble when raw predictive performance matters, and validate on their own data before any deployment. Predictions for the VIRUS class should be treated with particular caution.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from torchvision import models, transforms
from huggingface_hub import hf_hub_download
from PIL import Image

# Download the primary binary checkpoint
ckpt = hf_hub_download(
    repo_id="luanacarolina/pneumonia-chest-xray-classifier",
    filename="checkpoints/binary/densenet121.pt",
)

model = models.densenet121(weights=None)
model.classifier = torch.nn.Linear(model.classifier.in_features, 1)  # binary BCE head
state = torch.load(ckpt, map_location="cpu", weights_only=True)
model.load_state_dict(state.get("model_state_dict", state))
model.eval()

tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = tf(Image.open("xray.jpeg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    p_pneumonia = torch.sigmoid(model(img)).item()
print(f"P(pneumonia) = {p_pneumonia:.3f}")

For the three-class models, give the network three output logits (for example num_classes=3), apply softmax instead of sigmoid, and load the checkpoints under checkpoints/three_class/.

Training Details

Training Data

Kermany Chest X-Ray Images (Pneumonia) dataset, distributed on Kaggle by Paul Mooney: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia (5,856 AP pediatric chest X-rays; NORMAL / PNEUMONIA folders, with bacterial/viral encoded in the pneumonia filenames). Original data from Kermany et al. (2018).
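
Because the bacterial/viral label lives in the filename rather than in the folder structure, three-class labels have to be derived from paths. The helper below is a hypothetical sketch, not the repository's loader; it assumes the parent folder is NORMAL or PNEUMONIA and that pneumonia filenames contain the substring "bacteria" or "virus". The example paths are illustrative.

```python
from pathlib import Path


def three_class_label(path):
    """Map a dataset path to NORMAL / BACTERIA / VIRUS (hypothetical helper)."""
    p = Path(path)
    if p.parent.name.upper() == "NORMAL":
        return "NORMAL"
    name = p.name.lower()
    if "bacteria" in name:
        return "BACTERIA"
    if "virus" in name:
        return "VIRUS"
    # Fail loudly rather than guess a subtype for an unexpected filename.
    raise ValueError(f"cannot infer pneumonia subtype from {p.name!r}")


print(three_class_label("train/PNEUMONIA/person1_bacteria_1.jpeg"))
print(three_class_label("train/NORMAL/IM-0115-0001.jpeg"))
```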

Training Procedure

Preprocessing

Resize to 224×224, grayscale→3 channels, ImageNet normalisation. Training augmentation: random horizontal flip and ±10° rotation. For the three-class task we use a patient-aware stratified validation split (whole patients held out, ~12% per class) to avoid the data leakage caused by multiple X-rays per patient and by the tiny 16-image official val/ folder.
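
The idea of a patient-aware split can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's implementation: patient identity is read from a person<ID>_ filename prefix, files without that prefix are treated as one patient each, and the 12% is applied to patients per class rather than to images.

```python
import random
import re
from collections import defaultdict


def patient_id(filename):
    """Return 'person<ID>' when present, else fall back to the filename itself."""
    match = re.match(r"(person\d+)_", filename)
    return match.group(1) if match else filename


def patient_aware_split(samples, val_fraction=0.12, seed=0):
    """Split (filename, label) pairs so that no patient spans train and val."""
    images_by_patient = defaultdict(list)
    for filename, label in samples:
        images_by_patient[patient_id(filename)].append((filename, label))

    # Stratify patients by the label of their first image.
    patients_by_label = defaultdict(list)
    for pid, images in images_by_patient.items():
        patients_by_label[images[0][1]].append(pid)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(patients_by_label):
        patients = sorted(patients_by_label[label])
        rng.shuffle(patients)
        n_val = max(1, round(val_fraction * len(patients)))
        for i, pid in enumerate(patients):
            # All images of a patient go to the same side of the split.
            (val if i < n_val else train).extend(images_by_patient[pid])
    return train, val
```

Splitting by patient rather than by image is what prevents near-duplicate X-rays of the same child from appearing on both sides of the split.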

Training Hyperparameters

  • Training regime: fp32
  • Optimizer: Adam, learning rate 1e-4
  • Batch size: 32
  • Schedule: transfer models use 5 head-only epochs (backbone frozen) then up to 30 fine-tune epochs with early stopping (patience 5); the custom CNN runs up to 20 epochs from scratch
  • Loss: BCE-with-logits (binary); cross-entropy with inverse-frequency class weights (three-class)
  • Checkpoint selection: lowest validation loss (binary); highest validation macro-F1 (three-class)
  • Seeds: 0, 1, 2
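
The two-phase schedule listed above (frozen backbone, then full fine-tuning with early stopping) can be sketched as below. All names here are hypothetical, and several details are assumptions rather than facts from the repository: the same learning rate is used in both phases, early stopping monitors validation loss, and run_epoch is a caller-supplied function that trains for one epoch and returns the validation loss.

```python
import torch
import torch.nn as nn


def set_backbone_trainable(model, head, trainable):
    """Freeze or unfreeze every parameter except those belonging to `head`."""
    head_params = {id(p) for p in head.parameters()}
    for p in model.parameters():
        if id(p) not in head_params:
            p.requires_grad = trainable


class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def fit_two_phase(model, head, run_epoch, head_epochs=5, finetune_epochs=30,
                  patience=5, lr=1e-4):
    """Run head-only training, then full fine-tuning; return the best val loss."""
    # Phase 1: backbone frozen, only the head is updated.
    set_backbone_trainable(model, head, False)
    optimizer = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(head_epochs):
        run_epoch(model, optimizer)

    # Phase 2: everything trainable, stop early when validation loss stalls.
    set_backbone_trainable(model, head, True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    stopper = EarlyStopper(patience)
    for _ in range(finetune_epochs):
        if stopper.step(run_epoch(model, optimizer)):
            break
    return stopper.best
```

For a torchvision DenseNet121 the head would be model.classifier, and for ResNet18 it would be model.fc.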

Speeds, Sizes, Times

Checkpoint sizes: custom CNN ≈1 MB, DenseNet121 ≈84 MB, ResNet18 ≈129 MB. Inference ≈13–16 ms/image on the training GPU.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The dataset's held-out test/ split: 624 images (234 NORMAL, 242 BACTERIA, 148 VIRUS).

Factors

Results are reported per model, per class (NORMAL / BACTERIA / VIRUS), and as mean ± std over three seeds. Pneumonia recall is emphasised because, in screening, a missed pneumonia (false negative) is more costly than a false alarm.

Metrics

Accuracy, precision, recall (sensitivity), specificity, F1 (macro-F1 for three classes), one-vs-rest AUROC, and Expected Calibration Error (ECE). Confusion matrices and Grad-CAM overlays support qualitative analysis.
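
As an illustration of the calibration metric, a plain-Python sketch of Expected Calibration Error follows. The binning scheme is an assumption (10 equal-width confidence bins is a common default); the card does not state which variant was used for the reported values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: top-class probability per sample, each in [0, 1].
    correct: 1 if the top-class prediction was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Each bin contributes its |confidence - accuracy| gap, weighted by size.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that predicts with confidence 0.8 but is right only half the time in that bin contributes a gap of 0.3, which is the kind of overconfidence ECE is meant to expose.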

Results

Binary (NORMAL vs PNEUMONIA), test set, mean over seeds:

Model         Accuracy   F1      AUROC   Recall
Custom CNN    0.756      0.828   0.886   0.914
ResNet18      0.830      0.879   0.949   0.991
DenseNet121   0.841      0.887   0.966   0.997
Ensemble      0.856      0.896   0.963   0.997

The binary ensemble misses only 1 of 390 pneumonia cases (recall 0.997); specificity is 0.62.
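
The ensemble's headline figures can be cross-checked with a small worked example. The 390 pneumonia and 234 normal test images and the single missed case come from this card; the true-negative count of 145 is back-calculated from the reported specificity of 0.62 and is an assumption, not a published confusion matrix.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# TP and FN follow from "misses only 1 of 390"; TN = 145 (so FP = 89) is
# inferred from specificity 0.62 on 234 normal images (assumption).
metrics = binary_metrics(tp=389, fn=1, fp=89, tn=145)
print({name: round(value, 3) for name, value in metrics.items()})
```

Under these inferred counts the accuracy and F1 round to 0.856 and 0.896, consistent with the ensemble row of the table, and about 89 of the 234 normal images would be flagged as pneumonia.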

Three-class (NORMAL / BACTERIA / VIRUS), test set, patient-aware split:

Model                        Accuracy   Macro-F1   AUROC   F1 (NORMAL / BACTERIA / VIRUS)
ResNet18                     0.769      0.751      0.944   0.69 / 0.90 / 0.67
DenseNet121                  0.799      0.784      0.944   0.78 / 0.89 / 0.68
DenseNet121 (hierarchical)   0.790      0.774      0.944   0.73 / 0.90 / 0.69
Ensemble                     0.822      0.808      0.960   0.78 / 0.92 / 0.72

The hierarchical two-stage design did not beat the flat softmax (macro-F1 0.774 vs 0.784 for DenseNet121), which suggests that the viral/bacterial boundary is intrinsic to the data rather than a modelling artefact. Calibration: three-class ECE is ≈ 0.13 to 0.15 for single models and ≈ 0.05 for the ensemble.

Summary

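Transfer learning clearly beats training from scratch on the binary task (accuracy 0.830 to 0.841 vs 0.756 for the custom CNN). DenseNet121 is the best single model and the most interpretable; the ResNet18 + DenseNet121 ensemble has the highest accuracy and F1 on both tasks, although DenseNet121 alone has a marginally higher binary AUROC (0.966 vs 0.963). Viral pneumonia is the hardest class throughout.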
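
This card does not state how the ensemble combines its members. One common choice, shown here purely as an assumption, is an unweighted average of the per-model class probabilities; the probability vectors and class order below are made up for illustration.

```python
def average_probabilities(per_model_probs):
    """Unweighted mean of class-probability vectors from several models."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[c] for p in per_model_probs) / n_models for c in range(n_classes)]


CLASS_NAMES = ["NORMAL", "BACTERIA", "VIRUS"]  # assumed order

# Made-up softmax outputs for a single image.
resnet_probs = [0.30, 0.10, 0.60]
densenet_probs = [0.50, 0.10, 0.40]

ensemble = average_probabilities([resnet_probs, densenet_probs])
prediction = CLASS_NAMES[max(range(len(ensemble)), key=ensemble.__getitem__)]
print(ensemble, prediction)
```

Averaging probabilities rather than hard votes lets a confident model outweigh an uncertain one, which is one plausible reason an ensemble can also be better calibrated than its members.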

Model Examination

Grad-CAM was used on all four binary outcome types (TP/TN/FP/FN) and on the three-class outcomes. DenseNet121's heatmaps concentrate on clinically relevant lung regions (e.g. the affected lobe for bacterial consolidation), making its decisions auditable; ResNet18's are more often diffuse or centred on non-parenchymal structures even when the prediction is correct.
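
For readers unfamiliar with the technique, a generic Grad-CAM sketch using PyTorch hooks is given below. It is not the repository's tooling: the function name and the toy network are hypothetical, and real backbones whose target layer is followed by an in-place activation may need extra care when choosing the layer to hook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def grad_cam(model, target_layer, image, class_idx):
    """Return an (H, W) heatmap in [0, 1] for one image of shape (1, C, H, W)."""
    store = {}

    def forward_hook(_module, _inputs, output):
        store["activation"] = output
        # Capture d(logit)/d(activation) during the backward pass.
        output.register_hook(lambda grad: store.update(gradient=grad))

    handle = target_layer.register_forward_hook(forward_hook)
    try:
        model.zero_grad()
        logits = model(image)
        logits[0, class_idx].backward()
    finally:
        handle.remove()

    # Channel weights are the spatially averaged gradients.
    weights = store["gradient"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * store["activation"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam - cam.min()
    return (cam / (cam.max() + 1e-8))[0, 0].detach()


# Toy network and random input, only to show the call shape.
toy = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 1),
).eval()
heatmap = grad_cam(toy, toy[0], torch.randn(1, 3, 32, 32), class_idx=0)
print(heatmap.shape)
```

With a single-logit binary head, class_idx=0 selects the pneumonia logit; for the three-class head it would be the index of the class being explained.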

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA RTX 3060 Laptop GPU
  • Hours used: ~5 hours total across all training runs (binary + three-class + hierarchical, 3 seeds each)
  • Cloud Provider: None (local machine)
  • Compute Region: Portugal
  • Carbon Emitted: Not formally measured; low (single consumer GPU, a few GPU-hours)

Technical Specifications

Model Architecture and Objective

ImageNet-pretrained ResNet18 (residual connections) and DenseNet121 (dense connectivity), with the final layer replaced by a task-specific head: a single-logit BCE head for the binary task and a softmax head (3 classes, or 2 for the hierarchical bacteria-vs-virus stage). A small custom CNN serves as a from-scratch baseline.
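
One way the two hierarchical stages could be combined into three-class probabilities is sketched below. This is an assumption made for illustration: the card does not say whether the repository multiplies stage probabilities as shown here or gates on a hard stage-one decision, and the function name is hypothetical.

```python
def hierarchical_probs(p_pneumonia, p_bacteria_given_pneumonia):
    """Combine a binary stage and a bacteria-vs-virus stage by the chain rule.

    p_pneumonia: stage-one P(PNEUMONIA), e.g. the sigmoid of the binary logit.
    p_bacteria_given_pneumonia: stage-two softmax P(BACTERIA | PNEUMONIA).
    """
    return {
        "NORMAL": 1.0 - p_pneumonia,
        "BACTERIA": p_pneumonia * p_bacteria_given_pneumonia,
        "VIRUS": p_pneumonia * (1.0 - p_bacteria_given_pneumonia),
    }


combined = hierarchical_probs(p_pneumonia=0.8, p_bacteria_given_pneumonia=0.75)
print(combined)
```

By construction the three values sum to one, so the result can be scored with the same three-class metrics as the flat softmax models.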

Compute Infrastructure

Hardware

NVIDIA RTX 3060 Laptop GPU.

Software

Python, PyTorch, torchvision, scikit-learn; weights distributed via huggingface_hub.

Citation

BibTeX:

@article{kermany2018,
  author  = {Kermany, Daniel S. and Goldbaum, Michael and Cai, Wenjia and others},
  title   = {Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning},
  journal = {Cell},
  volume  = {172},
  number  = {5},
  pages   = {1122--1131},
  year    = {2018},
  doi     = {10.1016/j.cell.2018.02.010}
}

APA:

Kermany, D. S., Goldbaum, M., Cai, W., et al. (2018). Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell, 172(5), 1122–1131. https://doi.org/10.1016/j.cell.2018.02.010

Glossary

  • Recall / sensitivity: fraction of true pneumonia cases the model flags; emphasised here for screening.
  • AUROC: area under the ROC curve; threshold-independent ranking quality.
  • ECE (Expected Calibration Error): average gap between predicted confidence and observed accuracy; lower is better.
  • Grad-CAM: gradient-based heatmaps showing which image regions drove a prediction.

More Information

The three notebooks (00_dataset_check, 01_final_experiments, 02_innovation_results) document the dataset exploration, binary results, and three-class / hierarchical / calibration analyses. The run_binary.ps1 and run_innovation.ps1 pipelines reproduce all training and evaluation.

Model Card Authors

Luana Carolina Reis, Jakub Błaszczyk (University of Aveiro, Portugal; Lodz University of Technology, Poland).

Model Card Contact

luanacarolina@ua.pt · jakub.blaszczyk@ua.pt
