Fundus Lesion Image Classification — 9-Model Comparative Benchmark

Companion artifact for the Master's thesis "Classification of Fundus Lesion Images Using Deep Learning Models" (Xidian University, 2026), by Daryl Panashe Katiyo.

Reproducible weights, predictions, and full statistical analysis for nine deep-learning backbones evaluated on a 10-class colour-fundus dataset with a group-aware (perceptual-hash) 5-fold cross-validation protocol.


Important Note on Tables

Two result tables appear in this README. They measure different things and must not be mixed:

| Table | Protocol | # runs | Use for |
|---|---|---|---|
| § 5.1: 5-fold CV (authoritative) | 5 independent train/val splits, fixed holdout test | 5 | Paper, model comparison, all citations |
| § 5.2: Single-run baseline | One training run, fold-0 split | 1 | Legacy reference, documents CLIP variance |

Always cite § 5.1 in the paper. The single-run baseline (§ 5.2) is retained for completeness only: CLIP reaches just 86.25% there, against 90.15% averaged over 5 folds, a gap we attribute to a single unlucky initialisation.


1. Abstract

Automatic interpretation of colour fundus photographs is a foundational task for screening prevalent blinding diseases such as diabetic retinopathy, glaucoma and age-related macular degeneration. We benchmark nine deep-learning backbones spanning five families: classical CNNs (VGG-19, ResNet-50, ResNet-101, DenseNet-121, Inception-v3), vision-language pretraining (OpenAI CLIP ViT-B/16), self-supervised vision transformers (DINOv2-L/14), hierarchical transformers (Swin-B), and domain-specific MAE pretraining (RETFound MAE ViT-L/16). All are evaluated on a 10-class fundus dataset of 16 242 augmented images. To suppress augmentation-induced leakage between splits, we construct a group-aware (perceptual-hash) stratified 5-fold split and report mean accuracy, F1, ROC-AUC, ECE, Cohen's κ, Brier score, bootstrap 95% confidence intervals, Bonferroni-corrected McNemar tests, and 90% Mondrian conformal sets.

Headline result (5-fold CV). Classical CNNs and CLIP ViT-B/16 are statistically indistinguishable at the top, with Inception-v3 leading at 90.18% accuracy (F1 = 92.54%, κ = 0.884, ROC-AUC = 0.9930). Contrary to expectations, foundation models pretrained on general vision (DINOv2-L: 89.61%, Swin-B: 87.00%) or fundus images (RETFound: 83.35%) do not outperform classical CNNs on this moderate-sized dataset. A soft-vote ensemble of all nine models reaches ROC-AUC = 0.9941 and accuracy = 89.68%.


2. Motivation & Model Selection

Modern fundus screening pipelines are increasingly built on pre-trained image backbones, but the question "which backbone family is best for fundus disease classification on a moderately-sized, imbalanced dataset?" has no consensus answer. We deliberately chose backbones that exercise five distinct inductive biases / pretraining regimes:

| Family | Backbone(s) | Why included |
|---|---|---|
| Classical CNNs | VGG-19, ResNet-50, ResNet-101, DenseNet-121, Inception-v3 | Established baselines used in virtually all prior fundus benchmarks (Gulshan 2016, Ting 2017). Locally-connected convolutions suit texture-dominant retinal pathology. |
| Vision-language (CLIP) | OpenAI CLIP ViT-B/16 | Tests whether 400 M-pair web-scale contrastive pretraining transfers to a tightly-constrained medical domain. |
| Self-supervised ViT | DINOv2-L/14 | State-of-the-art general-purpose features without language supervision (Oquab 2023). |
| Hierarchical ViT | Swin-B | Adds hierarchy + shifted windows; competitive on ImageNet at lower compute than ViT-L (Liu 2021). |
| Domain MAE | RETFound MAE ViT-L/16 | Pretrained on 1.6 M colour fundus images (Zhou 2023, Nature); the strongest published prior on this modality. |

This grid probes three factors, though not fully independently (for example, the ViT-B vs ViT-L comparison also differs in pretraining): (i) scale (ResNet-50 vs ResNet-101; ViT-B vs ViT-L); (ii) pretraining modality (ImageNet supervised vs CLIP language-supervised vs DINOv2 self-supervised vs RETFound domain-MAE); and (iii) architecture class (CNN vs ViT vs hierarchical).


3. Dataset

  • Source. Mendeley Data (10 classes; 5 335 original images).
  • Augmentation. Class-balancing augmentation expanded the pool to 16 242 images (rotation, horizontal flip, brightness/contrast jitter, Gaussian blur). Each augmented image carries its source's diagnostic label.
  • Companion dataset: DoB24/fundus-10class-augmented.

3.1 Class distribution

| Index | Class | Original | Augmented |
|---|---|---|---|
| 0 | Central Serous Chorioretinopathy | 101 | 606 |
| 1 | Diabetic Retinopathy | 1 509 | 3 444 |
| 2 | Disc Edema | 127 | 762 |
| 3 | Glaucoma | 1 349 | 2 880 |
| 4 | Healthy | 1 024 | 2 676 |
| 5 | Macular Scar | 444 | 1 937 |
| 6 | Myopia | 500 | 2 251 |
| 7 | Pterygium | 17 | 102 |
| 8 | Retinal Detachment | 125 | 750 |
| 9 | Retinitis Pigmentosa | 139 | 834 |
| | Total | 5 335 | 16 242 |

3.2 Group-aware splitting (data-leakage prevention)

Because the augmented set contains visually near-duplicate copies of each original image, a naïve train_test_split over the augmented pool would place copies of the same original on both sides of a split, letting models score well by memorising individual images rather than learning pathology. We prevent this by:

  1. Computing a 64-bit perceptual hash (pHash) on every image.
  2. Linking each augmented image to its nearest original at Hamming distance ≤ 8 → defines a group_id.
  3. Running scikit-learn StratifiedGroupKFold(n_splits=5) so that all augmented children of a given original sit in exactly one fold.
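
Step 2 can be sketched as follows. This is a minimal illustration rather than the thesis code: it assumes the 64-bit pHash values are already available as Python integers, the helper names `hamming` and `assign_groups` are hypothetical, and marking unmatched images with `None` is an assumption (the source does not say how such images are handled).

```python
from typing import List, Optional


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")


def assign_groups(original_hashes: List[int],
                  augmented_hashes: List[int],
                  max_dist: int = 8) -> List[Optional[int]]:
    """Map each augmented hash to the index of its nearest original.

    An image whose nearest original is farther than max_dist gets None
    (assumption: the real pipeline may treat such images differently).
    """
    groups: List[Optional[int]] = []
    for h in augmented_hashes:
        dists = [hamming(h, o) for o in original_hashes]
        best = min(range(len(dists)), key=dists.__getitem__)
        groups.append(best if dists[best] <= max_dist else None)
    return groups


# Toy hashes, chosen only to exercise the three possible outcomes.
originals = [0x0000000000000000, 0xFFFFFFFFFFFFFFFF]
augmented = [0b111, 0xFFFFFFFFFFFFFFFE, 0x00000000FFFFFFFF]
group_ids = assign_groups(originals, augmented)
```

For step 3, the resulting `group_ids` would be passed as the `groups` argument of `StratifiedGroupKFold(n_splits=5).split(X, y, groups)`, which keeps every group inside a single fold.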

The held-out test set is fixed across all 5 folds: 3 208 images (≈ 19.8% of the augmented pool). Exact manifest: splits/holdout_split_augmented.json (3.2 MB).


4. Training Protocol

4.1 CNN / CLIP backbones

| Hyper-parameter | Value |
|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.999, weight-decay=1×10⁻⁴) |
| Initial LR | 2×10⁻⁴ |
| LR schedule | 3-epoch linear warm-up + cosine decay to 0 |
| Epochs | Up to 60 (early stop patience=12 on val F1) |
| Batch size | 32 |
| Image size | 224×224 (Inception-v3: 299×299) |
| Preprocessing | CLAHE (LAB L-channel) → RandAugment (n=2, m=9) → ImageNet normalisation |
| Imbalance | WeightedRandomSampler (weights ∝ 1/class_count) |
| Regularisation | MixUp (α=0.2) + CutMix (α=1.0, p=0.7) |
| Mixed precision | torch.amp.autocast + GradScaler |
| Test-time aug | 6 views (centre + 4 corners + h-flip), soft-vote mean |
| Hardware | PyTorch 2.11 + CUDA 12.8, NVIDIA Tesla T4 (16 GB) |
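
The LR schedule row can be illustrated with a small stand-alone function. This is a sketch under stated assumptions, not the training code: it assumes the learning rate is updated once per epoch, that warm-up ramps linearly from one third of the peak to the peak over epochs 0 to 2, and that the cosine phase reaches 0 at the end of epoch 60. The name `lr_at_epoch` is hypothetical.

```python
import math


def lr_at_epoch(epoch: int,
                base_lr: float = 2e-4,
                warmup_epochs: int = 3,
                total_epochs: int = 60) -> float:
    """Learning rate for a given 0-based epoch (per-epoch stepping assumed)."""
    if epoch < warmup_epochs:
        # Linear warm-up: 1/3, 2/3, 3/3 of the peak for epochs 0, 1, 2.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from the peak down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(60)]
```

If the actual implementation steps per iteration instead of per epoch, the shape is the same but smoother; with early stopping (patience 12) a run may end before the tail of the curve is reached.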

4.2 Foundation model backbones (DINOv2-L, Swin-B, RETFound)

Two-stage schedule per fold:

| Stage | Layers trained | Epochs | Head LR | Backbone LR |
|---|---|---|---|---|
| Linear probe | Head only | 20 | 1×10⁻³ | frozen |
| Full fine-tune | All layers | 15 | 1×10⁻⁴ | 1×10⁻⁵ |

Batch size: 24. Early stopping patience: 8 epochs on val F1.

4.3 Ensemble

F1-weighted soft-vote across all 9 models, using each model's validation F1 as the weight.
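
A minimal sketch of F1-weighted soft voting for a single image is shown below. It assumes the weights are normalised to sum to one before averaging (the source only states that validation F1 is used as the weight); the function name `weighted_soft_vote` and the toy numbers are illustrative, not taken from the benchmark.

```python
from typing import List, Sequence


def weighted_soft_vote(probs_per_model: Sequence[Sequence[float]],
                       val_f1: Sequence[float]) -> List[float]:
    """Fuse per-model class probabilities for ONE image.

    probs_per_model[m][c] is model m's probability for class c;
    val_f1[m] is that model's validation F1, used as its weight.
    """
    total = sum(val_f1)
    n_classes = len(probs_per_model[0])
    return [
        sum(w * p[c] for w, p in zip(val_f1, probs_per_model)) / total
        for c in range(n_classes)
    ]


# Toy example: three models, three classes.
probs = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1]]
f1 = [0.90, 0.60, 0.90]
fused = weighted_soft_vote(probs, f1)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Because the nine validation F1 scores in this benchmark are close to one another, such a weighting is expected to behave almost like a plain mean of the probabilities.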


5. Results

All metrics are on the fixed 3 208-image holdout test set.


5.1 Five-fold cross-validation — authoritative results (use in paper)

Acc, F1, 95% CI, ROC-AUC, and ECE are means over 5 independent training runs. κ and Brier for CNN/CLIP are from 5-fold pooled predictions. For foundation models (†), κ and Brier are from single-run inference on the same holdout (fold-level predictions not stored); acc/F1/ROC/ECE are still 5-fold averages.

| Rank | Model | Acc (%) | 95% CI | F1 (%) | F1 95% CI | κ | Brier | ROC-AUC | ECE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | inception_v3 | 90.18 | [89.24, 91.24] | 92.54 | [91.68, 93.40] | 0.884 | 0.150 | 0.9930 | 0.0194 |
| 2 | clip_openai | 90.15 | [89.18, 91.24] | 92.83 | [92.00, 93.61] | 0.884 | 0.140 | 0.9944 | 0.0217 |
| 3 | vgg19 | 90.12 | [89.09, 91.12] | 92.59 | [91.77, 93.41] | 0.884 | 0.150 | 0.9933 | 0.0228 |
| 4 | resnet101 | 90.09 | [89.15, 91.12] | 92.63 | [91.77, 93.47] | 0.883 | 0.140 | 0.9941 | 0.0243 |
| 5 | densenet121 | 89.65 | [88.62, 90.71] | 92.29 | [91.37, 93.07] | 0.878 | 0.150 | 0.9937 | 0.0272 |
| 6 | dinov2_l | 89.61 | [88.57, 90.64] | 92.27 | [91.38, 93.08] | 0.876† | 0.160† | 0.9934 | 0.0299 |
| 7 | resnet50 | 89.50 | [88.40, 90.59] | 92.20 | [91.34, 93.03] | 0.876 | 0.140 | 0.9945 | 0.0339 |
| 8 | swin_b | 87.00 | [85.92, 88.15] | 90.26 | [89.31, 91.19] | 0.845† | 0.190† | 0.9896 | 0.0294 |
| 9 | retfound | 83.35 | [82.16, 84.65] | 87.27 | [86.22, 88.35] | 0.810† | 0.240† | 0.9834 | 0.0242 |
| | 9-Model Ensemble | 89.68 | [88.65, 90.74] | 92.25 | | 0.878 | 0.144 | 0.9941 | 0.0198 |

† κ and Brier for DINOv2-L, Swin-B, RETFound from single-run holdout inference. All other foundation model metrics are 5-fold CV averages.
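
For readers who want to recompute intervals like those in the 95% CI columns from the stored predictions, a percentile bootstrap over test images can be sketched as below. The source states only that the intervals are bootstrap 95% CIs; the percentile method, the 1 000 resamples, the fixed seed, and the name `bootstrap_accuracy_ci` are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(correct: List[int],
                          n_resamples: int = 1000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy.

    correct[i] is 1 if test image i was classified correctly, else 0.
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test set with replacement and recompute accuracy each time.
    accs = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = accs[int((alpha / 2) * n_resamples)]
    hi = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy example: 100 predictions, 90 of them correct.
correct = [1] * 90 + [0] * 10
lo, hi = bootstrap_accuracy_ci(correct)
```

On the real 3 208-image holdout, `correct` would be built by comparing the labels and predictions stored in the `*_test_preds.json` files.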

Key findings:

  • Top-4 models (Inception-v3, CLIP, VGG-19, ResNet-101) are statistically indistinguishable: all 6 pairwise McNemar tests p > 0.05 after Bonferroni correction (see kfold/cnn_clip/mcnemar.json).
  • DINOv2-L (89.61%) matches DenseNet-121 (89.65%) within noise despite having roughly 40× more parameters (about 300 M vs about 8 M).
  • RETFound, pretrained on 1.6 M fundus images, ranks last — the LP+FT protocol with patience=8 may be insufficient on this dataset size.
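  • The ensemble's ROC-AUC (0.9941) comes within 0.0004 of the best individual model (ResNet-50, 0.9945) but does not exceed it, and its accuracy (89.68%) sits below the top four single models.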

Machine-readable full table: kfold/kfold_v2_summary.csv


5.2 Single-run baseline (for reference only, do not cite in paper)

Retained to document the original preliminary experiment (one run, fold-0 split). CLIP's 86.25% here vs 90.15% in § 5.1 is single-run variance, not an architectural effect.

| Rank | Model | Acc (%) | 95% CI | F1 (%) | κ | Brier | ROC-AUC |
|---|---|---|---|---|---|---|---|
| 1 | densenet121 | 89.78 | [88.71, 90.80] | 92.26 | 0.879 | 0.148 | 0.9931 |
| 2 | dinov2_l | 89.50 | [88.47, 90.55] | 92.15 | 0.876 | 0.155 | 0.9938 |
| 3 | vgg19 | 89.31 | [88.22, 90.40] | 92.12 | 0.874 | 0.154 | 0.9930 |
| 4 | resnet101 | 89.25 | [88.19, 90.34] | 92.05 | 0.873 | 0.149 | 0.9941 |
| 5 | inception_v3 | 89.21 | [88.15, 90.28] | 91.97 | 0.873 | 0.157 | 0.9934 |
| 6 | resnet50 | 89.09 | [88.00, 90.12] | 91.87 | 0.871 | 0.147 | 0.9944 |
| 7 | swin_b | 86.85 | [85.69, 88.03] | 90.44 | 0.845 | 0.185 | 0.9904 |
| 8 | clip_openai | 86.25 | [85.10, 87.41] | 89.99 | 0.838 | 0.195 | 0.9896 |
| 9 | retfound | 83.88 | [82.64, 85.10] | 87.68 | 0.810 | 0.238 | 0.9838 |
| | 9-Model Ensemble | 89.68 | [88.65, 90.74] | 92.25 | 0.878 | 0.144 | 0.9941 |

The ensemble row is identical in both tables — it is computed once on the fixed holdout and does not depend on the training run.


5.3 Statistical significance

All 15 pairwise Bonferroni-corrected McNemar tests among the 6 CNN/CLIP models: p > 0.05 — no statistically significant pairwise differences. Full χ² and p-value matrix: kfold/cnn_clip/mcnemar.json.
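
A single pairwise test of this kind can be sketched in a few lines. The source does not say which McNemar variant was used; the version below assumes the continuity-corrected χ² statistic with one degree of freedom and a Bonferroni factor of 15 (one per model pair). The function names and the discordant counts `b=40, c=25` are made up for illustration.

```python
import math


def mcnemar_p(b: int, c: int) -> float:
    """Two-sided p-value of the continuity-corrected McNemar test.

    b: images model A gets right and model B gets wrong.
    c: images model A gets wrong and model B gets right.
    """
    if b + c == 0:
        return 1.0  # the two models never disagree
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square with 1 dof: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Illustrative counts only; the real ones live in kfold/cnn_clip/mcnemar.json.
p_raw = mcnemar_p(b=40, c=25)
p_adj = bonferroni(p_raw, n_tests=15)
```

With these toy counts the raw p-value is already above 0.05, and the correction pushes the adjusted value to the cap of 1.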


5.4 Calibration (ECE)

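| Model | ECE | Note |
|---|---|---|
| inception_v3 | 0.0194 | Lowest ECE overall |
| clip_openai | 0.0217 | |
| vgg19 | 0.0228 | |
| retfound | 0.0242 | |
| resnet101 | 0.0243 | |
| densenet121 | 0.0272 | |
| swin_b | 0.0294 | |
| dinov2_l | 0.0299 | |
| resnet50 | 0.0339 | |
| 9-Model Ensemble | 0.0198 | Second lowest, just behind inception_v3 |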

Reliability diagrams: analysis/reliability_diagrams/reliability_<model>.png.
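
For reference, ECE with equal-width confidence bins can be computed as in the sketch below. The number of bins and the binning convention behind the values above are not stated in this README, so the 15-bin default, the left-closed bins, and the name `expected_calibration_error` are assumptions.

```python
from typing import List


def expected_calibration_error(confidences: List[float],
                               correct: List[int],
                               n_bins: int = 15) -> float:
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences[i] is the top-class probability for image i;
    correct[i] is 1 if that top-class prediction was right, else 0.
    """
    n = len(confidences)
    bin_conf = [0.0] * n_bins
    bin_hits = [0.0] * n_bins
    bin_count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Equal-width bins [k/n_bins, (k+1)/n_bins); conf == 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bin_conf[idx] += conf
        bin_hits[idx] += hit
        bin_count[idx] += 1
    ece = 0.0
    for s_conf, s_hit, cnt in zip(bin_conf, bin_hits, bin_count):
        if cnt:
            ece += (cnt / n) * abs(s_hit / cnt - s_conf / cnt)
    return ece


# Toy example: four confident predictions (one wrong) and two uncertain ones.
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hits = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confs, hits, n_bins=10)
```

In the toy example the high-confidence bin has accuracy 0.75 against confidence 0.95, and the other bin 0.50 against 0.55, giving an ECE of 0.15.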


6. Reproducibility

6.1 Load a model (PyTorch ≥ 2.6)

# CNN backbone: any of inception_v3, densenet121, vgg19, resnet101, resnet50
import timm
import torch
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download("DoB24/fundus-9model-benchmark",
                       "weights/inception_v3_v2_final.pth")
model = timm.create_model("inception_v3", num_classes=10)
# weights_only=False unpickles arbitrary objects; only use it on checkpoints you trust.
state = torch.load(ckpt, map_location="cpu", weights_only=False)
# Checkpoints are dicts with keys model, optimizer, epoch; fall back to a bare state dict.
model.load_state_dict(state["model"] if "model" in state else state)
model.eval()

# CLIP ViT-B/16
import open_clip
import torch
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download("DoB24/fundus-9model-benchmark",
                       "weights/clip_openai_v2_final.pth")
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
state = torch.load(ckpt, map_location="cpu", weights_only=False)
model.load_state_dict(state["model"] if "model" in state else state)
model.eval()

6.2 Inference preprocessing

import cv2
from PIL import Image
from torchvision import transforms

def clahe_preprocess(img_path):
    """Apply CLAHE to the L channel in LAB space and return an RGB PIL image."""
    img = cv2.imread(str(img_path))  # BGR array, or None if the file cannot be read
    if img is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge([clahe.apply(l), a, b])
    img = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)
    return Image.fromarray(img)

# Deterministic validation/test transform with ImageNet mean and std.
val_transform = transforms.Compose([
    transforms.Resize((224, 224)),   # use (299, 299) for inception_v3
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

6.3 Class index mapping

0: Central Serous Chorioretinopathy [Color Fundus]
1: Diabetic Retinopathy
2: Disc Edema
3: Glaucoma
4: Healthy
5: Macular Scar
6: Myopia
7: Pterygium
8: Retinal Detachment
9: Retinitis Pigmentosa

6.4 Quick inference

pip install torch torchvision timm open_clip_torch huggingface_hub pillow opencv-python
python code/inference_example.py path/to/fundus.jpg --model inception_v3

7. Files in this repository

| Path | Description |
|---|---|
| weights/<model>_v2_final.pth ×9 | Fine-tuned weights: dict with keys model, optimizer, epoch |
| results/<model>_test.json ×9 | Single-run holdout metrics (acc, F1, κ, Brier, ROC-AUC, per-class) |
| results/<model>_test_preds.json ×9 | Single-run labels + preds + probs (3 208 items) |
| results/ensemble_report.json | Ensemble + McNemar + conformal report (single-run) |
| kfold/kfold_v2_summary.csv | Authoritative 9-model 5-fold summary (machine-readable) |
| kfold/cnn_clip/summary.json | 5-fold aggregated means + CI for CNN/CLIP models |
| kfold/cnn_clip/<model>_kfold.json ×6 | Per-fold val metrics for CNN/CLIP |
| kfold/cnn_clip/<model>_test_preds.json ×6 | 5-fold pooled predictions |
| kfold/cnn_clip/mcnemar.json | 15 pairwise McNemar tests |
| kfold/foundation_fold{0-4}_{model}.json ×15 | Per-fold test metrics for foundation models |
| splits/holdout_split_augmented.json | pHash-grouped 5-fold manifest (3.2 MB) |
| analysis/confusion_matrices/cm_<model>.png ×10 | Per-model confusion matrices |
| analysis/roc_curves/roc_<model>.png ×10 | One-vs-rest ROC curves |
| analysis/reliability_diagrams/reliability_<model>.png ×10 | Calibration reliability diagrams |
| analysis/per_class_metrics.csv | Precision / recall / F1 / support per model per class |
| analysis/ece_summary.json | ECE values for all models |
| gradcam/gradcam_<model>.png ×9 | GradCAM / input-gradient saliency maps |
| code/hparams.json | Full hyperparameter table |
| CITATION.cff | Citation File Format |

8. Compute Disclosure

All 9 models trained across 5 folds on a single NVIDIA Tesla T4 (16 GB), PyTorch 2.11.0+cu128.

| Model | Approx. GPU-hours (5 folds total) |
|---|---|
| VGG-19 | 12.5 |
| ResNet-50 | 11.5 |
| ResNet-101 | 15.5 |
| DenseNet-121 | 14.0 |
| Inception-v3 | 11.0 |
| CLIP ViT-B/16 | 19.0 |
| DINOv2-L | 56.0 |
| Swin-B | 22.5 |
| RETFound | 47.0 |
| Ensemble + stats | 0.3 |
| Total | ~209 GPU-hours |

9. Citation

@mastersthesis{katiyo2026fundus,
  author  = {Katiyo, Daryl Panashe},
  title   = {Classification of Fundus Lesion Images Using Deep Learning Models},
  school  = {Xidian University},
  year    = {2026},
  note    = {Companion artifact: \url{https://huggingface.co/DoB24/fundus-9model-benchmark}}
}
@dataset{nayan2023fundus,
  author  = {Nayan, Asma U. and Saha, Sajib K. and others},
  title   = {A Curated Dataset of Retinal Fundus Images for Disease Classification},
  year    = {2023},
  doi     = {10.17632/s9bfhswzjb.1},
  url     = {https://data.mendeley.com/datasets/s9bfhswzjb/1}
}

10. References

  1. Gulshan V. et al. "Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs." JAMA 316.22 (2016): 2402–2410.
  2. Ting D.S.W. et al. "Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases." JAMA 318.22 (2017): 2211–2223.
  3. Oquab M. et al. "DINOv2: Learning Robust Visual Features without Supervision." arXiv:2304.07193 (2023).
  4. Liu Z. et al. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." ICCV 2021.
  5. Zhou Y. et al. "A foundation model for generalizable disease detection from retinal images." Nature 622 (2023): 156–163.
  6. He K. et al. "Deep Residual Learning for Image Recognition." CVPR 2016.
  7. Simonyan K., Zisserman A. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR 2015.
  8. Huang G. et al. "Densely Connected Convolutional Networks." CVPR 2017.
  9. Szegedy C. et al. "Rethinking the Inception Architecture for Computer Vision." CVPR 2016.
  10. Radford A. et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021.
  11. Zhang H. et al. "mixup: Beyond Empirical Risk Minimization." ICLR 2018.
  12. Yun S. et al. "CutMix: Regularization Strategy to Train Strong Classifiers." ICCV 2019.
  13. Cubuk E.D. et al. "RandAugment: Practical Automated Data Augmentation." NeurIPS 2020.
  14. Vovk V., Gammerman A., Shafer G. Algorithmic Learning in a Random World. Springer, 2005.
  15. Bonferroni C.E. "Teoria statistica delle classi e calcolo delle probabilità." 1936.

11. License & Contact

Apache-2.0 for code and weights. Original Mendeley dataset: CC BY 4.0.

Questions / collaboration: open an Issue on the Hub repo.
