Fundus Lesion Image Classification — 9-Model Comparative Benchmark
Companion artifact for the Master's thesis "Classification of Fundus Lesion Images Using Deep Learning Models" (Xidian University, 2026), by Daryl Panashe Katiyo.
Reproducible weights, predictions, and full statistical analysis for nine deep-learning backbones evaluated on a 10-class colour-fundus dataset with a group-aware (perceptual-hash) 5-fold cross-validation protocol.
Important Note on Tables
Two result tables appear in this README. They measure different things and must not be mixed:
| Table | Protocol | # runs | Use for |
|---|---|---|---|
| § 5.1: 5-fold CV (authoritative) | 5 independent train/val splits, fixed holdout test | 5 | Paper, model comparison, all citations |
| § 5.2: Single-run baseline | One training run, fold-0 split | 1 | Legacy reference, documents CLIP variance |

Always cite § 5.1 in the paper. The single-run baseline (§ 5.2) is retained for completeness: CLIP underperforms there (86.25%) due to a single unlucky initialisation, versus 90.15% over 5 folds.
1. Abstract
Automatic interpretation of colour fundus photographs is a foundational task for screening prevalent blinding diseases such as diabetic retinopathy, glaucoma and age-related macular degeneration. We benchmark nine deep-learning backbones spanning five families: classical CNNs (VGG-19, ResNet-50, ResNet-101, DenseNet-121, Inception-v3), vision-language pretraining (OpenAI CLIP ViT-B/16), self-supervised vision transformers (DINOv2-L/14), hierarchical transformers (Swin-B), and domain-specific MAE pretraining (RETFound MAE ViT-L/16). All are evaluated on a 10-class fundus dataset of 16 242 augmented images. To suppress augmentation-induced leakage between splits, we construct a group-aware (perceptual-hash) stratified 5-fold split and report mean accuracy, F1, ROC-AUC, ECE, Cohen's κ, Brier score, bootstrap 95% confidence intervals, Bonferroni-corrected McNemar tests, and 90% Mondrian conformal sets.
Headline result (5-fold CV). Classical CNNs and CLIP ViT-B/16 are statistically indistinguishable at the top, with Inception-v3 leading at 90.18% accuracy (F1 = 92.54%, κ = 0.884, ROC-AUC = 0.9930). Contrary to expectations, foundation models pretrained on general vision (DINOv2-L: 89.61%, Swin-B: 87.00%) or fundus images (RETFound: 83.35%) do not outperform classical CNNs on this moderate-sized dataset. A soft-vote ensemble of all nine models reaches ROC-AUC = 0.9941 and accuracy = 89.68%.
2. Motivation & Model Selection
Modern fundus screening pipelines are increasingly built on pre-trained image backbones, but the question "which backbone family is best for fundus disease classification on a moderately sized, imbalanced dataset?" has no consensus answer. We deliberately chose backbones that exercise five distinct inductive biases / pretraining regimes:
| Family | Backbone(s) | Why included |
|---|---|---|
| Classical CNNs | VGG-19, ResNet-50, ResNet-101, DenseNet-121, Inception-v3 | Established baselines used in virtually all prior fundus benchmarks (Gulshan 2016, Ting 2017). Locally-connected convolutions suit texture-dominant retinal pathology. |
| Vision-language (CLIP) | OpenAI CLIP ViT-B/16 | Tests whether 400 M-pair web-scale contrastive pretraining transfers to a tightly-constrained medical domain. |
| Self-supervised ViT | DINOv2-L/14 | State-of-the-art general-purpose features without language supervision (Oquab 2023). |
| Hierarchical ViT | Swin-B | Adds hierarchy + shifted windows; competitive on ImageNet at lower compute than ViT-L (Liu 2021). |
| Domain MAE | RETFound MAE ViT-L/16 | Pretrained on 1.6 M colour fundus images (Zhou 2023, Nature); the strongest published prior on this modality. |
This grid varies three factors: (i) scale (ResNet-50 vs ResNet-101; ViT-B vs ViT-L); (ii) pretraining modality (ImageNet supervised vs CLIP language-supervised vs DINOv2 self-supervised vs RETFound domain-MAE); and (iii) architecture class (CNN vs ViT vs hierarchical).
3. Dataset
- Source. Mendeley Data (10 classes; 5 335 original images).
- Augmentation. Class-balancing augmentation expanded the pool to 16 242 images (rotation, horizontal flip, brightness/contrast jitter, Gaussian blur). Each augmented image carries its source's diagnostic label.
- Companion dataset: DoB24/fundus-10class-augmented.
3.1 Class distribution
| Index | Class | Original | Augmented |
|---|---|---|---|
| 0 | Central Serous Chorioretinopathy | 101 | 606 |
| 1 | Diabetic Retinopathy | 1 509 | 3 444 |
| 2 | Disc Edema | 127 | 762 |
| 3 | Glaucoma | 1 349 | 2 880 |
| 4 | Healthy | 1 024 | 2 676 |
| 5 | Macular Scar | 444 | 1 937 |
| 6 | Myopia | 500 | 2 251 |
| 7 | Pterygium | 17 | 102 |
| 8 | Retinal Detachment | 125 | 750 |
| 9 | Retinitis Pigmentosa | 139 | 834 |
| — | Total | 5 335 | 16 242 |
3.2 Group-aware splitting (data-leakage prevention)
Because the augmented set contains visually near-duplicate copies of each original image, a naïve train_test_split over the augmented pool would scatter copies of the same source image across train and test, letting models score well by memorising individual source images. We prevent this by:
- Computing a 64-bit perceptual hash (pHash) on every image.
- Linking each augmented image to its nearest original at Hamming distance ≤ 8, which defines a group_id.
- Running scikit-learn StratifiedGroupKFold(n_splits=5) so that all augmented children of a given original sit in exactly one fold.
The held-out test set is fixed across all 5 folds: 3 208 images
(≈ 19.8% of the augmented pool). Exact manifest:
splits/holdout_split_augmented.json (3.2 MB).
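The grouping steps above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the 64-bit hashes have already been computed and are passed in as plain integers, and the helper names (hamming64, assign_group_ids, make_folds) are hypothetical. How the original pipeline treats an augmented image with no original within distance 8 is not stated; the sketch assumes such an image becomes its own group.

```python
from typing import Dict, List, Sequence

MASK64 = (1 << 64) - 1  # keep comparisons within 64 bits


def hamming64(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin((a ^ b) & MASK64).count("1")


def assign_group_ids(
    original_hashes: Dict[str, int],
    augmented_hashes: Dict[str, int],
    max_distance: int = 8,
) -> Dict[str, str]:
    """Map every image name to a group id (the name of its source image)."""
    # Each original image is the root of its own group.
    groups = {name: name for name in original_hashes}
    for aug_name, aug_hash in augmented_hashes.items():
        nearest = min(
            original_hashes,
            key=lambda orig: hamming64(aug_hash, original_hashes[orig]),
        )
        dist = hamming64(aug_hash, original_hashes[nearest])
        # Assumption: an unmatched augmented image forms a singleton group.
        groups[aug_name] = nearest if dist <= max_distance else aug_name
    return groups


def make_folds(labels: Sequence[int], group_ids: Sequence[str], n_splits: int = 5) -> List:
    """Stratified folds in which no group is split across train and validation."""
    # Imported lazily so the hashing helpers above run without scikit-learn.
    from sklearn.model_selection import StratifiedGroupKFold

    splitter = StratifiedGroupKFold(n_splits=n_splits)
    placeholder_x = [[0]] * len(labels)  # features are not used for splitting
    return list(splitter.split(placeholder_x, labels, groups=group_ids))
```

A quick sanity check after splitting is to confirm that the set of group ids in each validation fold is disjoint from the group ids in the corresponding training fold.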
4. Training Protocol
4.1 CNN / CLIP backbones
| Hyper-parameter | Value |
|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.999, weight-decay=1×10⁻⁴) |
| Initial LR | 2×10⁻⁴ |
| LR schedule | 3-epoch linear warm-up + cosine decay to 0 |
| Epochs | Up to 60 (early stop patience=12 on val F1) |
| Batch size | 32 |
| Image size | 224×224 (Inception-v3: 299×299) |
| Preprocessing | CLAHE (LAB L-channel) → RandAugment (n=2, m=9) → ImageNet normalisation |
| Imbalance | WeightedRandomSampler (weights ∝ 1/class_count) |
| Regularisation | MixUp (α=0.2) + CutMix (α=1.0, p=0.7) |
| Mixed precision | torch.amp.autocast + GradScaler |
| Test-time aug | 6 views (centre + 4 corners + h-flip), soft-vote mean |
| Hardware | PyTorch 2.11 + CUDA 12.8, NVIDIA Tesla T4 (16 GB) |
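Two rows of the table above, the LR schedule and the imbalance sampler weights, can be made concrete with a short sketch. It assumes the schedule is stepped once per epoch (the table does not say whether stepping is per epoch or per iteration) and that warm-up ramps up to the initial LR on the last warm-up epoch; lr_at_epoch and inverse_frequency_weights are hypothetical names, not functions from this repository.

```python
import math
from collections import Counter


def lr_at_epoch(epoch, base_lr=2e-4, warmup_epochs=3, total_epochs=60):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay to 0."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up schedule completed, clipped to [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))


def inverse_frequency_weights(labels):
    """Per-sample weights proportional to 1 / class_count."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]
```

The list returned by inverse_frequency_weights(train_labels) can be passed as the weights argument of torch.utils.data.WeightedRandomSampler, so that each class is drawn with roughly equal probability regardless of its size.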
4.2 Foundation model backbones (DINOv2-L, Swin-B, RETFound)
Two-stage schedule per fold:
| Stage | Layers trained | Epochs | Head LR | Backbone LR |
|---|---|---|---|---|
| Linear probe | Head only | 20 | 1×10⁻³ | frozen |
| Full fine-tune | All layers | 15 | 1×10⁻⁴ | 1×10⁻⁵ |
Batch size: 24. Early stopping patience: 8 epochs on val F1.
4.3 Ensemble
F1-weighted soft-vote across all 9 models, using each model's validation F1 as the weight.
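As an illustration of F1-weighted soft voting, the sketch below fuses per-model class probabilities using validation F1 scores as weights. Normalising the weights to sum to 1 is an assumption made here so the fused rows remain probability distributions; weighted_soft_vote is a hypothetical helper, not the repository's implementation.

```python
def weighted_soft_vote(probs_per_model, val_f1):
    """Fuse class probabilities from several models.

    probs_per_model: one [n_samples][n_classes] nested list per model.
    val_f1: one validation F1 score per model, used as its voting weight.
    """
    if len(probs_per_model) != len(val_f1):
        raise ValueError("need exactly one weight per model")
    total = float(sum(val_f1))
    n_samples = len(probs_per_model[0])
    n_classes = len(probs_per_model[0][0])
    fused = [[0.0] * n_classes for _ in range(n_samples)]
    for probs, weight in zip(probs_per_model, val_f1):
        share = weight / total  # normalised weight (assumption)
        for i, row in enumerate(probs):
            for c, p in enumerate(row):
                fused[i][c] += share * p
    return fused


def argmax_labels(fused):
    """Predicted class index for each fused probability row."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]
```

Because all nine models here have validation F1 scores within a few points of each other, the weights are close to uniform and the result should differ little from a plain mean of the probabilities.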
5. Results
All metrics are on the fixed 3 208-image holdout test set.
5.1 Five-fold cross-validation — authoritative results (use in paper)
Acc, F1, 95% CI, ROC-AUC, and ECE are means over 5 independent training runs. κ and Brier for CNN/CLIP are from 5-fold pooled predictions. For foundation models (†), κ and Brier are from single-run inference on the same holdout (fold-level predictions not stored); acc/F1/ROC/ECE are still 5-fold averages.
| Rank | Model | Acc (%) | 95% CI | F1 (%) | F1 95% CI | κ | Brier | ROC-AUC | ECE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | inception_v3 | 90.18 | [89.24, 91.24] | 92.54 | [91.68, 93.40] | 0.884 | 0.150 | 0.9930 | 0.0194 |
| 2 | clip_openai | 90.15 | [89.18, 91.24] | 92.83 | [92.00, 93.61] | 0.884 | 0.140 | 0.9944 | 0.0217 |
| 3 | vgg19 | 90.12 | [89.09, 91.12] | 92.59 | [91.77, 93.41] | 0.884 | 0.150 | 0.9933 | 0.0228 |
| 4 | resnet101 | 90.09 | [89.15, 91.12] | 92.63 | [91.77, 93.47] | 0.883 | 0.140 | 0.9941 | 0.0243 |
| 5 | densenet121 | 89.65 | [88.62, 90.71] | 92.29 | [91.37, 93.07] | 0.878 | 0.150 | 0.9937 | 0.0272 |
| 6 | dinov2_l | 89.61 | [88.57, 90.64] | 92.27 | [91.38, 93.08] | 0.876† | 0.160† | 0.9934 | 0.0299 |
| 7 | resnet50 | 89.50 | [88.40, 90.59] | 92.20 | [91.34, 93.03] | 0.876 | 0.140 | 0.9945 | 0.0339 |
| 8 | swin_b | 87.00 | [85.92, 88.15] | 90.26 | [89.31, 91.19] | 0.845† | 0.190† | 0.9896 | 0.0294 |
| 9 | retfound | 83.35 | [82.16, 84.65] | 87.27 | [86.22, 88.35] | 0.810† | 0.240† | 0.9834 | 0.0242 |
| — | 9-Model Ensemble | 89.68 | [88.65, 90.74] | 92.25 | — | 0.878 | 0.144 | 0.9941 | 0.0198 |
† κ and Brier for DINOv2-L, Swin-B, RETFound from single-run holdout inference. All other foundation model metrics are 5-fold CV averages.
Key findings:
- Top-4 models (Inception-v3, CLIP, VGG-19, ResNet-101) are statistically indistinguishable: all 6 pairwise McNemar tests give p > 0.05 after Bonferroni correction (see kfold/cnn_clip/mcnemar.json).
- DINOv2-L (89.61%) matches DenseNet-121 (89.65%) within noise despite having roughly 38× more parameters (about 304 M vs 8 M).
- RETFound, pretrained on 1.6 M fundus images, ranks last; the linear-probe + fine-tune (LP+FT) protocol with patience=8 may be insufficient on this dataset size.
- The ensemble's ROC-AUC (0.9941) is on par with the best individual models (ResNet-50: 0.9945, CLIP: 0.9944) rather than above them.
Machine-readable full table: kfold/kfold_v2_summary.csv
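The 95% CI columns above are bootstrap intervals. A minimal sketch of a percentile bootstrap for accuracy is shown below; the resampling method, the number of resamples (1000) and the seed are assumptions, since the README does not specify them, and bootstrap_accuracy_ci is a hypothetical helper.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    correct: list of 0/1 flags, one per test image (1 = correct prediction).
    Returns (lower, upper) bounds as fractions in [0, 1].
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test set with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]
```

With the 3 208-image holdout, an interval about two percentage points wide around 90% accuracy (as in the table) is what binomial sampling noise alone would be expected to produce.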
5.2 Single-run baseline (for reference only, do not cite in paper)
Retained to document the original preliminary experiment (one run, fold-0 split). CLIP's 86.25% here vs 90.15% in § 5.1 is single-run variance, not an architectural effect.
| Rank | Model | Acc (%) | 95% CI | F1 (%) | κ | Brier | ROC-AUC |
|---|---|---|---|---|---|---|---|
| 1 | densenet121 | 89.78 | [88.71, 90.80] | 92.26 | 0.879 | 0.148 | 0.9931 |
| 2 | dinov2_l | 89.50 | [88.47, 90.55] | 92.15 | 0.876 | 0.155 | 0.9938 |
| 3 | vgg19 | 89.31 | [88.22, 90.40] | 92.12 | 0.874 | 0.154 | 0.9930 |
| 4 | resnet101 | 89.25 | [88.19, 90.34] | 92.05 | 0.873 | 0.149 | 0.9941 |
| 5 | inception_v3 | 89.21 | [88.15, 90.28] | 91.97 | 0.873 | 0.157 | 0.9934 |
| 6 | resnet50 | 89.09 | [88.00, 90.12] | 91.87 | 0.871 | 0.147 | 0.9944 |
| 7 | swin_b | 86.85 | [85.69, 88.03] | 90.44 | 0.845 | 0.185 | 0.9904 |
| 8 | clip_openai | 86.25 | [85.10, 87.41] | 89.99 | 0.838 | 0.195 | 0.9896 |
| 9 | retfound | 83.88 | [82.64, 85.10] | 87.68 | 0.810 | 0.238 | 0.9838 |
| — | 9-Model Ensemble | 89.68 | [88.65, 90.74] | 92.25 | 0.878 | 0.144 | 0.9941 |
The ensemble row is identical in both tables — it is computed once on the fixed holdout and does not depend on the training run.
5.3 Statistical significance
All 15 pairwise Bonferroni-corrected McNemar tests among the 6 CNN/CLIP
models: p > 0.05 — no statistically significant pairwise differences.
Full χ² and p-value matrix: kfold/cnn_clip/mcnemar.json.
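For readers who want to reproduce this kind of comparison from the stored predictions, here is a hedged sketch of a pairwise McNemar test with Bonferroni correction. It uses the continuity-corrected χ² statistic with one degree of freedom; whether the repository used this variant or an exact binomial test is not stated, so treat that choice, and the function names, as assumptions.

```python
import math
from itertools import combinations


def mcnemar(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test; returns (chi2 statistic, p-value)."""
    # b: A right and B wrong; c: A wrong and B right (the discordant pairs).
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree on correctness
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def pairwise_mcnemar_bonferroni(y_true, preds_by_model):
    """All pairwise tests, with p-values multiplied by the number of pairs."""
    pairs = list(combinations(sorted(preds_by_model), 2))
    results = {}
    for name_a, name_b in pairs:
        stat, p = mcnemar(y_true, preds_by_model[name_a], preds_by_model[name_b])
        results[(name_a, name_b)] = (stat, min(1.0, p * len(pairs)))
    return results
```

With 6 models there are 15 pairs, so under Bonferroni correction a raw p-value must fall below 0.05 / 15 ≈ 0.0033 to count as significant.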
5.4 Calibration (ECE)
| Model | ECE | Note |
|---|---|---|
| inception_v3 | 0.0194 | Best calibrated |
| clip_openai | 0.0217 | |
| vgg19 | 0.0228 | |
| retfound | 0.0242 | |
| resnet101 | 0.0243 | |
| densenet121 | 0.0272 | |
| swin_b | 0.0294 | |
| dinov2_l | 0.0299 | |
| resnet50 | 0.0339 | |
| 9-Model Ensemble | 0.0198 | Second lowest, just behind inception_v3 |
Reliability diagrams: analysis/reliability_diagrams/reliability_<model>.png.
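For reference, ECE can be computed from top-1 confidences and correctness flags as sketched below. The equal-width binning and the choice of 15 bins are assumptions, since the README does not state the binning used; expected_calibration_error is a hypothetical helper.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Expected calibration error with equal-width confidence bins.

    confidences: top-1 predicted probability per sample, in [0, 1].
    correct: 0/1 flag per sample (1 = prediction matches the label).
    """
    n = len(confidences)
    conf_sum = [0.0] * n_bins  # summed confidence per bin
    acc_sum = [0.0] * n_bins   # number of correct predictions per bin
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        acc_sum[idx] += ok
    # sum_b (n_b / n) * |acc_b - conf_b| simplifies to |acc_sum - conf_sum| / n.
    return sum(abs(a - c) for a, c in zip(acc_sum, conf_sum)) / n
```

An ECE near 0.02, as reported for most models here, means that average confidence and accuracy differ by about two percentage points once weighted across the bins.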
6. Reproducibility
6.1 Load a model (PyTorch ≥ 2.6)
import timm
import torch
from huggingface_hub import hf_hub_download

# CNN backbone: any of inception_v3, densenet121, vgg19, resnet101, resnet50
ckpt = hf_hub_download("DoB24/fundus-9model-benchmark",
                       "weights/inception_v3_v2_final.pth")
model = timm.create_model("inception_v3", num_classes=10)
# weights_only=False unpickles arbitrary objects; use it only for checkpoints you trust.
state = torch.load(ckpt, map_location="cpu", weights_only=False)
# Checkpoints are dicts with keys model, optimizer, epoch; fall back to a bare state dict.
model.load_state_dict(state["model"] if "model" in state else state)
model.eval()

# CLIP ViT-B/16
import open_clip
import torch
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download("DoB24/fundus-9model-benchmark",
                       "weights/clip_openai_v2_final.pth")
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
state = torch.load(ckpt, map_location="cpu", weights_only=False)
model.load_state_dict(state["model"] if "model" in state else state)
model.eval()
6.2 Inference preprocessing
from torchvision import transforms
import cv2
from PIL import Image
def clahe_preprocess(img_path):
    """Apply CLAHE to the L channel in LAB space and return an RGB PIL image."""
    img = cv2.imread(str(img_path))  # OpenCV loads images as BGR
    if img is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge([clahe.apply(l), a, b])
    img = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)
    return Image.fromarray(img)

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),  # use (299, 299) for inception_v3
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])
6.3 Class index mapping
0: Central Serous Chorioretinopathy [Color Fundus]
1: Diabetic Retinopathy
2: Disc Edema
3: Glaucoma
4: Healthy
5: Macular Scar
6: Myopia
7: Pterygium
8: Retinal Detachment
9: Retinitis Pigmentosa
6.4 Quick inference
pip install torch torchvision timm open_clip_torch huggingface_hub pillow opencv-python
python code/inference_example.py path/to/fundus.jpg --model inception_v3
7. Files in this repository
| Path | Description |
|---|---|
| weights/<model>_v2_final.pth ×9 | Fine-tuned weights: dict with keys model, optimizer, epoch |
| results/<model>_test.json ×9 | Single-run holdout metrics (acc, F1, κ, Brier, ROC-AUC, per-class) |
| results/<model>_test_preds.json ×9 | Single-run labels + preds + probs (3 208 items) |
| results/ensemble_report.json | Ensemble + McNemar + conformal report (single-run) |
| kfold/kfold_v2_summary.csv | Authoritative 9-model 5-fold summary (machine-readable) |
| kfold/cnn_clip/summary.json | 5-fold aggregated means + CI for CNN/CLIP models |
| kfold/cnn_clip/<model>_kfold.json ×6 | Per-fold val metrics for CNN/CLIP |
| kfold/cnn_clip/<model>_test_preds.json ×6 | 5-fold pooled predictions |
| kfold/cnn_clip/mcnemar.json | 15 pairwise McNemar tests |
| kfold/foundation_fold{0-4}_{model}.json ×15 | Per-fold test metrics for foundation models |
| splits/holdout_split_augmented.json | pHash-grouped 5-fold manifest (3.2 MB) |
| analysis/confusion_matrices/cm_<model>.png ×10 | Per-model confusion matrices |
| analysis/roc_curves/roc_<model>.png ×10 | One-vs-rest ROC curves |
| analysis/reliability_diagrams/reliability_<model>.png ×10 | Calibration reliability diagrams |
| analysis/per_class_metrics.csv | Precision / recall / F1 / support per model per class |
| analysis/ece_summary.json | ECE values for all models |
| gradcam/gradcam_<model>.png ×9 | GradCAM / input-gradient saliency maps |
| code/hparams.json | Full hyperparameter table |
| CITATION.cff | Citation File Format |
8. Compute Disclosure
All 9 models trained across 5 folds on a single NVIDIA Tesla T4 (16 GB), PyTorch 2.11.0+cu128.
| Model | Approx. GPU-hours (5 folds total) |
|---|---|
| VGG-19 | 12.5 |
| ResNet-50 | 11.5 |
| ResNet-101 | 15.5 |
| DenseNet-121 | 14.0 |
| Inception-v3 | 11.0 |
| CLIP ViT-B/16 | 19.0 |
| DINOv2-L | 56.0 |
| Swin-B | 22.5 |
| RETFound | 47.0 |
| Ensemble + stats | 0.3 |
| Total | ~209 GPU-hours |
9. Citation
@mastersthesis{katiyo2026fundus,
author = {Katiyo, Daryl Panashe},
title = {Classification of Fundus Lesion Images Using Deep Learning Models},
school = {Xidian University},
year = {2026},
note = {Companion artifact: \url{https://huggingface.co/DoB24/fundus-9model-benchmark}}
}
@dataset{nayan2023fundus,
author = {Nayan, Asma U. and Saha, Sajib K. and others},
title = {A Curated Dataset of Retinal Fundus Images for Disease Classification},
year = {2023},
doi = {10.17632/s9bfhswzjb.1},
url = {https://data.mendeley.com/datasets/s9bfhswzjb/1}
}
10. References
- Gulshan V. et al. "Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs." JAMA 316.22 (2016): 2402–2410.
- Ting D.S.W. et al. "Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases." JAMA 318.22 (2017): 2211–2223.
- Oquab M. et al. "DINOv2: Learning Robust Visual Features without Supervision." arXiv:2304.07193 (2023).
- Liu Z. et al. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." ICCV 2021.
- Zhou Y. et al. "A foundation model for generalizable disease detection from retinal images." Nature 622 (2023): 156–163.
- He K. et al. "Deep Residual Learning for Image Recognition." CVPR 2016.
- Simonyan K., Zisserman A. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR 2015.
- Huang G. et al. "Densely Connected Convolutional Networks." CVPR 2017.
- Szegedy C. et al. "Rethinking the Inception Architecture for Computer Vision." CVPR 2016.
- Radford A. et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021.
- Zhang H. et al. "mixup: Beyond Empirical Risk Minimization." ICLR 2018.
- Yun S. et al. "CutMix: Regularization Strategy to Train Strong Classifiers." ICCV 2019.
- Cubuk E.D. et al. "RandAugment: Practical Automated Data Augmentation." NeurIPS 2020.
- Vovk V., Gammerman A., Shafer G. Algorithmic Learning in a Random World. Springer, 2005.
- Bonferroni C.E. "Teoria statistica delle classi e calcolo delle probabilità." 1936.
11. License & Contact
Apache-2.0 for code and weights. Original Mendeley dataset: CC BY 4.0.
Questions / collaboration: open an Issue on the Hub repo.