Fundus Lesion Image Classification — 9-Model Comparative Benchmark

Companion artifact for the Master's thesis "Classification of Fundus Lesion Images Using Deep Learning Models" (Xidian University, 2026), by Daryl Panashe Katiyo.

Reproducible weights, predictions, and full statistical analysis for nine deep-learning backbones evaluated on a 10-class colour-fundus dataset with a group-aware (perceptual-hash) 5-fold cross-validation protocol.

Important Note on Tables

Two result tables appear in this README. They measure different things and must not be mixed:

Table	Protocol	# runs	Use for
§ 5.1: 5-fold CV (authoritative)	5 independent train/val splits, fixed holdout test	5	Paper, model comparison, all citations
§ 5.2: Single-run baseline	One training run, fold-0 split	1	Legacy reference, documents CLIP variance

Always cite § 5.1 in the paper. The single-run baseline (§ 5.2) is retained for completeness only: CLIP reaches just 86.25% there, versus a 90.15% mean over 5 folds, which points to one unlucky run rather than a weakness of the architecture.

1. Abstract

Automatic interpretation of colour fundus photographs is a foundational task for screening prevalent blinding diseases such as diabetic retinopathy, glaucoma and age-related macular degeneration. We benchmark nine deep-learning backbones spanning five architectural families: classical CNNs (VGG-19, ResNet-50, ResNet-101, DenseNet-121, Inception-v3), vision-language pretraining (OpenAI CLIP ViT-B/16), self-supervised vision transformers (DINOv2-L/14), hierarchical transformers (Swin-B), and domain-specific MAE pretraining (RETFound MAE ViT-L/16). All are evaluated on a 10-class fundus dataset of 16 242 augmented images. To suppress augmentation-induced train/test leakage we construct a group-aware (perceptual-hash) stratified 5-fold split and report mean accuracy, F1, ROC-AUC, ECE, Cohen's κ, Brier score, bootstrap 95% confidence intervals, Bonferroni-corrected McNemar tests, and 90% Mondrian conformal sets.

Headline result (5-fold CV). Classical CNNs and CLIP ViT-B/16 are statistically indistinguishable at the top, with Inception-v3 leading at 90.18% accuracy (F1 = 92.54%, κ = 0.884, ROC-AUC = 0.9930). Contrary to expectations, foundation models pretrained on general vision (DINOv2-L: 89.61%, Swin-B: 87.00%) or fundus images (RETFound: 83.35%) do not outperform classical CNNs on this moderate-sized dataset. A soft-vote ensemble of all nine models reaches ROC-AUC = 0.9941 and accuracy = 89.68%.

2. Motivation & Model Selection

Modern fundus screening pipelines are increasingly built on pre-trained image backbones, but the question "which backbone family is best for fundus disease classification on a moderately-sized, imbalanced dataset?" has no consensus answer. We deliberately chose backbones that exercise five distinct inductive biases / pretraining regimes:

Family	Backbone(s)	Why included
Classical CNNs	VGG-19, ResNet-50, ResNet-101, DenseNet-121, Inception-v3	Established baselines used in virtually all prior fundus benchmarks (Gulshan 2016, Ting 2017). Locally-connected convolutions suit texture-dominant retinal pathology.
Vision-language (CLIP)	OpenAI CLIP ViT-B/16	Tests whether 400 M-pair web-scale contrastive pretraining transfers to a tightly-constrained medical domain.
Self-supervised ViT	DINOv2-L/14	State-of-the-art general-purpose features without language supervision (Oquab 2023).
Hierarchical ViT	Swin-B	Adds hierarchy + shifted windows; competitive on ImageNet at lower compute than ViT-L (Liu 2021).
Domain MAE	RETFound MAE ViT-L/16	Pretrained on 1.6 M colour fundus images (Zhou 2023, Nature); the strongest published prior on this modality.

This grid isolates three confounders: (i) scale (ResNet-50 vs ResNet-101; ViT-B vs ViT-L); (ii) pretraining modality (ImageNet supervised vs CLIP language-supervised vs DINOv2 self-supervised vs RETFound domain-MAE); and (iii) architecture class (CNN vs ViT vs hierarchical).

3. Dataset

Source. Mendeley Data (10 classes; 5 335 original images).
Augmentation. Augmentation weighted toward the rarer classes expanded the pool to 16 242 images (rotation, horizontal flip, brightness/contrast jitter, Gaussian blur); the classes remain imbalanced afterwards (§ 3.1). Each augmented image carries its source's diagnostic label.
Companion dataset: DoB24/fundus-10class-augmented.

3.1 Class distribution

Index	Class	Original	Augmented
0	Central Serous Chorioretinopathy	101	606
1	Diabetic Retinopathy	1 509	3 444
2	Disc Edema	127	762
3	Glaucoma	1 349	2 880
4	Healthy	1 024	2 676
5	Macular Scar	444	1 937
6	Myopia	500	2 251
7	Pterygium	17	102
8	Retinal Detachment	125	750
9	Retinitis Pigmentosa	139	834
—	Total	5 335	16 242
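The table's arithmetic can be checked with a short sketch. The counts are copied from the table; the helper names `expansion_factor` and `imbalance_ratio` are illustrative and not part of the repository.

```python
# (original, augmented) image counts per class, copied from the table above.
CLASS_COUNTS = {
    "Central Serous Chorioretinopathy": (101, 606),
    "Diabetic Retinopathy": (1509, 3444),
    "Disc Edema": (127, 762),
    "Glaucoma": (1349, 2880),
    "Healthy": (1024, 2676),
    "Macular Scar": (444, 1937),
    "Myopia": (500, 2251),
    "Pterygium": (17, 102),
    "Retinal Detachment": (125, 750),
    "Retinitis Pigmentosa": (139, 834),
}


def expansion_factor(class_name):
    """How many augmented images exist per original image of this class."""
    original, augmented = CLASS_COUNTS[class_name]
    return augmented / original


def imbalance_ratio(column):
    """Largest class size divided by smallest (column 0 = original, 1 = augmented)."""
    sizes = [counts[column] for counts in CLASS_COUNTS.values()]
    return max(sizes) / min(sizes)


if __name__ == "__main__":
    for name in CLASS_COUNTS:
        print(f"{name}: x{expansion_factor(name):.2f}")
    print(f"imbalance before: {imbalance_ratio(0):.1f}, after: {imbalance_ratio(1):.1f}")
```

Running it shows that augmentation narrows the gap between the largest and smallest class without closing it, which is why the training protocol still uses a weighted sampler.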

3.2 Group-aware splitting (data-leakage prevention)

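Because the augmented set contains visually near-duplicate copies of each original image, a naïve train_test_split over the augmented pool would scatter copies of the same source image across the training and test sets, letting models score well by memorising individual images rather than learning disease features. We prevent this by: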

Computing a 64-bit perceptual hash (pHash) on every image.
Linking each augmented image to its nearest original at Hamming distance ≤ 8 → defines a group_id.
Running scikit-learn StratifiedGroupKFold(n_splits=5) so that all augmented children of a given original sit in exactly one fold.

The held-out test set is fixed across all 5 folds: 3 208 images (≈ 19.8% of the augmented pool). Exact manifest: splits/holdout_split_augmented.json (3.2 MB).
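The grouping steps above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the 64-bit hashes are already available as Python integers (for example from a perceptual-hash library), and the helper names, the `-1` marker for unmatched images, and the `seed` argument are all hypothetical. How the original pipeline treats images with no original within 8 bits is not stated in this README.

```python
def hamming64(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")


def assign_groups(original_hashes, augmented_hashes, max_dist=8):
    """Map each augmented hash to the index of its nearest original hash.

    Returns one group id per augmented image; -1 marks an image with no
    original within max_dist bits (an assumption made for this sketch).
    """
    group_ids = []
    for h in augmented_hashes:
        dists = [hamming64(h, o) for o in original_hashes]
        best = min(range(len(dists)), key=dists.__getitem__)
        group_ids.append(best if dists[best] <= max_dist else -1)
    return group_ids


def make_folds(labels, group_ids, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs that keep each group inside one fold."""
    # Imported lazily so the hashing helpers work without scikit-learn.
    from sklearn.model_selection import StratifiedGroupKFold

    splitter = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    placeholder_x = [[0]] * len(labels)  # features are not used for splitting
    yield from splitter.split(placeholder_x, labels, group_ids)
```

A useful sanity check after splitting is to confirm that no `group_id` appears in both the training and validation indices of any fold.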

4. Training Protocol

4.1 CNN / CLIP backbones

Hyper-parameter	Value
Optimizer	AdamW (β₁=0.9, β₂=0.999, weight-decay=1×10⁻⁴)
Initial LR	2×10⁻⁴
LR schedule	3-epoch linear warm-up + cosine decay to 0
Epochs	Up to 60 (early stop patience=12 on val F1)
Batch size	32
Image size	224×224 (Inception-v3: 299×299)
Preprocessing	CLAHE (LAB L-channel) → RandAugment (n=2, m=9) → ImageNet normalisation
Imbalance	`WeightedRandomSampler` (weights ∝ 1/class_count)
Regularisation	MixUp (α=0.2) + CutMix (α=1.0, p=0.7)
Mixed precision	`torch.amp.autocast` + `GradScaler`
Test-time aug	6 views (centre + 4 corners + h-flip), soft-vote mean
Hardware	PyTorch 2.11 + CUDA 12.8, NVIDIA Tesla T4 (16 GB)
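The learning-rate row in the table above can be sketched as a plain function. This assumes the schedule is stepped once per epoch and that warm-up reaches the initial LR on its last epoch; the table does not say whether the original code steps per epoch or per batch, and the name `lr_at_epoch` is illustrative.

```python
import math


def lr_at_epoch(epoch, base_lr=2e-4, warmup_epochs=3, total_epochs=60):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay to 0."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up schedule that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 1, 2, 3, 30, 59):
        print(e, f"{lr_at_epoch(e):.2e}")
```

With early stopping at patience 12, a run may end well before the cosine curve reaches zero.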

4.2 Foundation model backbones (DINOv2-L, Swin-B, RETFound)

Two-stage schedule per fold:

Stage	Layers trained	Epochs	Head LR	Backbone LR
Linear probe	Head only	20	1×10⁻³	frozen
Full fine-tune	All layers	15	1×10⁻⁴	1×10⁻⁵

Batch size: 24. Early stopping patience: 8 epochs on val F1.

4.3 Ensemble

F1-weighted soft-vote across all 9 models, using each model's validation F1 as the weight.
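A minimal sketch of F1-weighted soft voting with NumPy follows. It assumes the weights are normalised to sum to 1 before averaging, which this README does not spell out; the function name is illustrative.

```python
import numpy as np


def f1_weighted_soft_vote(probs_per_model, val_f1_per_model):
    """Combine per-model class probabilities using validation-F1 weights.

    probs_per_model: list of arrays, each of shape (n_samples, n_classes).
    val_f1_per_model: one validation F1 score per model.
    Returns (ensemble_probs, predicted_class_indices).
    """
    probs = np.stack([np.asarray(p, dtype=float) for p in probs_per_model])  # (M, N, C)
    weights = np.asarray(val_f1_per_model, dtype=float)
    weights = weights / weights.sum()  # normalise so output rows still sum to 1
    ensemble = np.tensordot(weights, probs, axes=1)  # weighted mean over models
    return ensemble, ensemble.argmax(axis=1)
```

Because the validation F1 scores of the nine models are close to each other, the resulting weights are nearly uniform, so the result should differ little from a plain average.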

5. Results

All metrics are on the fixed 3 208-image holdout test set.

5.1 Five-fold cross-validation — authoritative results (use in paper)

Acc, F1, 95% CI, ROC-AUC, and ECE are means over 5 independent training runs. κ and Brier for CNN/CLIP are from 5-fold pooled predictions. For foundation models (†), κ and Brier are from single-run inference on the same holdout (fold-level predictions not stored); acc/F1/ROC/ECE are still 5-fold averages.

Rank	Model	Acc (%)	95% CI	F1 (%)	F1 95% CI	κ	Brier	ROC-AUC	ECE
1	`inception_v3`	90.18	[89.24, 91.24]	92.54	[91.68, 93.40]	0.884	0.150	0.9930	0.0194
2	`clip_openai`	90.15	[89.18, 91.24]	92.83	[92.00, 93.61]	0.884	0.140	0.9944	0.0217
3	`vgg19`	90.12	[89.09, 91.12]	92.59	[91.77, 93.41]	0.884	0.150	0.9933	0.0228
4	`resnet101`	90.09	[89.15, 91.12]	92.63	[91.77, 93.47]	0.883	0.140	0.9941	0.0243
5	`densenet121`	89.65	[88.62, 90.71]	92.29	[91.37, 93.07]	0.878	0.150	0.9937	0.0272
6	`dinov2_l`	89.61	[88.57, 90.64]	92.27	[91.38, 93.08]	0.876†	0.160†	0.9934	0.0299
7	`resnet50`	89.50	[88.40, 90.59]	92.20	[91.34, 93.03]	0.876	0.140	0.9945	0.0339
8	`swin_b`	87.00	[85.92, 88.15]	90.26	[89.31, 91.19]	0.845†	0.190†	0.9896	0.0294
9	`retfound`	83.35	[82.16, 84.65]	87.27	[86.22, 88.35]	0.810†	0.240†	0.9834	0.0242
—	9-Model Ensemble	89.68	[88.65, 90.74]	92.25	—	0.878	0.144	0.9941	0.0198

† κ and Brier for DINOv2-L, Swin-B, RETFound from single-run holdout inference. All other foundation model metrics are 5-fold CV averages.
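For readers who want to recompute the κ and Brier columns from stored predictions, here is a hedged sketch. The Brier score below uses the sum-over-classes multi-class convention, which is an assumption; the README does not state which convention produced the table values.

```python
import numpy as np


def cohen_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa: agreement between labels and predictions beyond chance."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_observed = np.trace(cm) / n
    # Chance agreement from the row (true) and column (predicted) marginals.
    p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
    return (p_observed - p_expected) / (1.0 - p_expected)


def multiclass_brier(probs, y_true):
    """Mean over samples of the squared error summed across classes."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(y_true)]
    return float(((probs - onehot) ** 2).sum(axis=1).mean())
```

Note that `cohen_kappa` is undefined when every label and prediction falls in a single class, since chance agreement is then 1.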

Key findings:

Top-4 models (Inception-v3, CLIP, VGG-19, ResNet-101) are statistically indistinguishable: all 6 pairwise McNemar tests p > 0.05 after Bonferroni correction (see kfold/cnn_clip/mcnemar.json).
DINOv2-L (89.61%) matches DenseNet-121 (89.65%) within noise despite having far more parameters (roughly 300 M versus 8 M, nearly 40×).
RETFound, pretrained on 1.6 M fundus images, ranks last — the LP+FT protocol with patience=8 may be insufficient on this dataset size.
The ensemble reaches ROC-AUC 0.9941, on par with the best individual models (ResNet-50: 0.9945, CLIP: 0.9944), but does not exceed them in ROC-AUC or accuracy.

Machine-readable full table: kfold/kfold_v2_summary.csv
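The 95% CI columns are bootstrap intervals. A percentile bootstrap over test items is one common way to obtain them and is sketched below; the number of resamples, the seed, and the percentile method itself are assumptions, not confirmed details of the original analysis.

```python
import numpy as np


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy, resampling test items with replacement.

    Returns (point_estimate, lower, upper).
    """
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    rng = np.random.default_rng(seed)
    # Each row of idx is one resampled test set of the original size.
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    boot_acc = correct[idx].mean(axis=1)
    lo, hi = np.percentile(boot_acc, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(correct.mean()), float(lo), float(hi)
```

The same resampling loop works for F1 by replacing the per-row mean with an F1 computation on the resampled labels and predictions.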

5.2 Single-run baseline (for reference only, do not cite in paper)

Retained to document the original preliminary experiment (one run, fold-0 split). CLIP's 86.25% here vs 90.15% in § 5.1 is single-run variance, not an architectural effect.

Rank	Model	Acc (%)	95% CI	F1 (%)	κ	Brier	ROC-AUC
1	`densenet121`	89.78	[88.71, 90.80]	92.26	0.879	0.148	0.9931
2	`dinov2_l`	89.50	[88.47, 90.55]	92.15	0.876	0.155	0.9938
3	`vgg19`	89.31	[88.22, 90.40]	92.12	0.874	0.154	0.9930
4	`resnet101`	89.25	[88.19, 90.34]	92.05	0.873	0.149	0.9941
5	`inception_v3`	89.21	[88.15, 90.28]	91.97	0.873	0.157	0.9934
6	`resnet50`	89.09	[88.00, 90.12]	91.87	0.871	0.147	0.9944
7	`swin_b`	86.85	[85.69, 88.03]	90.44	0.845	0.185	0.9904
8	`clip_openai`	86.25	[85.10, 87.41]	89.99	0.838	0.195	0.9896
9	`retfound`	83.88	[82.64, 85.10]	87.68	0.810	0.238	0.9838
—	9-Model Ensemble	89.68	[88.65, 90.74]	92.25	0.878	0.144	0.9941

The ensemble row is identical in both tables — it is computed once on the fixed holdout and does not depend on the training run.

5.3 Statistical significance

All 15 pairwise Bonferroni-corrected McNemar tests among the 6 CNN/CLIP models: p > 0.05 — no statistically significant pairwise differences. Full χ² and p-value matrix: kfold/cnn_clip/mcnemar.json.
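A pairwise test of this kind can be sketched with the standard library alone. The version below uses the continuity-corrected χ² form of McNemar's test; whether the original analysis applied the continuity correction or an exact binomial variant is not stated here, so treat that choice as an assumption.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar chi-square test for two classifiers.

    Returns (chi2, p_value), computed from the discordant counts:
    b = A right and B wrong, c = A wrong and B right.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree on correctness
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)
```

With 6 models there are 15 pairs, so each raw p-value is multiplied by 15 before comparison with 0.05 (equivalently, raw p-values are compared with 0.05 / 15).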

5.4 Calibration (ECE)

Model	ECE	Note
`inception_v3`	0.0194	Best calibrated (lowest ECE)
`clip_openai`	0.0217
`vgg19`	0.0228
`retfound`	0.0242
`resnet101`	0.0243
`densenet121`	0.0272
`swin_b`	0.0294
`dinov2_l`	0.0299
`resnet50`	0.0339
9-Model Ensemble	0.0198	Second-lowest ECE

Reliability diagrams: analysis/reliability_diagrams/reliability_<model>.png.
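Expected calibration error can be recomputed from stored probabilities along these lines. The sketch uses equal-width bins over the top-1 confidence with 15 bins; the bin count and binning scheme are assumptions, since this README does not give them.

```python
import numpy as np


def expected_calibration_error(probs, y_true, n_bins=15):
    """ECE with equal-width confidence bins over the top-1 probability."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == y_true).astype(float)
    # Assign each sample to one bin; a confidence of exactly 1.0 goes in the last bin.
    bin_idx = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)
```

The per-bin accuracy and confidence pairs computed inside the loop are the same quantities a reliability diagram plots.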

6. Reproducibility

6.1 Load a model (PyTorch ≥ 2.6)

import torch, timm
from huggingface_hub import hf_hub_download

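# CNN backbone: any of inception_v3, densenet121, vgg19, resnet101, resnet50
# (change the weights filename and the timm model name together)
ckpt = hf_hub_download("DoB24/fundus-9model-benchmark",
                       "weights/inception_v3_v2_final.pth")
model = timm.create_model("inception_v3", num_classes=10)
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust
state = torch.load(ckpt, map_location="cpu", weights_only=False)
model.load_state_dict(state["model"] if "model" in state else state)
model.eval()  # inference mode: fixes batch-norm statistics, disables dropout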

# CLIP ViT-B/16
import open_clip, torch
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download("DoB24/fundus-9model-benchmark",
                       "weights/clip_openai_v2_final.pth")
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
state = torch.load(ckpt, map_location="cpu", weights_only=False)
# NOTE: strict loading raises if the checkpoint holds keys (e.g. a 10-class head)
# that the bare open_clip model lacks.
model.load_state_dict(state["model"] if "model" in state else state)
model.eval()

6.2 Inference preprocessing

from torchvision import transforms
import cv2
from PIL import Image

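def clahe_preprocess(img_path):
    """Apply CLAHE to the L channel in LAB space and return an RGB PIL image."""
    img = cv2.imread(str(img_path))  # OpenCV loads images as BGR
    if img is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge([clahe.apply(l), a, b])  # equalise lightness only
    img = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)  # PIL expects RGB
    return Image.fromarray(img)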

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),   # use 299 for inception_v3
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

6.3 Class index mapping

0: Central Serous Chorioretinopathy [Color Fundus]
1: Diabetic Retinopathy
2: Disc Edema
3: Glaucoma
4: Healthy
5: Macular Scar
6: Myopia
7: Pterygium
8: Retinal Detachment
9: Retinitis Pigmentosa
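The mapping above can be kept next to the model as a plain list so that softmax outputs turn into readable labels. The helper `top_k_labels` is illustrative and not part of the repository; it expects one probability per class in index order.

```python
# Class names in index order, copied from the mapping above.
CLASS_NAMES = [
    "Central Serous Chorioretinopathy [Color Fundus]",
    "Diabetic Retinopathy",
    "Disc Edema",
    "Glaucoma",
    "Healthy",
    "Macular Scar",
    "Myopia",
    "Pterygium",
    "Retinal Detachment",
    "Retinitis Pigmentosa",
]


def top_k_labels(probabilities, k=3):
    """Return the k most probable (class name, probability) pairs, best first."""
    if len(probabilities) != len(CLASS_NAMES):
        raise ValueError(f"expected {len(CLASS_NAMES)} probabilities")
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(CLASS_NAMES[i], p) for i, p in ranked[:k]]
```

For a PyTorch model, `probabilities` would typically be `torch.softmax(logits, dim=-1)[0].tolist()` for a single image.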

6.4 Quick inference

pip install torch torchvision timm open_clip_torch huggingface_hub pillow opencv-python
python code/inference_example.py path/to/fundus.jpg --model inception_v3

7. Files in this repository

Path	Description
`weights/<model>_v2_final.pth` ×9	Fine-tuned weights — dict with keys `model`, `optimizer`, `epoch`
`results/<model>_test.json` ×9	Single-run holdout metrics (acc, F1, κ, Brier, ROC-AUC, per-class)
`results/<model>_test_preds.json` ×9	Single-run labels + preds + probs (3 208 items)
`results/ensemble_report.json`	Ensemble + McNemar + conformal report (single-run)
`kfold/kfold_v2_summary.csv`	Authoritative 9-model 5-fold summary (machine-readable)
`kfold/cnn_clip/summary.json`	5-fold aggregated means + CI for CNN/CLIP models
`kfold/cnn_clip/<model>_kfold.json` ×6	Per-fold val metrics for CNN/CLIP
`kfold/cnn_clip/<model>_test_preds.json` ×6	5-fold pooled predictions
`kfold/cnn_clip/mcnemar.json`	15 pairwise McNemar tests
`kfold/foundation_fold{0-4}_{model}.json` ×15	Per-fold test metrics for foundation models
`splits/holdout_split_augmented.json`	pHash-grouped 5-fold manifest (3.2 MB)
`analysis/confusion_matrices/cm_<model>.png` ×10	Per-model confusion matrices
`analysis/roc_curves/roc_<model>.png` ×10	One-vs-rest ROC curves
`analysis/reliability_diagrams/reliability_<model>.png` ×10	Calibration reliability diagrams
`analysis/per_class_metrics.csv`	Precision / recall / F1 / support per model per class
`analysis/ece_summary.json`	ECE values all models
`gradcam/gradcam_<model>.png` ×9	GradCAM / input-gradient saliency maps
`code/hparams.json`	Full hyperparameter table
`CITATION.cff`	Citation File Format

8. Compute Disclosure

All 9 models trained across 5 folds on a single NVIDIA Tesla T4 (16 GB), PyTorch 2.11.0+cu128.

Model	Approx. GPU-hours (5 folds total)
VGG-19	12.5
ResNet-50	11.5
ResNet-101	15.5
DenseNet-121	14.0
Inception-v3	11.0
CLIP ViT-B/16	19.0
DINOv2-L	56.0
Swin-B	22.5
RETFound	47.0
Ensemble + stats	0.3
Total	~209 GPU-hours

9. Citation

@mastersthesis{katiyo2026fundus,
  author  = {Katiyo, Daryl Panashe},
  title   = {Classification of Fundus Lesion Images Using Deep Learning Models},
  school  = {Xidian University},
  year    = {2026},
  note    = {Companion artifact: \url{https://huggingface.co/DoB24/fundus-9model-benchmark}}
}

@dataset{nayan2023fundus,
  author  = {Nayan, Asma U. and Saha, Sajib K. and others},
  title   = {A Curated Dataset of Retinal Fundus Images for Disease Classification},
  year    = {2023},
  doi     = {10.17632/s9bfhswzjb.1},
  url     = {https://data.mendeley.com/datasets/s9bfhswzjb/1}
}

10. References

Gulshan V. et al. "Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs." JAMA 316.22 (2016): 2402–2410.
Ting D.S.W. et al. "Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases." JAMA 318.22 (2017): 2211–2223.
Oquab M. et al. "DINOv2: Learning Robust Visual Features without Supervision." arXiv:2304.07193 (2023).
Liu Z. et al. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." ICCV 2021.
Zhou Y. et al. "A foundation model for generalizable disease detection from retinal images." Nature 622 (2023): 156–163.
He K. et al. "Deep Residual Learning for Image Recognition." CVPR 2016.
Simonyan K., Zisserman A. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR 2015.
Huang G. et al. "Densely Connected Convolutional Networks." CVPR 2017.
Szegedy C. et al. "Rethinking the Inception Architecture for Computer Vision." CVPR 2016.
Radford A. et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021.
Zhang H. et al. "mixup: Beyond Empirical Risk Minimization." ICLR 2018.
Yun S. et al. "CutMix: Regularization Strategy to Train Strong Classifiers." ICCV 2019.
Cubuk E.D. et al. "RandAugment: Practical Automated Data Augmentation." NeurIPS 2020.
Vovk V., Gammerman A., Shafer G. Algorithmic Learning in a Random World. Springer, 2005.
Bonferroni C.E. "Teoria statistica delle classi e calcolo delle probabilità." 1936.

11. License & Contact

Apache-2.0 for code and weights. Original Mendeley dataset: CC BY 4.0.

Questions / collaboration: open an Issue on the Hub repo.
