Searching for better (Full) ImageNet ViT Baselines

Community Article · Published August 26, 2024

timm 1.0.9 was just released. Included are a few new ImageNet-12k and ImageNet-12k -> ImageNet-1k weights in my Searching for Better ViT Baselines series.
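
The new weights load like any other timm model. A minimal sketch (the model name is one of the new in12k -> in1k fine-tunes, picked from the table below):

```python
import timm
import torch

# One of the new ImageNet-12k -> ImageNet-1k fine-tuned sbb weights (see the table below).
model = timm.create_model(
    'vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k',
    pretrained=True,
).eval()

# Preprocessing that matches the pretrained config (for use on real images).
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

# Quick shape check with a dummy input.
with torch.inference_mode():
    out = model(torch.randn(1, *data_cfg['input_size']))
print(out.shape)  # torch.Size([1, 1000])
```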

I'd like to highlight these models as they're on the Pareto front for ImageNet-12k / ImageNet-22k pretrained models. It's interesting to look at models with comparable ImageNet-22k fine-tunes to see how competitive (near) vanilla ViTs are with other architectures. With optimized attention kernels enabled (the default in timm), they are well ahead of Swin and hold up just fine relative to ConvNeXt, etc.
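
On the attention kernel point: recent timm versions dispatch ViT attention to PyTorch's fused scaled_dot_product_attention when it's available, so the speeds in the table need no extra setup. A quick way to sanity-check this, assuming a timm build that exposes `timm.layers.use_fused_attn`:

```python
from timm.layers import use_fused_attn

# True when timm's attention modules will route through
# torch.nn.functional.scaled_dot_product_attention (fused kernels).
print(use_fused_attn())
```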

Something else worth pointing out: the deit3 model weights are a remarkable and underappreciated set of weights. The upper end of my sbb weights match deit3 at equivalent compute -- it's also a great recipe. Though, one of my goals with the sbb recipes was to allow easier fine-tuning. In opting for a less exotic augmentation scheme, sticking with AdamW, and sacrificing some top-1 (higher weight decay), I feel that was achieved. Through several fine-tune trials I've found the sbb ViT weights easier to fit to other, especially smaller, datasets (Oxford Pets, RESISC, etc.) w/ short runs.
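
As a rough illustration of what that fine-tuning looks like in practice, here's a minimal sketch of adapting one of the sbb weights to a small dataset with plain AdamW. The dataset wiring is omitted and the hyper-parameters are illustrative, not the exact settings from my trials:

```python
import timm
import torch

device = 'cuda'

# sbb ViT with a fresh classifier head sized for the target dataset
# (e.g. 37 classes for Oxford Pets).
model = timm.create_model(
    'vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k',
    pretrained=True,
    num_classes=37,
).to(device)

# Plain AdamW and a short cosine schedule -- nothing exotic.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
criterion = torch.nn.CrossEntropyLoss()

def train_one_epoch(loader):
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```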

NOTE: all throughput measurements were done on an RTX 4090 w/ AMP and torch.compile() enabled, PyTorch 2.4, CUDA 12.4.
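
A rough sketch of that kind of measurement is below. It is not necessarily the exact script used (the timm repo also ships a `benchmark.py` that handles this more carefully), and the batch size here is illustrative:

```python
import time
import torch
import timm

model = timm.create_model('vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k')
model = torch.compile(model.cuda().eval())

batch_size = 256  # illustrative, not necessarily what produced the numbers below
x = torch.randn(batch_size, 3, 256, 256, device='cuda')

with torch.inference_mode(), torch.autocast('cuda', dtype=torch.float16):
    for _ in range(10):  # warmup, also triggers compilation
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f'{50 * batch_size / elapsed:.1f} samples/sec')
```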

Bold rows: Pareto frontier models

| model | img_size | samples_per_sec | top1 | top5 | param_count (M) |
|:---|---:|---:|---:|---:|---:|
| **deit3_base_patch16_224.fb_in22k_ft_in1k** | **224** | **3326.85** | **85.73** | **97.75** | **86.59** |
| vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k | 256 | 3302.28 | 85.73 | 97.61 | 60.40 |
| vit_base_patch16_224.augreg2_in21k_ft_in1k | 224 | 3278.15 | 85.11 | 97.54 | 86.57 |
| vit_base_patch16_224.augreg_in21k_ft_in1k | 224 | 3274.99 | 84.53 | 97.30 | 86.57 |
| **vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k** | **256** | **2761.64** | **86.60** | **97.94** | **64.11** |
| **caformer_m36.sail_in22k_ft_in1k** | **224** | **2345.11** | **86.61** | **98.04** | **56.20** |
| convformer_m36.sail_in22k_ft_in1k | 224 | 2319.68 | 86.15 | 97.85 | 57.05 |
| swin_base_patch4_window7_224.ms_in22k_ft_in1k | 224 | 2176.48 | 85.27 | 97.57 | 87.77 |
| regnety_160.sw_in12k_ft_in1k | 224 | 2098.25 | 85.59 | 97.67 | 83.59 |
| coatnet_2_rw_224.sw_in12k_ft_in1k | 224 | 1753.63 | 86.58 | 97.90 | 73.87 |
| vit_betwixt_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k | 384 | 1467.64 | 86.60 | 98.02 | 60.60 |
| convnext_large.fb_in22k_ft_in1k | 224 | 1457.60 | 86.61 | 98.04 | 197.77 |
| convnext_small.in12k_ft_in1k_384 | 384 | 1350.43 | 86.19 | 97.92 | 50.22 |
| seresnextaa101d_32x8d.sw_in12k_ft_in1k_288 | 288 | 1297.79 | 86.54 | 98.09 | 93.59 |
| regnety_160.sw_in12k_ft_in1k | 288 | 1260.01 | 86.03 | 97.83 | 83.59 |
| swin_large_patch4_window7_224.ms_in22k_ft_in1k | 224 | 1243.73 | 86.33 | 97.88 | 196.53 |
| **vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k** | **384** | **1214.59** | **87.44** | **98.26** | **64.27** |
| deit3_base_patch16_384.fb_in22k_ft_in1k | 384 | 1098.30 | 86.74 | 98.11 | 86.88 |
| deit3_large_patch16_224.fb_in22k_ft_in1k | 224 | 1042.41 | 86.99 | 98.24 | 304.37 |
| vit_large_patch16_224.augreg_in21k_ft_in1k | 224 | 1041.47 | 85.85 | 97.83 | 304.33 |
| seresnextaa101d_32x8d.sw_in12k_ft_in1k_288 | 320 | 1035.83 | 86.72 | 98.18 | 93.59 |
| convnext_xlarge.fb_in22k_ft_in1k | 224 | 921.30 | 86.97 | 98.20 | 350.20 |
| convnext_large.fb_in22k_ft_in1k | 288 | 881.61 | 87.01 | 98.21 | 197.77 |
| **caformer_m36.sail_in22k_ft_in1k_384** | **384** | **794.45** | **87.47** | **98.31** | **56.20** |
| efficientnet_b5.sw_in12k_ft_in1k | 448 | 729.86 | 85.89 | 97.74 | 30.39 |
| convnext_xlarge.fb_in22k_ft_in1k | 288 | 559.14 | 87.37 | 98.33 | 350.20 |
| swin_base_patch4_window12_384.ms_in22k_ft_in1k | 384 | 522.86 | 86.44 | 98.07 | 87.90 |
| convnext_large.fb_in22k_ft_in1k_384 | 384 | 500.83 | 87.46 | 98.38 | 197.77 |
| **maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k** | **384** | **456.17** | **87.48** | **98.37** | **116.09** |
| coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k | 384 | 404.42 | 87.40 | 98.31 | 73.88 |
| seresnextaa201d_32x8d.sw_in12k_ft_in1k_384 | 384 | 365.65 | 87.31 | 98.33 | 149.39 |
| **deit3_large_patch16_384.fb_in22k_ft_in1k** | **384** | **342.41** | **87.73** | **98.51** | **304.76** |
| vit_large_patch16_384.augreg_in21k_ft_in1k | 384 | 338.21 | 87.09 | 98.31 | 304.72 |
| swin_large_patch4_window12_384.ms_in22k_ft_in1k | 384 | 315.38 | 87.14 | 98.23 | 196.74 |
| swinv2_base_window12to24_192to384.ms_in22k_ft_in1k | 384 | 297.03 | 87.14 | 98.23 | 87.92 |
| swinv2_large_window12to24_192to384.ms_in22k_ft_in1k | 384 | 186.30 | 87.47 | 98.26 | 196.74 |
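
For reference, the bolded rows follow the usual throughput vs. top-1 definition and can be recomputed mechanically from the table. A minimal sketch, assuming each row is a dict with `samples_per_sec` and `top1` keys:

```python
def pareto_frontier(rows):
    """Keep rows not dominated on (samples_per_sec, top1).

    A row is dominated if some other row is at least as fast and at least as
    accurate, and strictly better on one of the two.
    """
    front = []
    for r in rows:
        dominated = any(
            o['samples_per_sec'] >= r['samples_per_sec']
            and o['top1'] >= r['top1']
            and (o['samples_per_sec'] > r['samples_per_sec'] or o['top1'] > r['top1'])
            for o in rows
            if o is not r
        )
        if not dominated:
            front.append(r)
    return front
```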