Searching for better (Full) ImageNet ViT Baselines

Community Article · Published August 26, 2024

timm 1.0.9 was just released. Included are a few new ImageNet-12k and ImageNet-12k -> ImageNet-1k weights in my Searching for Better ViT Baselines series.
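
The new weights load like any other timm model. A minimal sketch (the model name is one of the new in12k -> in1k fine-tunes, picked from the table below):

```python
import timm
import torch

# One of the new ImageNet-12k -> ImageNet-1k fine-tuned sbb weights (see the table below).
model = timm.create_model(
    'vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k',
    pretrained=True,
).eval()

# Preprocessing that matches the pretrained config (for use on real images).
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

# Quick shape check with a dummy input.
with torch.inference_mode():
    out = model(torch.randn(1, *data_cfg['input_size']))
print(out.shape)  # torch.Size([1, 1000])
```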

I'd like to highlight these models as they're on the Pareto front for ImageNet-12k / ImageNet-22k pretrained models. It's interesting to look at models with comparable ImageNet-22k fine-tunes to see how competitive (near) vanilla ViTs are with other architectures. With optimized attention kernels enabled (the default in timm), they are well ahead of Swin and hold up just fine relative to ConvNeXt, etc.
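
On the attention kernel point: recent timm versions dispatch ViT attention to PyTorch's fused scaled_dot_product_attention when it's available, so the speeds in the table need no extra setup. A quick way to sanity-check this, assuming a timm build that exposes `timm.layers.use_fused_attn`:

```python
from timm.layers import use_fused_attn

# True when timm's attention modules will route through
# torch.nn.functional.scaled_dot_product_attention (fused kernels).
print(use_fused_attn())
```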

Something else worth pointing out: the deit3 model weights are a remarkable and underappreciated set of weights. The upper end of my sbb weights match deit3 at equivalent compute -- it's also a great recipe. Though, one of my goals with the sbb recipes was to allow easier fine-tuning. In opting for a less exotic augmentation scheme, sticking with AdamW, and sacrificing some top-1 (higher weight decay), I feel that was achieved. Through several fine-tune trials I've found the sbb ViT weights easier to fit to other, especially smaller, datasets (Oxford Pets, RESISC, etc.) w/ short runs.
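
As a rough illustration of what that fine-tuning looks like in practice, here's a minimal sketch of adapting one of the sbb weights to a small dataset with plain AdamW. The dataset wiring is omitted and the hyper-parameters are illustrative, not the exact settings from my trials:

```python
import timm
import torch

device = 'cuda'

# sbb ViT with a fresh classifier head sized for the target dataset
# (e.g. 37 classes for Oxford Pets).
model = timm.create_model(
    'vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k',
    pretrained=True,
    num_classes=37,
).to(device)

# Plain AdamW and a short cosine schedule -- nothing exotic.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
criterion = torch.nn.CrossEntropyLoss()

def train_one_epoch(loader):
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```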

NOTE: all throughput measurements were done on an RTX 4090 w/ AMP and torch.compile() enabled, PyTorch 2.4, CUDA 12.4.
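
A rough sketch of that kind of measurement is below. It is not necessarily the exact script used (the timm repo also ships a `benchmark.py` that handles this more carefully), and the batch size here is illustrative:

```python
import time
import torch
import timm

model = timm.create_model('vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k')
model = torch.compile(model.cuda().eval())

batch_size = 256  # illustrative, not necessarily what produced the numbers below
x = torch.randn(batch_size, 3, 256, 256, device='cuda')

with torch.inference_mode(), torch.autocast('cuda', dtype=torch.float16):
    for _ in range(10):  # warmup, also triggers compilation
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f'{50 * batch_size / elapsed:.1f} samples/sec')
```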

Bold rows: Pareto frontier models

| model | img_size | samples_per_sec | top1 | top5 | param_count (M) |
|:---|---:|---:|---:|---:|---:|
| **deit3_base_patch16_224.fb_in22k_ft_in1k** | **224** | **3326.85** | **85.73** | **97.75** | **86.59** |
| vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k | 256 | 3302.28 | 85.73 | 97.61 | 60.40 |
| vit_base_patch16_224.augreg2_in21k_ft_in1k | 224 | 3278.15 | 85.11 | 97.54 | 86.57 |
| vit_base_patch16_224.augreg_in21k_ft_in1k | 224 | 3274.99 | 84.53 | 97.30 | 86.57 |
| **vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k** | **256** | **2761.64** | **86.60** | **97.94** | **64.11** |
| **caformer_m36.sail_in22k_ft_in1k** | **224** | **2345.11** | **86.61** | **98.04** | **56.20** |
| convformer_m36.sail_in22k_ft_in1k | 224 | 2319.68 | 86.15 | 97.85 | 57.05 |
| swin_base_patch4_window7_224.ms_in22k_ft_in1k | 224 | 2176.48 | 85.27 | 97.57 | 87.77 |
| regnety_160.sw_in12k_ft_in1k | 224 | 2098.25 | 85.59 | 97.67 | 83.59 |
| coatnet_2_rw_224.sw_in12k_ft_in1k | 224 | 1753.63 | 86.58 | 97.90 | 73.87 |
| vit_betwixt_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k | 384 | 1467.64 | 86.60 | 98.02 | 60.60 |
| convnext_large.fb_in22k_ft_in1k | 224 | 1457.60 | 86.61 | 98.04 | 197.77 |
| convnext_small.in12k_ft_in1k_384 | 384 | 1350.43 | 86.19 | 97.92 | 50.22 |
| seresnextaa101d_32x8d.sw_in12k_ft_in1k_288 | 288 | 1297.79 | 86.54 | 98.09 | 93.59 |
| regnety_160.sw_in12k_ft_in1k | 288 | 1260.01 | 86.03 | 97.83 | 83.59 |
| swin_large_patch4_window7_224.ms_in22k_ft_in1k | 224 | 1243.73 | 86.33 | 97.88 | 196.53 |
| **vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k** | **384** | **1214.59** | **87.44** | **98.26** | **64.27** |
| deit3_base_patch16_384.fb_in22k_ft_in1k | 384 | 1098.30 | 86.74 | 98.11 | 86.88 |
| deit3_large_patch16_224.fb_in22k_ft_in1k | 224 | 1042.41 | 86.99 | 98.24 | 304.37 |
| vit_large_patch16_224.augreg_in21k_ft_in1k | 224 | 1041.47 | 85.85 | 97.83 | 304.33 |
| seresnextaa101d_32x8d.sw_in12k_ft_in1k_288 | 320 | 1035.83 | 86.72 | 98.18 | 93.59 |
| convnext_xlarge.fb_in22k_ft_in1k | 224 | 921.30 | 86.97 | 98.20 | 350.20 |
| convnext_large.fb_in22k_ft_in1k | 288 | 881.61 | 87.01 | 98.21 | 197.77 |
| **caformer_m36.sail_in22k_ft_in1k_384** | **384** | **794.45** | **87.47** | **98.31** | **56.20** |
| efficientnet_b5.sw_in12k_ft_in1k | 448 | 729.86 | 85.89 | 97.74 | 30.39 |
| convnext_xlarge.fb_in22k_ft_in1k | 288 | 559.14 | 87.37 | 98.33 | 350.20 |
| swin_base_patch4_window12_384.ms_in22k_ft_in1k | 384 | 522.86 | 86.44 | 98.07 | 87.90 |
| convnext_large.fb_in22k_ft_in1k_384 | 384 | 500.83 | 87.46 | 98.38 | 197.77 |
| **maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k** | **384** | **456.17** | **87.48** | **98.37** | **116.09** |
| coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k | 384 | 404.42 | 87.40 | 98.31 | 73.88 |
| seresnextaa201d_32x8d.sw_in12k_ft_in1k_384 | 384 | 365.65 | 87.31 | 98.33 | 149.39 |
| **deit3_large_patch16_384.fb_in22k_ft_in1k** | **384** | **342.41** | **87.73** | **98.51** | **304.76** |
| vit_large_patch16_384.augreg_in21k_ft_in1k | 384 | 338.21 | 87.09 | 98.31 | 304.72 |
| swin_large_patch4_window12_384.ms_in22k_ft_in1k | 384 | 315.38 | 87.14 | 98.23 | 196.74 |
| swinv2_base_window12to24_192to384.ms_in22k_ft_in1k | 384 | 297.03 | 87.14 | 98.23 | 87.92 |
| swinv2_large_window12to24_192to384.ms_in22k_ft_in1k | 384 | 186.30 | 87.47 | 98.26 | 196.74 |
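
For reference, the bolded rows follow the usual throughput vs. top-1 definition and can be recomputed mechanically from the table. A minimal sketch, assuming each row is a dict with `samples_per_sec` and `top1` keys:

```python
def pareto_frontier(rows):
    """Keep rows not dominated on (samples_per_sec, top1).

    A row is dominated if some other row is at least as fast and at least as
    accurate, and strictly better on one of the two.
    """
    front = []
    for r in rows:
        dominated = any(
            o['samples_per_sec'] >= r['samples_per_sec']
            and o['top1'] >= r['top1']
            and (o['samples_per_sec'] > r['samples_per_sec'] or o['top1'] > r['top1'])
            for o in rows
            if o is not r
        )
        if not dominated:
            front.append(r)
    return front
```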