ViTDet: Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao, Ross Girshick†, Kaiming He†

[arXiv] [BibTeX]

In this repository, we provide configs and models in Detectron2 for ViTDet, as well as MViTv2 and Swin backbones, with our implementation and settings as described in the ViTDet paper.

Pretrained Models

COCO

Mask R-CNN

| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- | --- |
| ViTDet, ViT-B | IN1K, MAE | 0.314 | 0.079 | 10.9 | 51.6 | 45.9 | 325346929 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.603 | 0.125 | 20.9 | 55.5 | 49.2 | 325599698 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.098 | 0.178 | 31.5 | 56.7 | 50.2 | 329145471 | model |

Cascade Mask R-CNN

| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- | --- |
| Swin-B | IN21K, sup | 0.389 | 0.077 | 8.7 | 53.9 | 46.2 | 342979038 | model |
| Swin-L | IN21K, sup | 0.508 | 0.097 | 12.6 | 55.0 | 47.2 | 342979186 | model |
| MViTv2-B | IN21K, sup | 0.475 | 0.090 | 8.9 | 55.6 | 48.1 | 325820315 | model |
| MViTv2-L | IN21K, sup | 0.844 | 0.157 | 19.7 | 55.7 | 48.3 | 325607715 | model |
| MViTv2-H | IN21K, sup | 1.655 | 0.285 | 18.4* | 55.9 | 48.3 | 326187358 | model |
| ViTDet, ViT-B | IN1K, MAE | 0.362 | 0.089 | 12.3 | 54.0 | 46.7 | 325358525 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.643 | 0.142 | 22.3 | 57.6 | 50.0 | 328021305 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.137 | 0.196 | 32.9 | 58.7 | 51.0 | 328730692 | model |

LVIS

Mask R-CNN

| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- | --- |
| ViTDet, ViT-B | IN1K, MAE | 0.317 | 0.085 | 14.4 | 40.2 | 38.2 | 329225748 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.576 | 0.137 | 24.7 | 46.1 | 43.6 | 329211570 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.059 | 0.186 | 35.3 | 49.1 | 46.0 | 332434656 | model |

Cascade Mask R-CNN

| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- | --- |
| Swin-B | IN21K, sup | 0.368 | 0.090 | 11.5 | 44.0 | 39.6 | 329222304 | model |
| Swin-L | IN21K, sup | 0.486 | 0.105 | 13.8 | 46.0 | 41.4 | 329222724 | model |
| MViTv2-B | IN21K, sup | 0.475 | 0.100 | 11.8 | 46.3 | 42.0 | 329477206 | model |
| MViTv2-L | IN21K, sup | 0.844 | 0.172 | 21.0 | 49.4 | 44.2 | 329661552 | model |
| MViTv2-H | IN21K, sup | 1.661 | 0.290 | 21.3* | 49.5 | 44.1 | 330445165 | model |
| ViTDet, ViT-B | IN1K, MAE | 0.356 | 0.099 | 15.2 | 43.0 | 38.9 | 329226874 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.629 | 0.150 | 24.9 | 49.2 | 44.5 | 329042206 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.100 | 0.204 | 35.5 | 51.5 | 46.6 | 332552778 | model |

Note: Unlike the system-level comparisons in the paper, these models use a lower resolution (1024 instead of 1280) and standard NMS (instead of soft NMS). As a result, they have slightly lower box and mask AP.

We observed higher variance on LVIS evaluation results compared to COCO. For example, the standard deviations of box AP and mask AP were 0.30% (compared to 0.10% on COCO) when we trained ViTDet, ViT-B five times with varying random seeds.

The above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total. *: Activation checkpointing is used.

Training

All configs can be trained with:

../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py

By default, we use 64 GPUs with a total batch size of 64 for training; a sketch for launching on fewer GPUs is shown below.
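
The training script accepts the standard Detectron2 launch arguments and lazy-config key=value overrides, so a single-machine run might look like the line below. The override key `dataloader.train.total_batch_size` is an assumption based on the common Detectron2 lazy configs and may differ per config file:

../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --num-gpus 8 dataloader.train.total_batch_size=8

When the batch size is reduced this way, the learning rate and schedule typically need to be rescaled accordingly.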

Evaluation

Model evaluation can be done similarly:

../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
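
A checkpoint can also be loaded programmatically instead of through the training script. The following is a minimal sketch using Detectron2's LazyConfig/instantiate API and DetectionCheckpointer; the config and checkpoint paths are placeholders, and the dummy input is only for illustration:

```python
import torch
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer

# Placeholder paths: point these at one of the configs in this directory
# and at the corresponding downloaded checkpoint.
cfg = LazyConfig.load("configs/path/to/config.py")

model = instantiate(cfg.model)   # build the detector described by the config
model.to(cfg.train.device)       # typically "cuda" in the default configs
DetectionCheckpointer(model).load("/path/to/model_checkpoint")
model.eval()

# Detectron2 detectors take a list of dicts; "image" is a CHW tensor in the
# channel order given by the config's input format.
image = torch.zeros(3, 1024, 1024)  # dummy input, for illustration only
with torch.no_grad():
    outputs = model([{"image": image, "height": 1024, "width": 1024}])
print(outputs[0]["instances"])
```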

Citing ViTDet

If you use ViTDet, please use the following BibTeX entry.

@article{li2022exploring,
  title={Exploring plain vision transformer backbones for object detection},
  author={Li, Yanghao and Mao, Hanzi and Girshick, Ross and He, Kaiming},
  journal={arXiv preprint arXiv:2203.16527},
  year={2022}
}