YOLOv7

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Abstract

YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy, 56.8% AP, among all known real-time object detectors running at 30 FPS or higher on a V100 GPU. The YOLOv7-E6 object detector (56 FPS on V100, 55.9% AP) outperforms both the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP) by 509% in speed and 2% in accuracy, and the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy. YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other object detectors in speed and accuracy. Moreover, we train YOLOv7 only on the MS COCO dataset from scratch, without using any other datasets or pre-trained weights. Source code is released in this https URL.
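The percentages quoted in the abstract can be reproduced from the FPS and AP figures it lists. The sketch below assumes "X% in speed" means the relative FPS increase over the reference detector, and that the accuracy gap is a difference in AP points; the function and variable names are illustrative only. Note that, as the abstract states, the FPS figures were measured on different GPUs (V100 versus A100).

```python
def relative_speedup_percent(fps_new, fps_ref):
    """Relative FPS increase of fps_new over fps_ref, in percent."""
    return (fps_new - fps_ref) / fps_ref * 100.0


def ap_gap(ap_new, ap_ref):
    """Difference in AP points, rounded to one decimal place."""
    return round(ap_new - ap_ref, 1)


# Figures taken from the abstract.
yolov7_e6 = {"fps": 56.0, "ap": 55.9}   # measured on V100
swin_l = {"fps": 9.2, "ap": 53.9}       # measured on A100
convnext_xl = {"fps": 8.6, "ap": 55.2}  # measured on A100

speedup_vs_swin = round(relative_speedup_percent(yolov7_e6["fps"], swin_l["fps"]))
speedup_vs_convnext = round(relative_speedup_percent(yolov7_e6["fps"], convnext_xl["fps"]))
gap_vs_swin = ap_gap(yolov7_e6["ap"], swin_l["ap"])
gap_vs_convnext = ap_gap(yolov7_e6["ap"], convnext_xl["ap"])

print(speedup_vs_swin, gap_vs_swin)          # speed-up in %, AP gap vs SWIN-L
print(speedup_vs_convnext, gap_vs_convnext)  # speed-up in %, AP gap vs ConvNeXt-XL
```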

YOLOv7-l-P5 model structure

Results and models

COCO

Backbone	Arch	Size	SyncBN	AMP	Mem (GB)	Box AP	Config	Download
YOLOv7-tiny	P5	640	Yes	Yes	2.7	37.5	config	model | log
YOLOv7-l	P5	640	Yes	Yes	10.3	50.9	config	model | log
YOLOv7-x	P5	640	Yes	Yes	13.7	52.8	config	model | log
YOLOv7-w	P6	1280	Yes	Yes	27.0	54.1	config	model | log
YOLOv7-e	P6	1280	Yes	Yes	42.5	55.1	config	model | log
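One way to read the table is as a trade-off between the Mem (GB) column and Box AP. The following is a minimal sketch that picks the highest-AP entry whose listed memory figure fits a given budget. The `ModelEntry` type and `best_under_memory` helper are hypothetical and not part of MMYOLO; the numbers are copied from the table, and the budget is compared directly against the Mem (GB) value, which may differ from what you observe with another batch size or hardware setup.

```python
from typing import NamedTuple, Optional


class ModelEntry(NamedTuple):
    name: str
    arch: str
    size: int
    mem_gb: float
    box_ap: float


# Rows copied from the COCO results table.
MODELS = [
    ModelEntry("YOLOv7-tiny", "P5", 640, 2.7, 37.5),
    ModelEntry("YOLOv7-l", "P5", 640, 10.3, 50.9),
    ModelEntry("YOLOv7-x", "P5", 640, 13.7, 52.8),
    ModelEntry("YOLOv7-w", "P6", 1280, 27.0, 54.1),
    ModelEntry("YOLOv7-e", "P6", 1280, 42.5, 55.1),
]


def best_under_memory(budget_gb: float) -> Optional[ModelEntry]:
    """Return the highest-AP entry whose Mem (GB) is within the budget."""
    candidates = [m for m in MODELS if m.mem_gb <= budget_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.box_ap)


choice = best_under_memory(16.0)
print(choice.name if choice else "no model fits")
```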

Note: In the official YOLOv7 code, the random_perspective data augmentation used when training COCO object detection relies on mask annotations, which leads to higher performance. Object detection should not use mask annotations, so MMYOLO uses only box annotations. Mask annotations will be used in the instance segmentation task.
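A small geometric sketch can illustrate why mask annotations help a perspective-style augmentation. When only a box is available, the four box corners are warped and re-enclosed in an axis-aligned box, which can be much looser than the box obtained by warping the object outline itself. The code below is an illustration under stated assumptions, not the official or MMYOLO implementation: the object is a hypothetical circle, the transform is a pure 45-degree rotation, and all function names are made up for this example.

```python
import math


def warp_points(points, matrix):
    """Apply a 3x3 perspective matrix to a list of (x, y) points."""
    warped = []
    for x, y in points:
        xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
        yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
        w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
        warped.append((xh / w, yh / w))
    return warped


def enclosing_box(points):
    """Axis-aligned box (x1, y1, x2, y2) that encloses all points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def rotation_matrix(degrees, cx, cy):
    """Rotation about (cx, cy) as a 3x3 homogeneous matrix."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [
        [c, -s, cx - c * cx + s * cy],
        [s, c, cy - s * cx - c * cy],
        [0.0, 0.0, 1.0],
    ]


# Hypothetical object: a circle of radius 50 centred at (100, 100),
# sampled as a 64-point outline (a stand-in for a mask polygon).
outline = [
    (100 + 50 * math.cos(2 * math.pi * k / 64),
     100 + 50 * math.sin(2 * math.pi * k / 64))
    for k in range(64)
]
x1, y1, x2, y2 = enclosing_box(outline)
corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]

M = rotation_matrix(45, 100, 100)
box_from_corners = enclosing_box(warp_points(corners, M))  # box-only labels
box_from_outline = enclosing_box(warp_points(outline, M))  # mask-derived labels

print(box_area(box_from_corners), box_area(box_from_outline))
```

In this example the box derived from the warped corners covers roughly twice the area of the box derived from the warped outline, so box-only augmentation produces looser training targets.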

Performance is unstable and may fluctuate by about 0.3 mAP. The results shown above are from the best model.
If you need YOLOv7-e2e weights, you can either train with the provided configs or convert the official weights with the converter script.
fast means that YOLOv5DetDataPreprocessor and yolov5_collate are used for data preprocessing, which speeds up training but is less flexible for multitask setups. We recommend the fast config if you only care about object detection.
SyncBN indicates training with synchronized batch normalization; AMP indicates training with mixed precision.
We train on 8x A100 GPUs with a batch size of 16 per GPU. This differs from the official code.
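The training setup above implies a total batch size of 8 x 16 = 128. If you train on a different number of GPUs, a common heuristic (an assumption here, not something this page prescribes) is the linear scaling rule, which scales the learning rate in proportion to the total batch size. The sketch below uses a hypothetical base learning rate of 0.01 purely for illustration; check the actual config for the real value.

```python
def total_batch_size(num_gpus: int, batch_per_gpu: int) -> int:
    """Total batch size across all GPUs."""
    return num_gpus * batch_per_gpu


def scaled_lr(base_lr: float, base_total_batch: int, new_total_batch: int) -> float:
    """Linear scaling rule: learning rate proportional to total batch size."""
    return base_lr * new_total_batch / base_total_batch


# Setup described on this page: 8 GPUs, 16 images per GPU.
reference_batch = total_batch_size(8, 16)

# Hypothetical alternative setup: 4 GPUs with the same per-GPU batch size.
my_batch = total_batch_size(4, 16)

# HYPOTHETICAL base learning rate, for illustration only.
base_lr = 0.01
my_lr = scaled_lr(base_lr, reference_batch, my_batch)

print(reference_batch, my_batch, my_lr)
```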

Citation

@article{wang2022yolov7,
  title={{YOLOv7}: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors},
  author={Wang, Chien-Yao and Bochkovskiy, Alexey and Liao, Hong-Yuan Mark},
  journal={arXiv preprint arXiv:2207.02696},
  year={2022}
}