
DOSOD
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space


Yonghao He1,*,🌟, Hu Su2,*,📧, Haiyong Yu1,*, Cong Yang3, Wei Sui1, Cong Wang1, Song Liu4,📧

* Equal contribution, 🌟 Project lead, 📧 Corresponding author

1 D-Robotics,
2 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation of Chinese Academy of Sciences,
3 BeeLab, School of Future Science and Engineering, Soochow University,
4 School of Information Science and Technology, ShanghaiTech University


1. Introduction

1.1 Brief Introduction of DOSOD

YOLO-World established a new state of the art in open-vocabulary object detection, and the task has since been applied in a wide range of scenarios, with real-time open-vocabulary detection attracting particular attention. In our paper, we propose Decoupled Open-Set Object Detection (DOSOD), a practical and highly efficient solution for real-time open-set object detection (OSOD) in robotic systems. DOSOD builds on the YOLO-World pipeline and integrates a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor maps the text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, which avoids complex feature interactions and improves computational efficiency. During testing, DOSOD functions like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection.
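The decoupled alignment described above can be illustrated with a minimal sketch. It is not the released DOSOD implementation: the `MLPAdaptor` class, the two-layer structure, the toy dimensions, the random weights, and the cosine-style scoring are all assumptions made for illustration, and plain Python lists are used to keep the example dependency-free. The point it shows is that the text branch runs once offline, after which region classification reduces to a fixed dot product, as in a closed-set detector.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def l2_normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class MLPAdaptor:
    """Hypothetical two-layer MLP mapping text embeddings into the joint space."""

    def __init__(self, rng, text_dim, hidden_dim, joint_dim):
        self.w1 = random_matrix(rng, hidden_dim, text_dim)
        self.b1 = [0.0] * hidden_dim
        self.w2 = random_matrix(rng, joint_dim, hidden_dim)
        self.b2 = [0.0] * joint_dim

    def __call__(self, text_embedding):
        hidden = relu(linear(text_embedding, self.w1, self.b1))
        return l2_normalize(linear(hidden, self.w2, self.b2))


def classify_regions(region_features, class_weights):
    """Score each region against every class with a plain dot product."""
    scores = []
    for region in region_features:
        region = l2_normalize(region)
        scores.append([sum(r * c for r, c in zip(region, cls)) for cls in class_weights])
    return scores


rng = random.Random(0)
TEXT_DIM, HIDDEN_DIM, JOINT_DIM = 8, 16, 4  # toy sizes, chosen arbitrarily

# Stand-ins for VLM text embeddings of three class names (random, not real).
text_embeddings = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(3)]
adaptor = MLPAdaptor(rng, TEXT_DIM, HIDDEN_DIM, JOINT_DIM)

# Offline step: run the text branch once and freeze the result as class weights.
class_weights = [adaptor(t) for t in text_embeddings]

# Test time: only region features from the detector are needed (random here).
region_features = [[rng.gauss(0, 1) for _ in range(JOINT_DIM)] for _ in range(5)]
scores = classify_regions(region_features, class_weights)
predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
```

Because `class_weights` is computed before inference, neither the VLM nor the adaptor has to run at test time in this sketch; changing the vocabulary only means recomputing that matrix.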

2. Model Overview

Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on the LVIS minival and COCO val2017. All pre-trained models are released.

2.1 Zero-shot Evaluation on LVIS minival

| model | Pre-train Data | Size | APmini | APr | APc | APf | weights |
|---|---|---|---|---|---|---|---|
| YOLO-Worldv1-S (repo) | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | HF Checkpoints 🤗 |
| YOLO-Worldv1-M (repo) | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | HF Checkpoints 🤗 |
| YOLO-Worldv1-L (repo) | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | HF Checkpoints 🤗 |
| YOLO-Worldv1-S (paper) | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | HF Checkpoints 🤗 |
| YOLO-Worldv1-M (paper) | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | HF Checkpoints 🤗 |
| YOLO-Worldv1-L (paper) | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | HF Checkpoints 🤗 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | HF Checkpoints 🤗 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | HF Checkpoints 🤗 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | HF Checkpoints 🤗 |
| DOSOD-S | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | HF Checkpoints 🤗 |
| DOSOD-M | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | HF Checkpoints 🤗 |
| DOSOD-L | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | HF Checkpoints 🤗 |

NOTE: The results of YOLO-Worldv1 from repo and paper are different.

2.2 Zero-shot Inference on COCO dataset

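| model | Pre-train Data | Size | AP | AP50 | AP75 |
|---|---|---|---|---|---|
| YOLO-Worldv1-S | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
| YOLO-Worldv1-M | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
| YOLO-Worldv1-L | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
| DOSOD-S | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
| DOSOD-M | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
| DOSOD-L | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |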

2.3 Latency On RTX 4090

We use trtexec from TensorRT 8.6.1.6 to measure latency in FP16 mode. All models are re-parameterized with the 80 COCO categories. Log info can be found by clicking the FPS.

| model | Params | FPS |
|---|---|---|
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702 |
| YOLO-Worldv1-L | 47.38M | 494 |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553 |
| DOSOD-S | 11.48M | 1582 |
| DOSOD-M | 26.31M | 922 |
| DOSOD-L | 44.19M | 632 |

NOTE: FPS = 1000 / GPU Compute Time[mean]
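The conversion in the note can be sketched as a small helper that reads the mean GPU compute time from a trtexec log and turns it into FPS. The function names are hypothetical, the sample log line is made up for illustration (it is not taken from the released logs), and the regular expression assumes the usual `GPU Compute Time: min = ... ms, max = ... ms, mean = ... ms` summary line, which may differ between TensorRT versions.

```python
import re

# Matches the mean value in a summary line such as
# "GPU Compute Time: min = 1.9 ms, max = 2.2 ms, mean = 2.0 ms, ..."
_MEAN_PATTERN = re.compile(r"GPU Compute Time:.*?mean = ([0-9.]+) ms")


def mean_gpu_compute_ms(log_text):
    """Return the mean GPU compute time in milliseconds from trtexec output."""
    match = _MEAN_PATTERN.search(log_text)
    if match is None:
        raise ValueError("no 'GPU Compute Time' summary line found")
    return float(match.group(1))


def fps_from_latency_ms(mean_ms):
    """FPS = 1000 / GPU Compute Time[mean], with the time in milliseconds."""
    if mean_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / mean_ms


# Illustrative log line only; the numbers are invented.
sample_log = "[I] GPU Compute Time: min = 1.9 ms, max = 2.2 ms, mean = 2.0 ms, median = 2.0 ms"
mean_ms = mean_gpu_compute_ms(sample_log)
fps = fps_from_latency_ms(mean_ms)
print(f"mean latency {mean_ms} ms -> {fps:.0f} FPS")
```

Note that this figure covers GPU compute time only, so it excludes host-side pre-processing and post-processing.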

2.4 Latency On RDK X5

We evaluate the real-time performance of the YOLO-Worldv2 models and our DOSOD models on the D-Robotics RDK X5 development kit. The models are re-parameterized with the 1203 categories defined in LVIS. We run the models on the RDK X5 with either 1 thread or 8 threads, in INT8 or INT16 quantization mode.

| model | FPS (1 thread), INT16/INT8 | FPS (8 threads), INT16/INT8 |
|---|---|---|
| YOLO-Worldv2-S | 5.962/11.044 | 6.386/12.590 |
| YOLO-Worldv2-M | 4.136/7.290 | 4.340/7.930 |
| YOLO-Worldv2-L | 2.958/5.377 | 3.060/5.720 |
| DOSOD-S | 12.527/31.020 | 14.657/47.328 |
| DOSOD-M | 8.531/20.238 | 9.471/26.36 |
| DOSOD-L | 5.663/12.799 | 6.069/14.939 |
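
To make the RDK X5 comparison easier to read, the following sketch computes the FPS ratio of DOSOD over YOLO-Worldv2 for each model size and setting. The FPS values are copied from the table above; the `FPS` dictionary layout and the `speedups` helper are illustrative choices, not part of the released code.

```python
# FPS values copied from the RDK X5 table, keyed by (quantization, threads).
FPS = {
    "YOLO-Worldv2": {
        "S": {("INT16", 1): 5.962, ("INT8", 1): 11.044, ("INT16", 8): 6.386, ("INT8", 8): 12.590},
        "M": {("INT16", 1): 4.136, ("INT8", 1): 7.290, ("INT16", 8): 4.340, ("INT8", 8): 7.930},
        "L": {("INT16", 1): 2.958, ("INT8", 1): 5.377, ("INT16", 8): 3.060, ("INT8", 8): 5.720},
    },
    "DOSOD": {
        "S": {("INT16", 1): 12.527, ("INT8", 1): 31.020, ("INT16", 8): 14.657, ("INT8", 8): 47.328},
        "M": {("INT16", 1): 8.531, ("INT8", 1): 20.238, ("INT16", 8): 9.471, ("INT8", 8): 26.36},
        "L": {("INT16", 1): 5.663, ("INT8", 1): 12.799, ("INT16", 8): 6.069, ("INT8", 8): 14.939},
    },
}


def speedups(fps, baseline="YOLO-Worldv2", candidate="DOSOD"):
    """Return candidate FPS divided by baseline FPS for every size and setting."""
    return {
        size: {key: value / fps[baseline][size][key] for key, value in settings.items()}
        for size, settings in fps[candidate].items()
    }


ratios = speedups(FPS)
for size in ("S", "M", "L"):
    for (quant, threads), ratio in sorted(ratios[size].items()):
        print(f"DOSOD-{size} vs YOLO-Worldv2-{size}, {quant}, {threads} thread(s): {ratio:.2f}x")
```

With these numbers, every ratio is above 1, and the gap is larger in INT8 mode than in INT16 mode for each model size.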