Update README.md

README.md CHANGED

@@ -1,77 +1,46 @@
---
license: apache-2.0
---
<div align="center">
<img width=500px src="./assets/yolo_logo.png">
<br>
<a href="https://scholar.google.com/citations?hl=zh-CN&user=PH8rJHYAAAAJ">Tianheng Cheng</a><sup><span>2,3</span></sup>,
<a href="https://linsong.info/">Lin Song</a><sup><span>1</span></sup>,
<a href="">Yixiao Ge</a><sup><span>1,2</span></sup>,
<a href="http://eic.hust.edu.cn/professor/liuwenyu/"> Wenyu Liu</a><sup><span>3</span></sup>,
<a href="">Xinggang Wang</a><sup><span>3</span></sup>,
<a href="https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en">Ying Shan</a><sup><span>1,2</span></sup>
</br>

\* Equal contribution 🌟 Project lead 📧 Corresponding author

<sup>1</sup> Tencent AI Lab, <sup>2</sup> ARC Lab, Tencent PCG
<sup>3</sup> Huazhong University of Science and Technology
<br>
<div>

[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2401.17270)
<a href="https://colab.research.google.com/github/AILab-CVC/YOLO-World/blob/master/inference.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
[![demo](https://img.shields.io/badge/🤗HuggingFace-Spaces-orange)](https://huggingface.co/spaces/stevengrove/YOLO-World)
[![Replicate](https://replicate.com/zsxkib/yolo-world/badge)](https://replicate.com/zsxkib/yolo-world)
[![hfpaper](https://img.shields.io/badge/🤗HuggingFace-Paper-yellow)](https://huggingface.co/papers/2401.17270)
[![license](https://img.shields.io/badge/License-GPLv3.0-blue)](LICENSE)
[![yoloworldseg](https://img.shields.io/badge/YOLOWorldxEfficientSAM-🤗Spaces-orange)](https://huggingface.co/spaces/SkalskiP/YOLO-World)
[![yologuide](https://img.shields.io/badge/📖Notebook-roboflow-purple)](https://supervision.roboflow.com/develop/notebooks/zero-shot-object-detection-with-yolo-world)
[![deploy](https://media.roboflow.com/deploy.svg)](https://inference.roboflow.com/foundation/yolo_world/)

</div>
</div>

## Notice

We recommend that everyone **use English to communicate on issues**, as this helps developers from around the world discuss, share experiences, and answer questions together.

## 🔥 Updates

`[2024-3-28]:` We provide: (1) more high-resolution pre-trained models (e.g., S, M, X) ([#142](https://github.com/AILab-CVC/YOLO-World/issues/142)); (2) pre-trained models with CLIP-Large text encoders. Most importantly, we preliminarily fix the **fine-tuning without `mask-refine`** and explore a new fine-tuning setting ([#160](https://github.com/AILab-CVC/YOLO-World/issues/160),[#76](https://github.com/AILab-CVC/YOLO-World/issues/76)). In addition, fine-tuning YOLO-World with `mask-refine` also obtains significant improvements, check more details in [configs/finetune_coco](./configs/finetune_coco/).
`[2024-3-16]:` We fix the bugs about the demo ([#110](https://github.com/AILab-CVC/YOLO-World/issues/110),[#94](https://github.com/AILab-CVC/YOLO-World/issues/94),[#129](https://github.com/AILab-CVC/YOLO-World/issues/129), [#125](https://github.com/AILab-CVC/YOLO-World/issues/125)) with visualizations of segmentation masks, and release [**YOLO-World with Embeddings**](./docs/prompt_yolo_world.md), which supports prompt tuning, text prompts and image prompts.
`[2024-3-3]:` We add the **high-resolution YOLO-World**, which supports `1280x1280` resolution with higher accuracy and better performance for small objects!
`[2024-2-29]:` We release the newest version of [ **YOLO-World-v2**](./docs/updates.md) with higher accuracy and faster speed! We hope the community can join us to improve YOLO-World!
`[2024-2-28]:` Excited to announce that YOLO-World has been accepted by **CVPR 2024**! We're continuing to make YOLO-World faster and stronger, as well as making it better to use for all.
`[2024-2-22]:` We sincerely thank [RoboFlow](https://roboflow.com/) and [@Skalskip92](https://twitter.com/skalskip92) for the [**Video Guide**](https://www.youtube.com/watch?v=X7gKBGVz4vs) about YOLO-World, nice work!
`[2024-2-18]:` We thank [@Skalskip92](https://twitter.com/skalskip92) for developing the wonderful segmentation demo via connecting YOLO-World and EfficientSAM. You can try it now at the [🤗 HuggingFace Spaces](https://huggingface.co/spaces/SkalskiP/YOLO-World).
`[2024-2-17]:` The largest model **X** of YOLO-World is released, which achieves better zero-shot performance!
`[2024-2-17]:` We release the code & models for **YOLO-World-Seg** now! YOLO-World now supports open-vocabulary / zero-shot object segmentation!
`[2024-2-15]:` The pre-trained YOLO-World-L with CC3M-Lite is released!
`[2024-2-14]:` We provide the [`image_demo`](demo.py) for inference on images or directories.
`[2024-2-10]:` We provide the [fine-tuning](./docs/finetuning.md) and [data](./docs/data.md) details for fine-tuning YOLO-World on the COCO dataset or the custom datasets!
`[2024-2-3]:` We support the `Gradio` demo now in the repo and you can build the YOLO-World demo on your own device!
`[2024-2-1]:` We've released the code and weights of YOLO-World now!
`[2024-2-1]:` We deploy the YOLO-World demo on [HuggingFace 🤗](https://huggingface.co/spaces/stevengrove/YOLO-World), you can try it now!
`[2024-1-31]:` We are excited to launch **YOLO-World**, a cutting-edge real-time open-vocabulary object detector.

## Roadmap

YOLO-World is under active development 📃. If you have suggestions or ideas 💡, **we would love for you to bring them up in the [Roadmap](https://github.com/AILab-CVC/YOLO-World/issues/109)** ❤️!
> YOLO-World is currently under active development 📃. If you have suggestions or ideas 💡, **we would very much like you to raise them in the [Roadmap](https://github.com/AILab-CVC/YOLO-World/issues/109)** ❤️!

## FAQ

We have set up an FAQ about YOLO-World in the GitHub discussions. We hope everyone can raise the issues or solutions they encounter during use, and we also hope you can quickly find answers there.
> We have built a YOLO-World FAQ in the GitHub discussions to collect frequently asked questions. Everyone is welcome to raise the problems or solutions they run into, and we hope you can quickly find answers from it.
## Highlights & Introduction
This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for YOLO-World.
@@ -79,51 +48,36 @@ This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for YOLO-World.
* YOLO-World is the next-generation YOLO detector, with a strong open-vocabulary detection capability and grounding ability.
* YOLO-World presents a *prompt-then-detect* paradigm for efficient user-vocabulary inference, which re-parameterizes vocabulary embeddings as parameters into the model and achieves superior inference speed. You can try to export your own detection model without extra training or fine-tuning in our [online demo](https://huggingface.co/spaces/stevengrove/YOLO-World)!
<center>
<img width=800px src="./assets/yolo_arch.png">
</center>
## Model Zoo

<div><font size=2>

| model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | AP<sup>val</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | weights |
| :-- | :-- | :-- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| [YOLO-Worldv2-S](./configs/pretrain/yolo_world_v2_s_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | 17.3 | 11.3 | 14.9 | 22.7 |[HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth)|
| [YOLO-Worldv2-S](./configs/pretrain/yolo_world_v2_s_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py) | O365+GoldG | 1280🔸 | 24.1 | 18.7 | 22.0 | 26.9 | 18.8 | 14.1 | 16.3 | 23.8 |[HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain_1280ft-fc4ff4f7.pth)|
| [YOLO-Worldv2-M](./configs/pretrain/yolo_world_v2_m_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | 23.5 | 17.1 | 20.0 | 30.1 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth)|
| [YOLO-Worldv2-M](./configs/pretrain/yolo_world_v2_m_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py) | O365+GoldG | 1280🔸 | 31.6 | 24.5 | 29.0 | 35.1 | 25.3 | 19.3 | 22.0 | 31.7 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain_1280ft-77d0346d.pth)|
| [YOLO-Worldv2-L](./configs/pretrain/yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | 26.0 | 18.6 | 23.0 | 32.6 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth)|
| [YOLO-Worldv2-L](./configs/pretrain/yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py) | O365+GoldG | 1280🔸 | 34.6 | 29.2 | 32.8 | 37.2 | 27.6 | 21.9 | 24.2 | 34.0 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain_1280ft-9babe3f6.pth)|
| [YOLO-Worldv2-L (CLIP-Large)](./configs/pretrain/yolo_world_v2_l_clip_large_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) 🔥 | O365+GoldG | 640 | 34.0 | 22.0 | 32.6 | 37.4 | 27.1 | 19.9 | 23.9 | 33.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_clip_large_o365v1_goldg_pretrain-8ff2e744.pth)|
| [YOLO-Worldv2-L (CLIP-Large)](./configs/pretrain/yolo_world_v2_l_clip_large_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_800ft_lvis_minival.py) 🔥 | O365+GoldG | 800🔸 | 35.5 | 28.3 | 33.2 | 38.8 | 28.6 | 22.0 | 25.1 | 35.4 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_clip_large_o365v1_goldg_pretrain_800ft-9df82e55.pth)|
| [YOLO-Worldv2-L](./configs/pretrain/yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG+CC3M-Lite | 640 | 32.9 | 25.3 | 31.1 | 35.8 | 26.1 | 20.6 | 22.6 | 32.3 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_cc3mlite_pretrain-ca93cd1f.pth)|
| [YOLO-Worldv2-X](./configs/pretrain/yolo_world_v2_x_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG+CC3M-Lite | 640 | 35.4 | 28.7 | 32.9 | 38.7 | 28.4 | 20.6 | 25.6 | 35.0 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_x_obj365v1_goldg_cc3mlite_pretrain-8698fbfa.pth) |
| [YOLO-Worldv2-XL](./configs/pretrain/yolo_world_v2_xl_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG+CC3M-Lite | 640 | 36.0 | 25.8 | 34.1 | 39.5 | 29.1 | 21.1 | 26.3 | 35.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_x_obj365v1_goldg_cc3mlite_pretrain-8698fbfa.pth) |
</font>
</div>
|Pre-training Log | [Part-1](https://drive.google.com/file/d/1oib7pKfA2h1U_5-85H_s0Nz8jWd0R-WP/view?usp=drive_link), [Part-2](https://drive.google.com/file/d/11cZ6OZy80VTvBlZy3kzLAHCxx5Iix5-n/view?usp=drive_link) | [Part-1](https://drive.google.com/file/d/1E6vYSS8kBipGc8oQnsjAfeUAx8I9yOX7/view?usp=drive_link), [Part-2](https://drive.google.com/file/d/1fbM7vt2tgSeB8o_7tUDofWvpPNSViNj5/view?usp=drive_link) | [Part-1](https://drive.google.com/file/d/1Tola1QGJZTL6nGy3SBxKuknfNfREDm8J/view?usp=drive_link), [Part-2](https://drive.google.com/file/d/1mTBXniioUb0CdctCG4ckIU6idGo0NnH8/view?usp=drive_link) | [Final part](https://drive.google.com/file/d/1aEUA_EPQbXOrpxHTQYB6ieGXudb1PLpd/view?usp=drive_link)|
## Getting started
@@ -132,16 +86,19 @@ We provide the pre-training logs of `YOLO-World-v2`. Due to the unexpected error
YOLO-World is developed based on `torch==1.11.0`, `mmyolo==0.6.0`, and `mmdetection==3.0.0`.
#### Clone Project
```bash
# clone the repo
git clone https://github.com/AILab-CVC/YOLO-World.git
cd YOLO-World
```

```bash
pip install torch wheel -q
pip install -e .
```
### 2. Preparing Data
@@ -162,7 +119,7 @@ chmod +x tools/dist_train.sh
**NOTE:** YOLO-World is pre-trained on 4 nodes with 8 GPUs per node (32 GPUs in total). For pre-training, the `node_rank` and `nnodes` for multi-node training should be specified.
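
For reference, a multi-node launch could look like the following sketch. This is illustrative only: it assumes the standard mmyolo-style `dist_train.sh`, which reads `NNODES`, `NODE_RANK`, `MASTER_ADDR`, and `PORT` from the environment, and the config path is simply one of the files under `configs/pretrain/`.

```bash
# hypothetical launch for node 0 of a 4-node x 8-GPU pre-training run;
# repeat on each node with its own NODE_RANK (0..3)
NNODES=4 NODE_RANK=0 MASTER_ADDR=10.0.0.1 PORT=29500 \
./tools/dist_train.sh configs/pretrain/yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py 8
```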

Evaluating YOLO-World is also easy:

```bash
chmod +x tools/dist_test.sh
```

@@ -171,66 +128,26 @@ chmod +x tools/dist_test.sh

**NOTE:** We mainly evaluate the performance on LVIS-minival for pre-training.
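
As a concrete sketch, evaluating the pre-trained L model from the Model Zoo above on LVIS-minival with 8 GPUs could look like this; the config and checkpoint names follow the table, and the `CONFIG CHECKPOINT GPUS` argument order is assumed from the standard mmyolo `dist_test.sh`.

```bash
# hypothetical zero-shot evaluation run; names taken from the Model Zoo table
./tools/dist_test.sh configs/pretrain/yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py \
    yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth 8
```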

## Fine-tuning YOLO-World

We provide the details about fine-tuning YOLO-World in [docs/fine-tuning](./docs/finetuning.md).

## Deployment

We provide the details about deployment for downstream applications in [docs/deployment](./docs/deploy.md).
You can directly download the ONNX model through the online [demo](https://huggingface.co/spaces/stevengrove/YOLO-World) in HuggingFace Spaces 🤗.
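
A downloaded model can be sanity-checked with `onnxruntime`; the file name below is illustrative rather than a fixed output name of the demo.

```bash
# inspect the exported model's input signature (file name assumed)
pip install onnxruntime -q
python -c "import onnxruntime as ort; print([(i.name, i.shape) for i in ort.InferenceSession('yolo_world.onnx').get_inputs()])"
```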
## Demo

### Gradio Demo

We provide the [Gradio](https://www.gradio.app/) demo for local devices:

```bash
pip install gradio==4.16.0
python demo.py path/to/config path/to/weights
```
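
As a concrete example, a config and checkpoint from the Model Zoo above can be plugged in directly; the `wget` URL assumes Hugging Face's usual `resolve/main` download path for the file linked in the table.

```bash
# download a pre-trained checkpoint listed in the Model Zoo (URL pattern assumed)
wget https://huggingface.co/wondervictor/YOLO-World/resolve/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth
python demo.py configs/pretrain/yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py \
    yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth
```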

Additionally, you can use a Dockerfile to build an image with Gradio. As a prerequisite, make sure you have the respective drivers installed alongside [nvidia-container-runtime](https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime). Replace `MODEL_NAME` and `WEIGHT_NAME` with the respective values, or omit this and use the default values from the [Dockerfile](Dockerfile#3).

```bash
docker build --build-arg="MODEL=MODEL_NAME" --build-arg="WEIGHT=WEIGHT_NAME" -t yolo_demo .
docker run --runtime nvidia -p 8080:8080 yolo_demo
```

### Image Demo

We provide a simple image demo for inference on images with visualization outputs.

```bash
python image_demo.py path/to/config path/to/weights image/path/directory 'person,dog,cat' --topk 100 --threshold 0.005 --output-dir demo_outputs
```

**Notes:**

* The `image` can be a directory or a single image.
* The `texts` can be a string of categories (noun phrases) separated by commas. We also support a `txt` file in which each line contains a category (noun phrase); see the sketch after this list.
* The `topk` and `threshold` control the number of predictions and the confidence threshold.
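
For example, the `txt` option mentioned above can be used like this; the file name and categories are illustrative:

```bash
# one category (noun phrase) per line; the file name is arbitrary
printf "person\ndog\ncat\n" > custom_classes.txt
python image_demo.py path/to/config path/to/weights image/path/directory custom_classes.txt \
    --topk 100 --threshold 0.005 --output-dir demo_outputs
```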

### Google Colab Notebook

We sincerely thank [Onuralp](https://github.com/onuralpszr) for sharing the [Colab Demo](https://colab.research.google.com/drive/1F_7S5lSaFM06irBCZqjhbN7MpUXo6WwO?usp=sharing), you can have a try 😊!

## Acknowledgement

We sincerely thank [mmyolo](https://github.com/open-mmlab/mmyolo), [mmdetection](https://github.com/open-mmlab/mmdetection), and [transformers](https://github.com/huggingface/transformers) for providing their wonderful code to the community!

## Citations

If you find YOLO-World useful in your research or applications, please consider giving us a star 🌟 and citing it.

```bibtex
@article{cheng2024yolow,
  title={YOLO-World: Real-Time Open-Vocabulary Object Detection},
  author={Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying},
  journal={arXiv preprint arXiv:2401.17270},
  year={2024}
}
```

## Licence

YOLO-World is released under the GPL-v3 Licence and supports commercial usage.
---
title: YOLO World
emoji: 🔥
colorFrom: pink
colorTo: blue
pinned: false
license: apache-2.0
app_file: app.py
sdk: gradio
sdk_version: 4.16.0
---

<div align="center">
<center>
<img width=500px src="./assets/yolo_logo.png">
</center>
<br>
<a href="https://scholar.google.com/citations?hl=zh-CN&user=PH8rJHYAAAAJ">Tianheng Cheng*</a><sup><span>2,3</span></sup>,
<a href="https://linsong.info/">Lin Song*</a><sup><span>1</span></sup>,
<a href="">Yixiao Ge</a><sup><span>1,2</span></sup>,
<a href="">Xinggang Wang</a><sup><span>3</span></sup>,
<a href="http://eic.hust.edu.cn/professor/liuwenyu/"> Wenyu Liu</a><sup><span>3</span></sup>,
<a href="">Ying Shan</a><sup><span>1,2</span></sup>
</br>
<sup>1</sup> Tencent AI Lab, <sup>2</sup> ARC Lab, Tencent PCG
<sup>3</sup> Huazhong University of Science and Technology
<br>
<div>
[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2401.17270)
[![video](https://img.shields.io/badge/🤗HuggingFace-Spaces-orange)](https://huggingface.co/)
[![license](https://img.shields.io/badge/License-GPLv3.0-blue)](LICENSE)
</div>
</div>

## Updates

`[2024-1-25]:` We are excited to launch **YOLO-World**, a cutting-edge real-time open-vocabulary object detector.

## Highlights

This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for YOLO-World.
* YOLO-World is the next-generation YOLO detector, with a strong open-vocabulary detection capability and grounding ability.
* YOLO-World presents a *prompt-then-detect* paradigm for efficient user-vocabulary inference, which re-parameterizes vocabulary embeddings as parameters into the model and achieves superior inference speed. You can try to export your own detection model without extra training or fine-tuning in our [online demo]()!
<center>
<img width=800px src="./assets/yolo_arch.png">
</center>

## Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.

## Demo


## Main Results

We've pre-trained YOLO-World-S/M/L from scratch and evaluated them on `LVIS val-1.0` and `LVIS minival`. We provide the pre-trained model weights and training logs for applications/research and for reproducing the results.
### Zero-shot Inference on LVIS dataset
| model | Pre-train Data | AP | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | FPS(V100) | weights | log |
| :---- | :------------- | :-:| :------------: |:-------------: | :-------: | :-----: | :---: | :---: |
| [YOLO-World-S](./configs/pretrain/yolo_world_s_t2i_bn_2e-4_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG | 17.6 | 11.9 | 14.5 | 23.2 | - | [wecom](https://drive.weixin.qq.com/s?k=AJEAIQdfAAoREsieRl) | [log]() |
| [YOLO-World-M](./configs/pretrain/yolo_world_m_t2i_bn_2e-4_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG | 23.5 | 17.2 | 20.4 | 29.6 | - | [wecom](https://drive.weixin.qq.com/s?k=AJEAIQdfAAoj0byBC0) | [log]() |
| [YOLO-World-L](./configs/pretrain/yolo_world_l_t2i_bn_2e-4_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py) | O365+GoldG | 25.7 | 18.7 | 22.6 | 32.2 | - | [wecom](https://drive.weixin.qq.com/s?k=AJEAIQdfAAoK06oxO2) | [log]() |
**NOTE:**
1. The evaluation results are tested on LVIS minival in a zero-shot manner.

## Getting started

YOLO-World is developed based on `torch==1.11.0`, `mmyolo==0.6.0`, and `mmdetection==3.0.0`.
```bash
# install key dependencies (the PyPI package name for mmdetection 3.x is mmdet)
pip install mmdet==3.0.0 mmengine transformers

# clone the repo
git clone https://github.com/AILab-CVC/YOLO-World.git
cd YOLO-World

# install mmyolo under third_party
mkdir third_party
cd third_party
git clone https://github.com/open-mmlab/mmyolo.git
cd ..
```
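
A quick way to confirm the environment, assuming the installs above succeeded:

```bash
# verify that the key dependencies import cleanly
python -c "import torch, mmdet, mmengine, transformers; print(torch.__version__, mmdet.__version__)"
```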
### 2. Preparing Data
**NOTE:** YOLO-World is pre-trained on 4 nodes with 8 GPUs per node (32 GPUs in total). For pre-training, the `node_rank` and `nnodes` for multi-node training should be specified.

Evaluating YOLO-World is also easy:

```bash
chmod +x tools/dist_test.sh
```

**NOTE:** We mainly evaluate the performance on LVIS-minival for pre-training.

## Deployment

We provide the details about deployment for downstream applications in [docs/deployment](./docs/deploy.md).
You can directly download the ONNX model through the online [demo]() in HuggingFace Spaces 🤗.
## Acknowledgement

We sincerely thank [mmyolo](https://github.com/open-mmlab/mmyolo), [mmdetection](https://github.com/open-mmlab/mmdetection), and [transformers](https://github.com/huggingface/transformers) for providing their wonderful code to the community!

## Citations

If you find YOLO-World useful in your research or applications, please consider giving us a star 🌟 and citing it.

```bibtex
@article{cheng2024yolow,
  title={YOLO-World: Real-Time Open-Vocabulary Object Detection},
  author={Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying},
  journal={arXiv preprint arXiv:},
  year={2024}
}
```

## Licence

YOLO-World is released under the GPL-v3 Licence and supports commercial usage.