---
library_name: transformers
license: apache-2.0
datasets:
- detection-datasets/coco
language:
- en
pipeline_tag: object-detection
---

# Relation DETR model with ResNet-50 backbone

## Model Details

The model is not available yet: we are working on integrating Relation-DETR into `transformers` and will update this card as soon as possible.

### Model Description

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66939171e3a813f3bb10e804/kNzBZZ2SFq6Wgk2ki_c5t.png)

> This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer).
> We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from
> the self-attention that introduces no structural bias over inputs. To address this issue, we explore incorporating
> position relation prior as attention bias to augment object detection, following the verification of its statistical
> significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR,
> introduces an encoder to construct position relation embeddings for progressive attention refinement, which further
> extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to address the conflicts
> between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific
> datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a
> significant improvement (+2.0% AP compared to DINO), state-of-the-art performance (51.7% AP for 1x and 52.1% AP
> for 2x settings), and a remarkably faster convergence speed (over 40% AP with only 2 training epochs) than existing
> DETR detectors on COCO val2017. Moreover, the proposed relation encoder serves as a universal plug-in-and-play component,
> bringing clear improvements for theoretically any DETR-like methods. Furthermore, we introduce a class-agnostic detection
> dataset, SA-Det-100k. The experimental results on the dataset illustrate that the proposed explicit position relation
> achieves a clear improvement of 1.3% AP, highlighting its potential towards universal object detection.
> The code and dataset are available at [this https URL](https://github.com/xiuqhou/Relation-DETR).
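
At its core, the method biases self-attention with pairwise box geometry. Below is a minimal, hypothetical sketch of such a position relation bias (log-scale pairwise box deltas projected to per-head attention offsets); it illustrates the general mechanism only, not the authors' exact encoder:

```python
import torch
import torch.nn as nn

def position_relation_bias(boxes: torch.Tensor, proj: nn.Linear, eps: float = 1e-5) -> torch.Tensor:
    """Turn pairwise box geometry into a (num_heads, N, N) additive attention bias.
    `boxes` is (N, 4) in (cx, cy, w, h) format. Hypothetical sketch, not the
    official Relation-DETR position relation encoder."""
    cx, cy, w, h = boxes.unbind(-1)
    # Scale-invariant, log-space pairwise deltas between every pair of boxes.
    dx = torch.log1p((cx[:, None] - cx[None, :]).abs() / (w[:, None] + eps))
    dy = torch.log1p((cy[:, None] - cy[None, :]).abs() / (h[:, None] + eps))
    dw = torch.log(w[:, None] / (w[None, :] + eps))
    dh = torch.log(h[:, None] / (h[None, :] + eps))
    rel = torch.stack([dx, dy, dw, dh], dim=-1)  # (N, N, 4)
    return proj(rel).permute(2, 0, 1)            # (num_heads, N, N)

proj = nn.Linear(4, 8)                           # project 4 geometry features to 8 heads
boxes = torch.rand(5, 4)                         # 5 boxes, (cx, cy, w, h)
bias = position_relation_bias(boxes, proj)
print(bias.shape)                                # torch.Size([8, 5, 5])
# The bias would be added to the attention logits before the softmax.
```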

- **Developed by:** Xiuquan Hou
- **Shared by:** Xiuquan Hou
- **Model type:** Relation DETR
- **License:** Apache-2.0

### Model Sources

- **Repository:** [https://github.com/xiuqhou/Relation-DETR](https://github.com/xiuqhou/Relation-DETR)
- **Paper:** [Relation DETR: Exploring Explicit Position Relation Prior for Object Detection](https://arxiv.org/abs/2407.11699)

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import requests

from PIL import Image
from transformers import RelationDetrForObjectDetection, RelationDetrImageProcessor

url = 'http://images.cocodataset.org/val2017/000000039769.jpg' 
image = Image.open(requests.get(url, stream=True).raw)

# checkpoint id assumed to be this model repository; adjust if it differs
image_processor = RelationDetrImageProcessor.from_pretrained("xiuqhou/relation-detr-resnet50")
model = RelationDetrForObjectDetection.from_pretrained("xiuqhou/relation-detr-resnet50")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```

This should output:

```python
cat: 0.96 [343.8, 24.9, 639.52, 371.71]
cat: 0.95 [12.6, 54.34, 316.37, 471.86]
remote: 0.95 [40.09, 73.49, 175.52, 118.06]
remote: 0.90 [333.09, 76.71, 369.77, 187.4]
couch: 0.90 [0.44, 0.53, 640.44, 475.54]
```
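
To inspect the detections visually, you can draw the post-processed boxes back onto the image, for example with PIL (a minimal sketch reusing `image`, `results`, and `model` from the snippet above):

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        x0, y0, x1, y1 = box.tolist()
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0, y0), f"{model.config.id2label[label_id.item()]}: {score:.2f}", fill="red")
image.save("detections.png")
```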

## Training Details

The Relation DEtection TRansformer (Relation DETR) model was trained on [COCO 2017 object detection](https://cocodataset.org/#download) (118k annotated images) for 12 epochs (the so-called 1× schedule).
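
If you want to prepare COCO data the same way on your side, the image processor can also encode annotations. Below is a minimal sketch using the `detection-datasets/coco` dataset listed in the metadata, assuming the processor follows the usual DETR image-processor conventions and the field names match that dataset's schema (it reuses `image_processor` from the quick-start snippet):

```python
from datasets import load_dataset

ds = load_dataset("detection-datasets/coco", split="val")

example = ds[0]
objects = example["objects"]  # field names assumed from the dataset card
annotations = {
    "image_id": example["image_id"],
    "annotations": [
        # COCO detection format expects [x, y, width, height] boxes;
        # convert first if the dataset stores a different box format.
        {"bbox": bbox, "category_id": cat, "area": area, "iscrowd": 0}
        for bbox, cat, area in zip(objects["bbox"], objects["category"], objects["area"])
    ],
}
encoding = image_processor(images=example["image"], annotations=annotations, return_tensors="pt")
print(encoding["pixel_values"].shape)
```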

## Evaluation

| Model          | Backbone         | Epochs | mAP  | AP<sub>50</sub> | AP<sub>75</sub> | AP<sub>S</sub> | AP<sub>M</sub> | AP<sub>L</sub> |
| -------------- | ---------------- | :----: | :--: | :-------------: | :-------------: | :------------: | :------------: | :------------: |
| Relation DETR  | ResNet-50        |   12   | 51.7 |      69.1       |      56.3       |      36.1      |      55.6      |      66.1      |
| Relation DETR  | Swin-L (IN-22K)  |   12   | 57.8 |      76.1       |      62.9       |      41.2      |      62.1      |      74.4      |
| Relation DETR  | ResNet-50        |   24   | 52.1 |      69.7       |      56.6       |      36.1      |      56.0      |      66.5      |
| Relation DETR  | Swin-L (IN-22K)  |   24   | 58.1 |      76.4       |      63.5       |      41.8      |      63.0      |      73.5      |
| Relation DETR† | Focal-L (IN-22K) |  4+24  | 63.5 |      80.8       |      69.1       |      47.2      |      66.9      |      77.0      |

† denotes a model fine-tuned on COCO after pre-training on Objects365.
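
For a quick sanity check of your own runs (as opposed to the official COCO evaluation reported above), `torchmetrics` provides a convenient mAP implementation. A minimal sketch, reusing `results` from the quick-start snippet and a made-up ground-truth box:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy")
preds = [{"boxes": r["boxes"], "scores": r["scores"], "labels": r["labels"]} for r in results]
# Dummy ground truth for illustration only (17 = "cat" in the COCO label map).
target = [{"boxes": torch.tensor([[344.0, 25.0, 640.0, 372.0]]), "labels": torch.tensor([17])}]
metric.update(preds, target)
print(metric.compute()["map"])
```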

## Model Architecture and Objective

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66939171e3a813f3bb10e804/UMtLjkxrwoDikUBlgj-Fc.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66939171e3a813f3bb10e804/MBbCM-zQGgUjKUmwB0yje.png)

## Citation and BibTeX

```
@misc{hou2024relationdetrexploringexplicit,
      title={Relation DETR: Exploring Explicit Position Relation Prior for Object Detection}, 
      author={Xiuquan Hou and Meiqin Liu and Senlin Zhang and Ping Wei and Badong Chen and Xuguang Lan},
      year={2024},
      eprint={2407.11699},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.11699}, 
}
```

## Model Card Authors

[xiuqhou](https://huggingface.co/xiuqhou)