---
library_name: transformers
license: apache-2.0
datasets:
- detection-datasets/coco
language:
- en
pipeline_tag: object-detection
---

# Relation DETR model with ResNet-50 backbone

## Model Details

The model is not available yet: we are working on integrating Relation-DETR into `transformers` and will update this card as soon as possible.

### Model Description

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66939171e3a813f3bb10e804/kNzBZZ2SFq6Wgk2ki_c5t.png)

> This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer).
> We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from
> the self-attention that introduces no structural bias over inputs. To address this issue, we explore incorporating
> position relation prior as attention bias to augment object detection, following the verification of its statistical
> significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR,
> introduces an encoder to construct position relation embeddings for progressive attention refinement, which further
> extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to address the conflicts
> between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific
> datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a
> significant improvement (+2.0% AP compared to DINO), state-of-the-art performance (51.7% AP for 1x and 52.1% AP
> for 2x settings), and a remarkably faster convergence speed (over 40% AP with only 2 training epochs) than existing
> DETR detectors on COCO val2017. Moreover, the proposed relation encoder serves as a universal plug-in-and-play component,
> bringing clear improvements for theoretically any DETR-like methods. Furthermore, we introduce a class-agnostic detection
> dataset, SA-Det-100k. The experimental results on the dataset illustrate that the proposed explicit position relation
> achieves a clear improvement of 1.3% AP, highlighting its potential towards universal object detection.
> The code and dataset are available at [this https URL](https://github.com/xiuqhou/Relation-DETR).
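
At its core, the method biases self-attention with pairwise box geometry. Below is a minimal, hypothetical sketch of such a position relation bias (log-scale pairwise box deltas projected to per-head attention offsets); it illustrates the general mechanism only, not the authors' exact encoder:

```python
import torch
import torch.nn as nn

def position_relation_bias(boxes: torch.Tensor, proj: nn.Linear, eps: float = 1e-5) -> torch.Tensor:
    """Turn pairwise box geometry into a (num_heads, N, N) additive attention bias.
    `boxes` is (N, 4) in (cx, cy, w, h) format. Hypothetical sketch, not the
    official Relation-DETR position relation encoder."""
    cx, cy, w, h = boxes.unbind(-1)
    # Scale-invariant, log-space pairwise deltas between every pair of boxes.
    dx = torch.log1p((cx[:, None] - cx[None, :]).abs() / (w[:, None] + eps))
    dy = torch.log1p((cy[:, None] - cy[None, :]).abs() / (h[:, None] + eps))
    dw = torch.log(w[:, None] / (w[None, :] + eps))
    dh = torch.log(h[:, None] / (h[None, :] + eps))
    rel = torch.stack([dx, dy, dw, dh], dim=-1)  # (N, N, 4)
    return proj(rel).permute(2, 0, 1)            # (num_heads, N, N)

proj = nn.Linear(4, 8)                           # project 4 geometry features to 8 heads
boxes = torch.rand(5, 4)                         # 5 boxes, (cx, cy, w, h)
bias = position_relation_bias(boxes, proj)
print(bias.shape)                                # torch.Size([8, 5, 5])
# The bias would be added to the attention logits before the softmax.
```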

- **Developed by:** Xiuquan Hou
- **Shared by:** Xiuquan Hou
- **Model type:** Relation DETR
- **License:** Apache-2.0

### Model Sources

- **Repository:** [https://github.com/xiuqhou/Relation-DETR](https://github.com/xiuqhou/Relation-DETR)
- **Paper:** [Relation DETR: Exploring Explicit Position Relation Prior for Object Detection](https://arxiv.org/abs/2407.11699)

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import requests

from PIL import Image
from transformers import RelationDetrForObjectDetection, RelationDetrImageProcessor

url = 'http://images.cocodataset.org/val2017/000000039769.jpg' 
image = Image.open(requests.get(url, stream=True).raw)

# checkpoint id assumed to be this model repository; adjust if it differs
image_processor = RelationDetrImageProcessor.from_pretrained("xiuqhou/relation-detr-resnet50")
model = RelationDetrForObjectDetection.from_pretrained("xiuqhou/relation-detr-resnet50")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```

This should output:

```python
cat: 0.96 [343.8, 24.9, 639.52, 371.71]
cat: 0.95 [12.6, 54.34, 316.37, 471.86]
remote: 0.95 [40.09, 73.49, 175.52, 118.06]
remote: 0.90 [333.09, 76.71, 369.77, 187.4]
couch: 0.90 [0.44, 0.53, 640.44, 475.54]
```
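
To inspect the detections visually, you can draw the post-processed boxes back onto the image, for example with PIL (a minimal sketch reusing `image`, `results`, and `model` from the snippet above):

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        x0, y0, x1, y1 = box.tolist()
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0, y0), f"{model.config.id2label[label_id.item()]}: {score:.2f}", fill="red")
image.save("detections.png")
```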

## Training Details

The Relation DEtection TRansformer (Relation DETR) model was trained on [COCO 2017 object detection](https://cocodataset.org/#download) (118k annotated images) for 12 epochs (the so-called 1× schedule).
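
If you want to prepare COCO data the same way on your side, the image processor can also encode annotations. Below is a minimal sketch using the `detection-datasets/coco` dataset listed in the metadata, assuming the processor follows the usual DETR image-processor conventions and the field names match that dataset's schema (it reuses `image_processor` from the quick-start snippet):

```python
from datasets import load_dataset

ds = load_dataset("detection-datasets/coco", split="val")

example = ds[0]
objects = example["objects"]  # field names assumed from the dataset card
annotations = {
    "image_id": example["image_id"],
    "annotations": [
        # COCO detection format expects [x, y, width, height] boxes;
        # convert first if the dataset stores a different box format.
        {"bbox": bbox, "category_id": cat, "area": area, "iscrowd": 0}
        for bbox, cat, area in zip(objects["bbox"], objects["category"], objects["area"])
    ],
}
encoding = image_processor(images=example["image"], annotations=annotations, return_tensors="pt")
print(encoding["pixel_values"].shape)
```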

## Evaluation

| Model          | Backbone         | Epochs | mAP  | AP<sub>50</sub> | AP<sub>75</sub> | AP<sub>S</sub> | AP<sub>M</sub> | AP<sub>L</sub> |
| -------------- | ---------------- | :----: | :--: | :-------------: | :-------------: | :------------: | :------------: | :------------: |
| Relation DETR  | ResNet-50        |   12   | 51.7 |      69.1       |      56.3       |      36.1      |      55.6      |      66.1      |
| Relation DETR  | Swin-L (IN-22K)  |   12   | 57.8 |      76.1       |      62.9       |      41.2      |      62.1      |      74.4      |
| Relation DETR  | ResNet-50        |   24   | 52.1 |      69.7       |      56.6       |      36.1      |      56.0      |      66.5      |
| Relation DETR  | Swin-L (IN-22K)  |   24   | 58.1 |      76.4       |      63.5       |      41.8      |      63.0      |      73.5      |
| Relation DETR† | Focal-L (IN-22K) |  4+24  | 63.5 |      80.8       |      69.1       |      47.2      |      66.9      |      77.0      |

† denotes a model fine-tuned on COCO after pre-training on Objects365.
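
For a quick sanity check of your own runs (as opposed to the official COCO evaluation reported above), `torchmetrics` provides a convenient mAP implementation. A minimal sketch, reusing `results` from the quick-start snippet and a made-up ground-truth box:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy")
preds = [{"boxes": r["boxes"], "scores": r["scores"], "labels": r["labels"]} for r in results]
# Dummy ground truth for illustration only (17 = "cat" in the COCO label map).
target = [{"boxes": torch.tensor([[344.0, 25.0, 640.0, 372.0]]), "labels": torch.tensor([17])}]
metric.update(preds, target)
print(metric.compute()["map"])
```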

## Model Architecture and Objective

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66939171e3a813f3bb10e804/UMtLjkxrwoDikUBlgj-Fc.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66939171e3a813f3bb10e804/MBbCM-zQGgUjKUmwB0yje.png)

## Citation and BibTeX

```
@misc{hou2024relationdetrexploringexplicit,
      title={Relation DETR: Exploring Explicit Position Relation Prior for Object Detection}, 
      author={Xiuquan Hou and Meiqin Liu and Senlin Zhang and Ping Wei and Badong Chen and Xuguang Lan},
      year={2024},
      eprint={2407.11699},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.11699}, 
}
```

## Model Card Authors

[xiuqhou](https://huggingface.co/xiuqhou)