Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Published on May 11, 2023
· Featured in Daily Papers on May 12, 2023


We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
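The swap from softmax cross entropy to focal loss can be illustrated with a minimal sketch. The abstract does not state the exact formulation, so this assumes a per-pair sigmoid focal loss without alpha balancing, where matched image-text pairs (i, i) are positives and every other pair in the batch is a negative. The function names, the temperature of 0.1, and gamma of 2.0 are illustrative choices, not values taken from the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text in the batch."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(v, t)) for t in txts] for v in imgs]


def softmax_ce_loss(sim, temperature=0.1):
    """Image-to-text softmax cross entropy; pair (i, i) is the positive."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def focal_loss(sim, temperature=0.1, gamma=2.0):
    """Assumed sigmoid focal variant: each pair is a binary match decision.

    The factor (1 - p) ** gamma shrinks the contribution of pairs the model
    already classifies confidently, so hard pairs dominate the loss.
    """
    total = 0.0
    for i, row in enumerate(sim):
        for j, s in enumerate(row):
            z = s / temperature
            # Log-probability assigned to the correct label of this pair:
            # log sigmoid(z) for a positive, log(1 - sigmoid(z)) for a negative.
            log_p = -softplus(-z) if i == j else -softplus(z)
            p = math.exp(log_p)
            total += -((1.0 - p) ** gamma) * log_p
    return total / len(sim)
```

With gamma set to 0 the modulating factor disappears and the function reduces to plain sigmoid cross entropy over all pairs, which makes the effect of the focal term easy to isolate in a comparison.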


Region-aware Open-vocabulary Vision Transformers (RO-ViT) is a vision-language foundation model in the vein of CLIP. It is pretrained with contrastive image-text learning in a shared embedding space, but the recipe targets region-level understanding rather than image-level recognition.

Cropped positional embeddings (CPE): the positional embeddings (PE) are upsampled, a region is randomly cropped and resized, and the result is used as the image-level PE during pretraining. In effect, each training image is learned as a region crop from some larger image, which better matches downstream use and gives better region-level generalization. The learned positional embeddings also show more structure and symmetry.

The ViT encoder output is passed to global average pooling (GAP), which gives the final image embedding. For contrastive pretraining, the paper proposes a focal loss that it finds better than the existing softmax cross entropy (CE) loss.

For downstream open-vocabulary object detection, GAP is replaced with detector heads. Detector region scores and VLM region scores are obtained across base classes (available at training time) and novel classes (unseen, observed at test time only).

The model is used for open-vocabulary object detection, image-text retrieval, and transfer object detection. From Google.
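The upsample, crop, and resize steps for the positional embeddings can be sketched as follows. This is a hedged illustration in plain Python rather than the authors' implementation: the bilinear interpolation, the crop scale range (0.1 to 1.0), the aspect ratio range (0.5 to 2.0), and all function names are assumptions made for the example. The positional embeddings are represented as a nested list of shape height x width x channels; a real pipeline would use an array library.

```python
import math
import random


def bilinear_resize(grid, out_h, out_w):
    """Resize an H x W x D nested list with bilinear interpolation."""
    in_h, in_w, dim = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for r in range(out_h):
        # Map the output cell center back to input coordinates, then clamp.
        y = min(max((r + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            x = min(max((c + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][k] + wx * grid[y0][x1][k])
                + wy * ((1 - wx) * grid[y1][x0][k] + wx * grid[y1][x1][k])
                for k in range(dim)
            ])
        out.append(row)
    return out


def sample_crop_box(grid_h, grid_w, rng, scale=(0.1, 1.0), ratio=(0.5, 2.0)):
    """Pick a random box (top, left, height, width) inside the grid.

    The scale and ratio ranges are illustrative defaults, not paper values.
    """
    area = rng.uniform(scale[0], scale[1]) * grid_h * grid_w
    aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
    crop_w = min(max(int(round(math.sqrt(area * aspect))), 1), grid_w)
    crop_h = min(max(int(round(math.sqrt(area / aspect))), 1), grid_h)
    top = rng.randint(0, grid_h - crop_h)
    left = rng.randint(0, grid_w - crop_w)
    return top, left, crop_h, crop_w


def cropped_positional_embedding(pos_emb, up_size, rng=None):
    """Upsample the PE grid, crop a random region, resize to the input size."""
    rng = rng or random
    out_h, out_w = len(pos_emb), len(pos_emb[0])
    upsampled = bilinear_resize(pos_emb, up_size, up_size)
    top, left, crop_h, crop_w = sample_crop_box(up_size, up_size, rng)
    crop = [row[left:left + crop_w] for row in upsampled[top:top + crop_h]]
    return bilinear_resize(crop, out_h, out_w)
```

A call such as `cropped_positional_embedding(pe, 16, random.Random(0))` on a 4 x 4 grid returns a new 4 x 4 grid whose values come from a random sub-window of the upsampled embeddings, so the encoder sees the whole image with positions that look like those of a region inside a larger image.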
