arxiv:2306.03881

Emergent Correspondence from Image Diffusion

Published on Jun 6, 2023

Submitted by akhaliq on Jun 7, 2023

Authors:

Luming Tang ,

Menglin Jia ,

Cheng Perng Phoo ,

Abstract

Finding correspondences between images is a fundamental problem in computer vision. In this paper, we show that correspondence emerges in image diffusion models without any explicit supervision. We propose a simple strategy to extract this implicit knowledge out of diffusion networks as image features, namely DIffusion FeaTures (DIFT), and use them to establish correspondences between real images. Without any additional fine-tuning or supervision on the task-specific data or annotations, DIFT is able to outperform both weakly-supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences. Particularly for semantic correspondence, DIFT from Stable Diffusion is able to outperform DINO and OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k benchmark. It even outperforms the state-of-the-art supervised methods on 9 out of 18 categories while remaining on par for the overall performance. Project page: https://diffusionfeatures.github.io
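The abstract does not spell out how features are turned into correspondences. One common approach with per-pixel features is a nearest-neighbor search: take the feature at a query point in one image and find the most similar feature in the other. The sketch below assumes cosine similarity as the measure and uses tiny hand-written feature grids in place of real diffusion features. The names `cosine` and `match_point` are hypothetical and do not come from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def match_point(src_feats, tgt_feats, query):
    """Find the target location whose feature best matches the query point.

    src_feats and tgt_feats are H x W grids (nested lists) of feature vectors.
    query is a (row, col) location in src_feats.
    Returns ((row, col), similarity) for the best match in tgt_feats.
    """
    qy, qx = query
    q = src_feats[qy][qx]
    best, best_score = None, -math.inf
    for y, row in enumerate(tgt_feats):
        for x, feat in enumerate(row):
            score = cosine(q, feat)
            if score > best_score:
                best, best_score = (y, x), score
    return best, best_score

# Toy 2 x 2 feature maps standing in for real features (assumption).
src = [[[1.0, 0.0], [0.0, 1.0]],
       [[1.0, 1.0], [-1.0, 0.0]]]
# The target is the source flipped left to right,
# so source point (0, 0) should land on target point (0, 1).
tgt = [row[::-1] for row in src]

print(match_point(src, tgt, (0, 0)))  # ((0, 1), 1.0)
```

Note that no labeled point pairs enter the matching step: the only inputs are the two feature grids and the query location.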

Community

pyp1

Jun 7, 2023

Very impressive work! Always amazed by the fascinating emergent properties of these pretrained models.

One nitpick: Stable Diffusion is trained on LAION (text-image pairs, with text as the condition), and therefore the model does receive explicit textual supervision.

Although the same properties also emerge from ADM, which is completely unsup

lt453

Paper author Jun 8, 2023

Thanks for your interest in our work and the kind words! We are also truly excited by these properties.

In this specific context, "explicit supervision" refers to correspondence supervision, i.e., (sparse or dense) labeled corresponding points between images, which is used neither by diffusion features nor by the other self-supervised learning features (DINO, OpenCLIP).
