ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
Abstract
This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large-scale supervision, but manually collecting sufficient data is simply too expensive. The key observation of this paper is that many mass-produced objects recur across multiple images in large unlabeled datasets, appearing in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using either a single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Unlike many other multi-reference methods, ObjectMate does not require slow test-time tuning.
Community
Explore our project page: https://object-mate.com
[TLDR]: We find that large-scale web datasets contain identical instances of objects that reappear in different poses and scenes (e.g., specific car models, laptops, IKEA furniture). We call this the Object Recurrence Prior. Leveraging it, we create a massive supervised dataset of 4.5M objects for subject-driven generation and object insertion. Our method achieves state-of-the-art identity preservation without requiring test-time fine-tuning.
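The Object Recurrence Prior can be illustrated with a minimal retrieval sketch: given per-crop feature embeddings, crops whose embeddings are near-duplicates are grouped and treated as multiple views of the same object instance. The embedding space, similarity threshold, and greedy grouping rule below are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of recurrence-based retrieval: group object crops
# whose embeddings have high cosine similarity, then keep only objects
# that recur across more than one image. All parameters are assumptions.
import numpy as np

def group_recurring_objects(embeddings, threshold=0.9):
    """Greedily group object crops by cosine similarity of embeddings."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups, assigned = [], np.zeros(len(normed), dtype=bool)
    for i in range(len(normed)):
        if assigned[i]:
            continue
        sims = normed @ normed[i]
        members = np.where((sims >= threshold) & ~assigned)[0]
        assigned[members] = True
        groups.append(members.tolist())
    # The recurrence prior: keep only objects seen in multiple crops.
    return [g for g in groups if len(g) > 1]

# Toy example: crops 0 and 2 are near-identical views; crop 1 is distinct.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.99, 0.05]])
print(group_recurring_objects(emb))  # [[0, 2]]
```

In a real pipeline, the embeddings would come from a pretrained visual encoder and the grouping would run over millions of crops with an approximate nearest-neighbor index rather than the brute-force loop shown here.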
The following similar papers were recommended by the Semantic Scholar API:
- Learning Complex Non-Rigid Image Edits from Multimodal Conditioning (2024)
- MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation (2024)
- DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting (2024)
- UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics (2024)
- LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations (2024)
- Conditional Text-to-Image Generation with Reference Guidance (2024)
- Harlequin: Color-Driven Generation of Synthetic Data for Referring Expression Comprehension (2024)