MV-Adapter: Multi-view Consistent Image Generation Made Easy
Abstract
Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and a parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768×768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.
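To make the "duplicated self-attention layers and parallel attention architecture" in the abstract more concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation. It assumes identity query/key/value projections, and the names `attention`, `parallel_attention`, and `mv_gate` are hypothetical. The frozen branch attends within each view, the duplicated (adapter) branch attends across the tokens of all views, and both read the same input so their outputs can simply be summed.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors.

    Assumption: projections are identity, so tokens act directly as Q, K, V.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def parallel_attention(views, mv_gate=1.0):
    """Toy parallel attention block.

    views: list of views, each a list of token vectors (n_views x n_tokens x dim).
    mv_gate: hypothetical scalar on the adapter branch; 0.0 switches the
             adapter off and leaves only the frozen path.
    """
    all_tokens = [tok for view in views for tok in view]
    out = []
    for view in views:
        # Frozen branch: the base model's spatial self-attention within one view.
        spatial = attention(view, view, view)
        # Adapter branch: a duplicated layer whose keys/values span all views.
        multiview = attention(view, all_tokens, all_tokens)
        # Both branches read the same input; outputs are added to the residual.
        out.append([[x + s + mv_gate * m for x, s, m in zip(xt, st, mt)]
                    for xt, st, mt in zip(view, spatial, multiview)])
    return out


if __name__ == "__main__":
    views = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
    print(parallel_attention(views, mv_gate=0.0))  # frozen path only
    print(parallel_attention(views, mv_gate=1.0))  # with the multi-view branch
```

With `mv_gate=0.0` the block reduces to the residual plus the frozen self-attention output. In this sketch that is one way to picture how an adapter branch can be added without changing what the base model computes when the branch contributes nothing.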
Community
🔥Multi-view Generation Made Easy🔥
We present MV-Adapter, a versatile plug-and-play adapter that seamlessly transforms T2I models into multi-view generators.
Highlights:
- Generate 768x768 multi-view images using SDXL or any personalized model
- Support text-to-multiview, image-to-multiview, text-/image-and-geometry-to-multiview
- Support arbitrary view generation
- Support text/image-to-3D, text/image-to-texture
[code and demo released🚀]
Project page: https://huanngzh.github.io/MV-Adapter-Page/
Model zoo and demos: https://github.com/huanngzh/MV-Adapter?tab=readme-ov-file#model-zoo--demos
Code: https://github.com/huanngzh/MV-Adapter
Paper: https://arxiv.org/abs/2412.03632