Instruct-Imagen: Image Generation with Multi-modal Instruction
Abstract
This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using the retrieval-augmented training, to enhance model's capabilities to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that requires vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
Community
Open-sourced when???
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following (2023)
- Emu Edit: Precise Image Editing via Recognition and Generation Tasks (2023)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (2023)
- Customization Assistant for Text-to-image Generation (2023)
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
Unveiling Instruct-Imagen: Multi-modal Image Generation Revolution
Links ๐:
๐ Subscribe: https://www.youtube.com/@Arxflix
๐ Twitter: https://x.com/arxflix
๐ LMNT (Partner): https://lmnt.com/
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper