---
title: LLM Grounded Diffusion
emoji: π
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 3.35.2
app_file: app.py
pinned: true
tags:
- llm
- diffusion
- grounding
- grounded
- llm-grounded
- text-to-image
- language
- large language models
- layout
- generation
- generative
- customization
- personalization
- prompting
- chatgpt
- gpt-3.5
- gpt-4
---
# LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
LLM + Stable Diffusion => better prompt understanding in text2image generation 🤩
Project Page | 5-minute Blog Post | ArXiv Paper (ArXiv Abstract) | Github | Cite our work if our ideas inspire you.
Tips:
1. If ChatGPT doesn't generate a layout, add/remove the trailing space (added by default) and/or use GPT-4.
2. You can perform multi-round specification by giving ChatGPT follow-up requests (e.g., "make the object boxes bigger").
3. You can also try prompts in Simplified Chinese. If you want to try prompts in another language, translate the first line of the last example into your language.
4. The diffusion model only runs 20 steps by default. You can increase this to 50 steps for higher-quality images (or tweak the frozen steps/guidance steps for better guidance and coherence).
5. Duplicate this space and add GPU to skip the queue and run our model faster. {duplicate_html}
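As an illustration of tip 2, a follow-up request like "make the object boxes bigger" amounts to scaling each layout box about its center. Below is a minimal sketch of that operation; the `enlarge_box` helper and the normalized `[x, y, w, h]` box format in the unit square are assumptions for illustration, not the demo's actual code.

```python
# Hypothetical helper: enlarge a layout box around its center, clamped to the
# unit square. Assumes normalized [x, y, w, h] boxes; the demo's internal
# representation may differ.

def enlarge_box(box, scale=1.2):
    """Scale a [x, y, w, h] box about its center, keeping it inside [0, 1]^2."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2          # box center
    new_w, new_h = min(w * scale, 1.0), min(h * scale, 1.0)
    # Recompute the top-left corner so the center stays put, then clamp.
    new_x = min(max(cx - new_w / 2, 0.0), 1.0 - new_w)
    new_y = min(max(cy - new_h / 2, 0.0), 1.0 - new_h)
    return [round(new_x, 3), round(new_y, 3), round(new_w, 3), round(new_h, 3)]

bigger = enlarge_box([0.1, 0.2, 0.3, 0.3])
```

Scaling about the center (rather than the corner) keeps objects roughly where the LLM placed them, which matters for layout-guided generation.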
Implementation note: In this demo, we replace the attention manipulation in our layout-guided Stable Diffusion described in our paper with GLIGEN because of its much faster inference (FlashAttention is supported, and no backpropagation is needed during inference). Compared to vanilla GLIGEN, we achieve better coherence. Other parts of the text-to-image pipeline, including single-object generation and SAM, remain the same. The settings and examples in the prompt are simplified in this demo.
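To make the grounding step concrete, the LLM's layout reply has to be turned into phrases and boxes that a GLIGEN-style pipeline can consume. The sketch below is a hedged illustration: the reply format (a Python-style list of `(caption, [x, y, w, h])` pairs) and the `parse_layout` helper are assumptions, not the demo's actual parser.

```python
# Hypothetical parser: extract grounding phrases and boxes from an LLM reply.
# Assumes the reply embeds a literal list of (caption, [x, y, w, h]) pairs
# with normalized coordinates; the demo's real prompt format may differ.
import ast

def parse_layout(reply: str):
    # Grab the outermost bracketed list and evaluate it safely.
    start, end = reply.index("["), reply.rindex("]") + 1
    pairs = ast.literal_eval(reply[start:end])
    phrases = [caption for caption, _ in pairs]
    # GLIGEN-style pipelines typically take corner-format [x0, y0, x1, y1] boxes.
    boxes = [[x, y, x + w, y + h] for _, (x, y, w, h) in pairs]
    return phrases, boxes

reply = ("Objects: [('a red cube', [0.1, 0.5, 0.3, 0.3]), "
         "('a blue sphere', [0.6, 0.5, 0.3, 0.3])]")
phrases, boxes = parse_layout(reply)
```

`ast.literal_eval` is used instead of `eval` so that arbitrary code in a model reply cannot execute.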
Credits:
This space uses code from diffusers, GLIGEN, and layout-guidance. Using their code means adhering to their licenses.