GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
1CUHK MMLab, 2HKU MMLab, 3SenseTime, 4Shanghai AI Laboratory, 5Tsinghua University, 6Beihang University
*Equal contribution
Introduction
We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.
GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through the following (an illustrative example appears after this list):
- Semantic-Spatial Reasoning: Integrates both semantic understanding and explicit spatial coordinates
- Unified Framework: Handles both image generation and editing with the same architecture
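To make the paradigm concrete, here is a hypothetical sketch of what a reasoning-guided record could look like: a prompt, a GoT reasoning chain describing the scene, and explicit bounding boxes for each object. The field names and the (x1, y1, x2, y2) pixel convention are illustrative assumptions, not the released annotation format.

```python
# Hypothetical GoT-style record: prompt, reasoning chain, and spatial layout.
# Field names and the (x1, y1, x2, y2) pixel convention are assumptions for
# illustration only, not the released annotation format.
got_record = {
    "prompt": "A red apple on a wooden table next to a glass of water",
    "reasoning_chain": (
        "The scene shows a wooden table occupying the lower half of the image. "
        "A red apple rests on the table, left of center. "
        "A glass of water stands to the right of the apple."
    ),
    "objects": [
        {"name": "wooden table",   "bbox": [0, 260, 512, 512]},
        {"name": "red apple",      "bbox": [140, 300, 230, 390]},
        {"name": "glass of water", "bbox": [300, 250, 370, 400]},
    ],
}
```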
Released Datasets
Dataset | Link | Amount |
---|---|---|
Laion-Aesthetics-High-Resolution-GoT | 🤗 HuggingFace | 3.77M |
JourneyDB-GoT | 🤗 HuggingFace | 4.09M |
OmniEdit-GoT | 🤗 HuggingFace | 736K |
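Assuming the annotations are hosted as standard Hugging Face datasets, they can typically be loaded with the `datasets` library; the repository ID below is a placeholder to replace with the actual ID behind the links above. Note that for JourneyDB-GoT and OmniEdit-GoT the images themselves come from the original datasets (see below).

```python
# Sketch of loading a GoT annotation set with the Hugging Face `datasets`
# library. The repository ID is a placeholder; use the one linked above.
from datasets import load_dataset

ds = load_dataset("<org>/JourneyDB-GoT", split="train")  # placeholder repo ID
print(ds[0])  # one record: prompt, GoT description, bounding boxes, ...
```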
Dataset Features
Laion-Aesthetics-High-Resolution-GoT
- 3.77 million high-quality images from Laion-Aesthetics, filtered for resolutions larger than 512 pixels
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 110.81 characters
- GoT descriptions averaging 811.56 characters
- 3.78 bounding boxes per image on average
JourneyDB-GoT
- 4.09 million high-quality AI-generated images
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 149.78 characters
- GoT descriptions averaging 906.01 characters
- 4.09 bounding boxes per image on average
- Please download the images from the original JourneyDB dataset
OmniEdit-GoT
- 736K high-quality image editing samples from OmniEdit
- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
- Detailed reasoning chains with step-by-step editing processes
- Precise spatial coordinate annotations for editing regions
- Please download the images from the original OmniEdit dataset
Model Features
Our GoT framework consists of two key components (a high-level flow is sketched after this list):
- Semantic-Spatial MLLM: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
- SSGM Diffusion Module: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs
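At a high level, generation proceeds in two stages: the MLLM first produces the reasoning chain together with a spatial layout, and the diffusion module then renders the image conditioned on both. The outline below is a hypothetical sketch of that flow; the function, class, and method names are illustrative, not the released API.

```python
# Hypothetical outline of the two-stage GoT flow; names are illustrative.
def generate(prompt, mllm, ssgm_diffusion, reference_image=None):
    # Stage 1: the Semantic-Spatial MLLM reasons about the scene and emits a
    # GoT chain plus bounding boxes for the objects it mentions.
    reasoning_chain, layout_boxes = mllm.plan(prompt, image=reference_image)

    # Stage 2: the SSGM diffusion module renders the output conditioned on the
    # semantic chain, the spatial layout, and (for editing) the reference image.
    return ssgm_diffusion.sample(
        semantics=reasoning_chain,
        layout=layout_boxes,
        reference=reference_image,
    )
```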
The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways (a composition sketch follows this list):
- Semantic Guidance: Captures relationships and attributes
- Spatial Guidance: Controls precise object placement
- Reference Guidance: Provides context for editing tasks
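As rough intuition for how the three pathways can be combined at sampling time, here is a minimal, hypothetical sketch in the style of multi-condition classifier-free guidance. The function name, guidance scales, and composition order are assumptions for illustration, not the released SSGM implementation.

```python
def composed_guidance(eps_uncond, eps_semantic, eps_spatial, eps_reference,
                      s_sem=4.0, s_spa=2.0, s_ref=1.5):
    """Hypothetical CFG-style combination of the three guidance pathways.
    Inputs are noise predictions under each condition (e.g. torch tensors);
    the scales and additive composition are illustrative assumptions."""
    return (eps_uncond
            + s_sem * (eps_semantic - eps_uncond)    # semantic guidance
            + s_spa * (eps_spatial - eps_uncond)     # spatial (layout) guidance
            + s_ref * (eps_reference - eps_uncond))  # reference-image guidance
```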
Results
Text-to-Image Generation
GoT achieves state-of-the-art overall performance on the GenEval benchmark among the compared methods, with particularly strong results on the counting and colors subtasks:
Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
---|---|---|---|---|---|---|---|---|
SD-XL | UNet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
SD3 | MMDiT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
GoT Framework | UNet+Qwen2.5-VL | 0.64 | 0.99 | 0.69 | 0.67 | 0.85 | 0.34 | 0.27 |
Image Editing
Our approach also demonstrates superior performance on image editing benchmarks:
Method | Emu-Edit CLIP-I | Emu-Edit CLIP-T | ImagenHub GPT-4o Eval. | Reason-Edit GPT-4o Eval. |
---|---|---|---|---|
IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
GoT Framework | 0.864 | 0.276 | 0.533 | 0.561 |
Usage
Dependencies
- Python >= 3.8 (Anaconda recommended)
- PyTorch >= 2.0.1
- NVIDIA GPU + CUDA
Installation
Clone the repository and install the required packages:
git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt
Model Weights
Place the required model weights in the ./pretrained directory as follows:
- GoT-6B model weights
- Qwen2.5-VL-3B-Instruct
- Stable Diffusion XL Base 1.0
Your directory structure should match the following:
GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
└── ...
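One way to populate this layout is with `huggingface_hub`. The Qwen and Stability AI repository IDs below are the public base models; the GoT-6B ID is a placeholder to replace with the released checkpoint location.

```python
# Sketch of fetching the required weights into ./pretrained.
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct",
                  local_dir="pretrained/Qwen2.5-VL-3B-Instruct")
snapshot_download("stabilityai/stable-diffusion-xl-base-1.0",
                  local_dir="pretrained/stable-diffusion-xl-base-1.0")
snapshot_download("<org>/GoT-6B",  # placeholder: replace with the released repo ID
                  local_dir="pretrained/GoT-6B")
```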
License
This code is released under the MIT License.