GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
1CUHK MMLab, 2HKU MMLab, 3SenseTime, 4Shanghai AI Laboratory, 5Tsinghua University, 6Beihang University
*Equal contribution
Introduction
We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.
GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through the following (an illustrative example appears after this list):
- Semantic-Spatial Reasoning: Integrates both semantic understanding and explicit spatial coordinates
- Unified Framework: Handles both image generation and editing with the same architecture
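To make the paradigm concrete, here is a hypothetical sketch of what a reasoning-guided record could look like: a prompt, a GoT reasoning chain describing the scene, and explicit bounding boxes for each object. The field names and the (x1, y1, x2, y2) pixel convention are illustrative assumptions, not the released annotation format.

```python
# Hypothetical GoT-style record: prompt, reasoning chain, and spatial layout.
# Field names and the (x1, y1, x2, y2) pixel convention are assumptions for
# illustration only, not the released annotation format.
got_record = {
    "prompt": "A red apple on a wooden table next to a glass of water",
    "reasoning_chain": (
        "The scene shows a wooden table occupying the lower half of the image. "
        "A red apple rests on the table, left of center. "
        "A glass of water stands to the right of the apple."
    ),
    "objects": [
        {"name": "wooden table",   "bbox": [0, 260, 512, 512]},
        {"name": "red apple",      "bbox": [140, 300, 230, 390]},
        {"name": "glass of water", "bbox": [300, 250, 370, 400]},
    ],
}
```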
Released Datasets
Dataset | Link | Amount |
---|---|---|
Laion-Aesthetics-High-Resolution-GoT | 🤗 HuggingFace | 3.77M |
JourneyDB-GoT | 🤗 HuggingFace | 4.09M |
OmniEdit-GoT | 🤗 HuggingFace | 736K |
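Assuming the annotations are hosted as standard Hugging Face datasets, they can typically be loaded with the `datasets` library; the repository ID below is a placeholder to replace with the actual ID behind the links above. Note that for JourneyDB-GoT and OmniEdit-GoT the images themselves come from the original datasets (see below).

```python
# Sketch of loading a GoT annotation set with the Hugging Face `datasets`
# library. The repository ID is a placeholder; use the one linked above.
from datasets import load_dataset

ds = load_dataset("<org>/JourneyDB-GoT", split="train")  # placeholder repo ID
print(ds[0])  # one record: prompt, GoT description, bounding boxes, ...
```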
Dataset Features
Laion-Aesthetics-High-Resolution-GoT
- 3.77 million high-quality images from Laion-Aesthetics, filtered for resolutions larger than 512 pixels
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 110.81 characters
- GoT descriptions averaging 811.56 characters
- 3.78 bounding boxes per image on average
JourneyDB-GoT
- 4.09 million high-quality AI-generated images
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 149.78 characters
- GoT descriptions averaging 906.01 characters
- 4.09 bounding boxes per image on average
- Please download the images from the original JourneyDB dataset
OmniEdit-GoT
- 736K high-quality image editing samples from OmniEdit
- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
- Detailed reasoning chains with step-by-step editing processes
- Precise spatial coordinate annotations for editing regions
- Please download the images from the original OmniEdit dataset
Model Features
Our GoT framework consists of two key components (a high-level flow is sketched after this list):
- Semantic-Spatial MLLM: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
- SSGM Diffusion Module: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs
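At a high level, generation proceeds in two stages: the MLLM first produces the reasoning chain together with a spatial layout, and the diffusion module then renders the image conditioned on both. The outline below is a hypothetical sketch of that flow; the function, class, and method names are illustrative, not the released API.

```python
# Hypothetical outline of the two-stage GoT flow; names are illustrative.
def generate(prompt, mllm, ssgm_diffusion, reference_image=None):
    # Stage 1: the Semantic-Spatial MLLM reasons about the scene and emits a
    # GoT chain plus bounding boxes for the objects it mentions.
    reasoning_chain, layout_boxes = mllm.plan(prompt, image=reference_image)

    # Stage 2: the SSGM diffusion module renders the output conditioned on the
    # semantic chain, the spatial layout, and (for editing) the reference image.
    return ssgm_diffusion.sample(
        semantics=reasoning_chain,
        layout=layout_boxes,
        reference=reference_image,
    )
```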
The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways (a composition sketch follows this list):
- Semantic Guidance: Captures relationships and attributes
- Spatial Guidance: Controls precise object placement
- Reference Guidance: Provides context for editing tasks
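As rough intuition for how the three pathways can be combined at sampling time, here is a minimal, hypothetical sketch in the style of multi-condition classifier-free guidance. The function name, guidance scales, and composition order are assumptions for illustration, not the released SSGM implementation.

```python
def composed_guidance(eps_uncond, eps_semantic, eps_spatial, eps_reference,
                      s_sem=4.0, s_spa=2.0, s_ref=1.5):
    """Hypothetical CFG-style combination of the three guidance pathways.
    Inputs are noise predictions under each condition (e.g. torch tensors);
    the scales and additive composition are illustrative assumptions."""
    return (eps_uncond
            + s_sem * (eps_semantic - eps_uncond)    # semantic guidance
            + s_spa * (eps_spatial - eps_uncond)     # spatial (layout) guidance
            + s_ref * (eps_reference - eps_uncond))  # reference-image guidance
```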
Results
Text-to-Image Generation
GoT achieves state-of-the-art overall performance on the GenEval benchmark among the compared methods, with particularly strong results on the counting and colors subtasks:
Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
---|---|---|---|---|---|---|---|---|
SD-XL | UNet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
SD3 | MMDiT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
GoT Framework | UNet+Qwen2.5-VL | 0.64 | 0.99 | 0.69 | 0.67 | 0.85 | 0.34 | 0.27 |
Image Editing
Our approach also demonstrates superior performance on image editing benchmarks:
Method | Emu-Edit CLIP-I | Emu-Edit CLIP-T | ImagenHub GPT-4o Eval. | Reason-Edit GPT-4o Eval. |
---|---|---|---|---|
IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
GoT Framework | 0.864 | 0.276 | 0.533 | 0.561 |
Usage
Dependencies
- Python >= 3.8 (Anaconda recommended)
- PyTorch >= 2.0.1
- NVIDIA GPU + CUDA
Installation
Clone the repository and install the required packages:
git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt
Model Weights
Place the required model weights in the ./pretrained directory as follows:
- GoT-6B model weights
- Qwen2.5-VL-3B-Instruct
- Stable Diffusion XL Base 1.0
Your directory structure should match the following:
GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
└── ...
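One way to populate this layout is with `huggingface_hub`. The Qwen and Stability AI repository IDs below are the public base models; the GoT-6B ID is a placeholder to replace with the released checkpoint location.

```python
# Sketch of fetching the required weights into ./pretrained.
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct",
                  local_dir="pretrained/Qwen2.5-VL-3B-Instruct")
snapshot_download("stabilityai/stable-diffusion-xl-base-1.0",
                  local_dir="pretrained/stable-diffusion-xl-base-1.0")
snapshot_download("<org>/GoT-6B",  # placeholder: replace with the released repo ID
                  local_dir="pretrained/GoT-6B")
```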
License
This code is released under the MIT License.