# GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Rongyao Fang¹\*, Chengqi Duan²\*, Kun Wang³, Linjiang Huang⁶, Hao Li¹,⁴, Shilin Yan, Hao Tian³, Xingyu Zeng³, Rui Zhao³, Jifeng Dai⁴,⁵, Xihui Liu², Hongsheng Li¹

¹CUHK MMLab, ²HKU MMLab, ³SenseTime, ⁴Shanghai AI Laboratory, ⁵Tsinghua University, ⁶Beihang University

\*Equal contribution
## Introduction

We present **Generation Chain-of-Thought (GoT)**, a novel paradigm that enables visual generation and editing through an explicit language reasoning process before any image is produced. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:

- **Semantic-Spatial Reasoning**: Integrates both semantic understanding and explicit spatial coordinates
- **Unified Framework**: Handles both image generation and editing with the same architecture

## Released Datasets

| Dataset | Link | Amount |
|---------|------|--------|
| **Laion-Aesthetics-High-Resolution-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/Laion-Aesthetics-High-Resolution-GoT) | 3.77M |
| **JourneyDB-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/JourneyDB-GoT) | 4.09M |
| **OmniEdit-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/OmniEdit-GoT) | 736K |

## Dataset Features

### Laion-Aesthetics-High-Resolution-GoT

- 3.77 million high-quality images from Laion-Aesthetics, filtered for resolutions above 512 pixels
- Prompts and GoT descriptions generated by Qwen2-VL
- Prompts averaging 110.81 characters
- GoT descriptions averaging 811.56 characters
- 3.78 bounding boxes per image on average

### JourneyDB-GoT

- 4.09 million high-quality AI-generated images
- Prompts and GoT descriptions generated by Qwen2-VL
- Prompts averaging 149.78 characters
- GoT descriptions averaging 906.01 characters
- 4.09 bounding boxes per image on average
- Please download the images from the [JourneyDB dataset](https://opendatalab.com/OpenDataLab/JourneyDB/tree/main/raw/JourneyDB/train/imgs)

### OmniEdit-GoT

- 736K high-quality image editing samples from OmniEdit
- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
- Detailed reasoning chains with step-by-step editing processes
- Precise spatial coordinate annotations for editing regions
- Please download the images from the [OmniEdit dataset](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M)
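All three annotation sets above are hosted on Hugging Face, so they can be inspected with the `datasets` library. The snippet below is a minimal sketch: the `train` split name and the exact record fields are assumptions (check each dataset card for the actual schema), and streaming avoids downloading millions of records just to look at one.

```python
from datasets import load_dataset

# Stream the annotations rather than downloading all 3.77M records up front.
# The "train" split name is an assumption; check the dataset card if it differs.
ds = load_dataset(
    "LucasFang/Laion-Aesthetics-High-Resolution-GoT",
    split="train",
    streaming=True,
)

sample = next(iter(ds))
print(sample.keys())  # inspect available fields (prompt, GoT description, boxes, ...)
```

For JourneyDB-GoT and OmniEdit-GoT, the records carry annotations only; the corresponding images come from the external links listed above.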
## Model Features

Our GoT framework consists of two key components:

1. **Semantic-Spatial MLLM**: Generates detailed reasoning chains with spatial information, using Qwen2.5-VL as the backbone
2. **SSGM Diffusion Module**: Leverages the semantic guidance, spatial layouts, and reference images to produce high-quality visual outputs

The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways (sketched below):

- **Semantic Guidance**: Captures relationships and attributes
- **Spatial Guidance**: Controls precise object placement
- **Reference Guidance**: Provides context for editing tasks
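This README does not spell out how the three pathways are combined numerically. As a purely illustrative sketch, assuming each pathway yields its own noise prediction from the denoiser, a multi-condition classifier-free-guidance-style blend could look like the following; the function name and the weights `w_sem`, `w_spa`, `w_ref` are hypothetical, not values from the paper.

```python
import torch

def combine_guidance(
    eps_uncond: torch.Tensor,     # unconditional noise prediction
    eps_semantic: torch.Tensor,   # conditioned on the GoT reasoning text
    eps_spatial: torch.Tensor,    # conditioned on the bounding-box layout
    eps_reference: torch.Tensor,  # conditioned on the reference image (editing)
    w_sem: float = 4.0,           # hypothetical guidance weights, not tuned values
    w_spa: float = 2.0,
    w_ref: float = 1.0,
) -> torch.Tensor:
    """Classifier-free-guidance-style blend of the three guidance pathways."""
    return (
        eps_uncond
        + w_sem * (eps_semantic - eps_uncond)
        + w_spa * (eps_spatial - eps_uncond)
        + w_ref * (eps_reference - eps_uncond)
    )
```

For pure text-to-image generation the reference pathway would simply be dropped (`w_ref = 0`); the actual SSGM combination rule may differ from this linear blend.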
## Results

### Text-to-Image Generation

GoT achieves state-of-the-art performance on the GenEval benchmark, particularly excelling in composition tasks:

| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|--------|--------------|---------|-------------|----------|----------|--------|----------|---------------|
| SD-XL | UNet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| SD3 | MMDiT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| **GoT Framework** | UNet+Qwen2.5-VL | **0.64** | **0.99** | 0.69 | **0.67** | **0.85** | 0.34 | 0.27 |

### Image Editing

Our approach also demonstrates superior performance on image editing benchmarks:
| Method | Emu-Edit (CLIP-I) | Emu-Edit (CLIP-T) | ImagenHub (GPT-4o Eval.) | Reason-Edit (GPT-4o Eval.) |
|--------|-------------------|-------------------|--------------------------|----------------------------|
| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
| **GoT Framework** | **0.864** | **0.276** | **0.533** | **0.561** |
## Usage

### Dependencies

- Python >= 3.8 (we recommend [Anaconda](https://www.anaconda.com/download/#linux))
- [PyTorch >= 2.0.1](https://pytorch.org/)
- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)

### Installation

Clone the repo and install the dependent packages:

```bash
git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt
```

### Model Weights

Place the required model weights in the `./pretrained` directory as follows:

1. GoT-6B model weights
2. [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
3. [Stable Diffusion XL Base 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)

Your directory structure should match the following (a loading sketch for the two off-the-shelf backbones appears after the License section):

```
GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
├── ...
```

## License

This code is released under the MIT License.
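The repository's own scripts define how GoT-6B drives generation, and those entry points are not documented in this README, so they are not shown here. As a minimal sanity check, assuming a `transformers` release with Qwen2.5-VL support and a current `diffusers` release, the two off-the-shelf backbones can be loaded directly from `./pretrained`:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the MLLM backbone (reasoning side).
mllm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "./pretrained/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("./pretrained/Qwen2.5-VL-3B-Instruct")

# Load the SDXL base (diffusion side).
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "./pretrained/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
```

If both loads succeed, the directory layout above is correct; the GoT-6B checkpoint itself is loaded by the repository's code rather than by these generic APIs.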