
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Rongyao Fang¹*, Chengqi Duan²*, Kun Wang³, Linjiang Huang⁶, Hao Li¹,⁴, Shilin Yan, Hao Tian³, Xingyu Zeng³, Rui Zhao³, Jifeng Dai⁴,⁵, Xihui Liu², Hongsheng Li¹

¹CUHK MMLab, ²HKU MMLab, ³SenseTime, ⁴Shanghai AI Laboratory, ⁵Tsinghua University, ⁶Beihang University

*Equal contribution

Paper • Introduction • Datasets • Model • Results • 🤗 Hugging Face • License

Introduction

We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.

GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:

  • Semantic-Spatial Reasoning: Integrates both semantic understanding and explicit spatial coordinates
  • Unified Framework: Handles both image generation and editing with the same architecture
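
To make this concrete, the sketch below shows what a GoT-style record might look like in Python: a prompt paired with a reasoning chain that names each object and pins it to a bounding box. The field names and coordinate convention here are illustrative assumptions, not the exact released schema.

# A hypothetical GoT-style annotation record; field names and the
# coordinate convention are assumptions for exposition only.
example = {
    "prompt": "A red car parked in front of a blue house.",
    "got": (
        "The scene depicts a blue house occupying the background and a red car "
        "parked in front of it. The blue house spans (52, 30), (460, 310) and "
        "the red car spans (110, 280), (400, 470)."
    ),
    # Boxes as [x1, y1, x2, y2], one per grounded object phrase.
    "boxes": {"blue house": [52, 30, 460, 310], "red car": [110, 280, 400, 470]},
}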

Released Datasets

| Dataset | Link | Amount |
| --- | --- | --- |
| Laion-Aesthetics-High-Resolution-GoT | 🤗 HuggingFace | 3.77M |
| JourneyDB-GoT | 🤗 HuggingFace | 4.09M |
| OmniEdit-GoT | 🤗 HuggingFace | 736K |

Dataset Features

Laion-Aesthetics-High-Resolution-GoT

  • 3.77 million high-quality images from Laion-Aesthetics, filtered to resolutions above 512 pixels
  • Prompts and GoT descriptions from Qwen2-VL
  • Prompts averaging 110.81 characters
  • GoT descriptions averaging 811.56 characters
  • 3.78 bounding boxes per image on average

JourneyDB-GoT

  • 4.09 million high-quality AI-generated images
  • Prompts and GoT descriptions from Qwen2-VL
  • Prompts averaging 149.78 characters
  • GoT descriptions averaging 906.01 characters
  • 4.09 bounding boxes per image on average
  • Please download the images from the JourneyDB dataset

OmniEdit-GoT

  • 736K high-quality image editing samples from OmniEdit
  • Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
  • Detailed reasoning chains with step-by-step editing processes
  • Precise spatial coordinate annotations for editing regions
  • Please download the images from the OmniEdit dataset
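
The annotations themselves can be browsed directly with the Hugging Face datasets library. A minimal sketch, assuming the datasets live under this card's namespace (adjust the repository ID if it differs):

# Stream a single record instead of downloading millions of samples.
# The repo ID is an assumption based on this card's namespace.
from datasets import load_dataset

ds = load_dataset("LucasFang/JourneyDB-GoT", split="train", streaming=True)
sample = next(iter(ds))       # fetch one record lazily
print(sorted(sample.keys()))  # inspect the available annotation fields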

Model Features

Our GoT framework consists of two key components:

  1. Semantic-Spatial MLLM: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
  2. SSGM Diffusion Module: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs

The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways:

  • Semantic Guidance: Captures relationships and attributes
  • Spatial Guidance: Controls precise object placement
  • Reference Guidance: Provides context for editing tasks
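
As a rough illustration of how these pathways might combine during denoising, here is a conceptual sketch in the spirit of multi-condition classifier-free guidance. The `unet` call signature, keyword arguments, and guidance weights are assumptions for exposition, not the released SSGM implementation.

# Conceptual sketch: compose unconditional and per-pathway noise
# predictions at one denoising step. All names/weights are illustrative.
def ssgm_noise_pred(unet, x_t, t, sem_emb, spatial_map, ref_latent,
                    w_sem=4.0, w_spa=2.0, w_ref=1.5):
    eps_uncond = unet(x_t, t)                                      # no conditioning
    eps_sem = unet(x_t, t, context=sem_emb)                        # semantic guidance
    eps_spa = unet(x_t, t, context=sem_emb, layout=spatial_map)    # + spatial layout
    eps_ref = unet(x_t, t, context=sem_emb, reference=ref_latent)  # + reference image
    # Push the prediction toward each conditional branch relative to its baseline.
    return (eps_uncond
            + w_sem * (eps_sem - eps_uncond)
            + w_spa * (eps_spa - eps_sem)
            + w_ref * (eps_ref - eps_sem))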

Results

Text-to-Image Generation

GoT achieves the best overall score on the GenEval benchmark among the compared methods, with particularly strong results on single-object, counting, and color tasks:

| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SD-XL | UNet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| SD3 | MMDiT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| GoT Framework | UNet+Qwen2.5-VL | 0.64 | 0.99 | 0.69 | 0.67 | 0.85 | 0.34 | 0.27 |

Image Editing

Our approach also demonstrates superior performance on image editing benchmarks:

| Method | Emu-Edit CLIP-I | Emu-Edit CLIP-T | ImagenHub (GPT-4o Eval.) | Reason-Edit (GPT-4o Eval.) |
| --- | --- | --- | --- | --- |
| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
| GoT Framework | 0.864 | 0.276 | 0.533 | 0.561 |

Usage

Installation

Clone the repository and install the required packages:

git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt

Model Weights

Place the required model weights in the ./pretrained directory as follows:

  1. GoT-6B model weights
  2. Qwen2.5-VL-3B-Instruct
  3. Stable Diffusion XL Base 1.0

Your directory structure should match the following:

GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
├── ...
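
Before running inference, a quick check that the layout above is in place can save a failed run. A small convenience helper, not part of the repository:

# Verify the expected weight directories exist under ./pretrained.
from pathlib import Path

required = ["GoT-6B", "Qwen2.5-VL-3B-Instruct", "stable-diffusion-xl-base-1.0"]
missing = [m for m in required if not (Path("pretrained") / m).is_dir()]
if missing:
    raise FileNotFoundError(f"missing weights under ./pretrained: {missing}")
print("All pretrained weights found.")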

License

This code is released under the MIT License.