ThinkGen: Generalized Thinking for Visual Generation

ThinkGen is the first think-driven visual generation framework that explicitly leverages Multimodal Large Language Models' (MLLMs) Chain-of-Thought (CoT) reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and the DiT produces high-quality images guided by these instructions.
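
The decoupled flow described above can be illustrated with a small, model-free sketch. It is not ThinkGen's implementation: `think_then_generate`, `GenerationResult`, `toy_reason`, and `toy_render` are hypothetical names introduced only for illustration, with the two callables standing in for the MLLM and the DiT respectively.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class GenerationResult:
    reasoning: Optional[str]  # chain-of-thought text, or None when thinking is off
    instruction: str          # text the image generator was conditioned on
    image: Any                # whatever the renderer returns


def think_then_generate(
    user_prompt: str,
    reason: Callable[[str], Tuple[str, str]],
    render: Callable[[str], Any],
    think: bool = True,
) -> GenerationResult:
    """Two-stage flow: an optional reasoning/rewrite step, then rendering."""
    if think:
        # Stage 1 (hypothetical MLLM stand-in): reason, then emit an instruction.
        reasoning, instruction = reason(user_prompt)
    else:
        # Without thinking, the user prompt is passed through unchanged.
        reasoning, instruction = None, user_prompt
    # Stage 2 (hypothetical DiT stand-in): render from the instruction only.
    return GenerationResult(reasoning, instruction, render(instruction))


# Toy stand-ins so the sketch runs without any model weights.
def toy_reason(prompt: str) -> Tuple[str, str]:
    return (f"User wants: {prompt}", f"{prompt}, detailed, well lit")


def toy_render(instruction: str) -> str:
    return f"<image for: {instruction}>"


result = think_then_generate("a red bicycle", toy_reason, toy_render)
print(result.reasoning)
print(result.instruction)
```

The point of the separation is that the renderer only ever sees the instruction text, so the reasoning stage can be switched on or off without touching the image generator.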

Paper: ThinkGen: Generalized Thinking for Visual Generation (arXiv:2512.23568)
Code: https://github.com/jiaosiyuu/ThinkGen

Authors: Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei.

🚀 Quick Start

🛠️ Environment Setup

# 1. Clone the repo
git clone https://github.com/jiaosiyuu/ThinkGen.git
cd ThinkGen

# 2. (Optional) Create a clean Python environment
conda create -n thinkgen python=3.11
conda activate thinkgen

# 3. Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r req.txt

# ThinkGen runs without flash-attn, but we recommend installing it for best performance.
pip install --no-cache-dir flash-attn==2.7.4.post1 --no-build-isolation
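
After installation, a quick check like the one below may help confirm which packages are importable before loading the model. This is a minimal sketch, not part of the ThinkGen repository: `has_module` and `environment_report` are hypothetical helper names, and the only assumption about the packages is their import names (`torch`, `torchvision`, `flash_attn`).

```python
import importlib.util
import sys


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def environment_report() -> dict:
    """Collect the Python version and availability of the key packages."""
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "torch": has_module("torch"),
        "torchvision": has_module("torchvision"),
        # Optional dependency: recommended for speed, not required.
        "flash_attn": has_module("flash_attn"),
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `flash_attn` is reported as missing, the setup notes above indicate the model should still run, only without the recommended attention kernel.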

💻 Sample Usage

from ThinkGen.model import ThinkGen_Chat
import os

model_path = "JSYuuu/ThinkGen"

chat_model = ThinkGen_Chat(
    model_path=model_path,
    dtype='bf16',
    height=1024,
    width=1024
)

# 1. Image Generation
messages = [
    {"type": "text", "value": "A young woman wearing a straw hat, standing in a golden wheat field."}
]
results = chat_model.generate_image(messages)
results.images[0].save("result.png")

# 2. Image Generation with Thinking (CoT)
# This enables the MLLM's CoT reasoning for generation
results_think = chat_model.generate_image(messages, think=True)
print(f"cot & rewrite prompt:\n{results_think.prompt_cot}")
results_think.images[0].save("result_think.png")

# 3. Image Understanding
messages_und = [
    {"type": "image", "value": "images/teaser.png"},
    {"type": "text", "value": "Describe this image"}
]
response = chat_model.generate_text(messages_und)
print(response)
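
For more than one prompt, the calls shown above can be wrapped in a small loop. The sketch below is an assumption-laden convenience wrapper, not part of the ThinkGen API: `text_message` and `generate_batch` are hypothetical helpers, and the code relies only on the `generate_image(messages)` / `generate_image(messages, think=True)` calls and the `results.images[0].save(...)` pattern from the sample usage.

```python
from pathlib import Path
from typing import Iterable, List


def text_message(prompt: str) -> list:
    """Build the single-entry message list used in the sample usage."""
    return [{"type": "text", "value": prompt}]


def generate_batch(
    chat_model,
    prompts: Iterable[str],
    out_dir: str = "outputs",
    think: bool = False,
) -> List[Path]:
    """Generate one image per prompt and save each as result_NNN.png."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Pass think=True only when requested, mirroring the two calls shown above.
    kwargs = {"think": True} if think else {}
    saved = []
    for index, prompt in enumerate(prompts):
        results = chat_model.generate_image(text_message(prompt), **kwargs)
        path = out / f"result_{index:03d}.png"
        results.images[0].save(str(path))
        saved.append(path)
    return saved


# Example (assumes chat_model was created as in the sample usage):
# paths = generate_batch(chat_model, ["A red bicycle", "A snowy cabin"], think=True)
```

Because the model object is passed in rather than constructed inside the helper, the weights are loaded once and reused across all prompts.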

Acknowledgments

This work builds upon the following great open-source projects:

OmniGen2: https://github.com/VectorSpaceLab/OmniGen2
Qwen3VL: https://github.com/QwenLM/Qwen3-VL
EasyR1: https://github.com/hiyouga/EasyR1
Flow-GRPO: https://github.com/yifan123/flow_grpo

Citation

@article{jiao2025thinkgen,
  title={ThinkGen: Generalized Thinking for Visual Generation},
  author={Jiao, Siyu and Lin, Yiheng and Zhong, Yujie and She, Qi and Zhou, Wei and Lan, Xiaohan and Huang, Zilong and Yu, Fei and Yu, Yingchen and Zhao, Yunqing and Zhao, Yao and Wei, Yunchao},
  journal={arXiv preprint arXiv:2512.23568},
  year={2025}
}

License

This work is licensed under the Apache 2.0 license.
