ThinkGen: Generalized Thinking for Visual Generation
ThinkGen is the first think-driven visual generation framework that explicitly leverages the Chain-of-Thought (CoT) reasoning of Multimodal Large Language Models (MLLMs) across a range of generation scenarios. It uses a decoupled architecture that pairs a pretrained MLLM with a Diffusion Transformer (DiT): the MLLM generates tailored instructions from the user's intent, and the DiT produces high-quality images guided by those instructions.
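To make the decoupled design concrete, here is a minimal sketch of the two-stage flow. It is not ThinkGen's implementation: `GenerationResult`, `run_decoupled_pipeline`, `toy_mllm`, and `toy_dit` are hypothetical stand-ins that only show how an instruction stage can hand its output to a separate rendering stage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GenerationResult:
    instruction: str  # text handed from the instruction stage to the render stage
    image: str        # placeholder string standing in for real image data


def run_decoupled_pipeline(
    user_prompt: str,
    instruct: Callable[[str], str],
    render: Callable[[str], str],
) -> GenerationResult:
    """Run the two stages in sequence: prompt -> instruction -> image."""
    instruction = instruct(user_prompt)
    return GenerationResult(instruction=instruction, image=render(instruction))


# Hypothetical stand-ins: a real system would call an MLLM and a DiT here.
def toy_mllm(prompt: str) -> str:
    return f"Detailed scene: {prompt.strip()}"


def toy_dit(instruction: str) -> str:
    return f"<image rendered from: {instruction}>"


result = run_decoupled_pipeline("a cat on a sofa", toy_mllm, toy_dit)
print(result.instruction)
print(result.image)
```

Because the two stages only communicate through the instruction text, either one could in principle be swapped out without touching the other.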
Authors: Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei.
Quick Start
Environment Setup
# 1. Clone the repo
git clone https://github.com/jiaosiyuu/ThinkGen.git
cd ThinkGen
# 2. (Optional) Create a clean Python environment
conda create -n thinkgen python=3.11
conda activate thinkgen
# 3. Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r req.txt
# ThinkGen runs without flash-attn, but we recommend installing it for best performance.
pip install --no-cache-dir flash-attn==2.7.4.post1 --no-build-isolation
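Since flash-attn is optional, it can be handy to confirm whether it is present before running inference. The snippet below is a small sketch using only the standard library; the helper name `has_module` is made up for illustration, and it assumes the package is importable as `flash_attn`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


if has_module("flash_attn"):
    print("flash-attn is installed.")
else:
    print("flash-attn not found; ThinkGen should still run, possibly slower.")
```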
Sample Usage
from ThinkGen.model import ThinkGen_Chat
import os
model_path = "JSYuuu/ThinkGen"
chat_model = ThinkGen_Chat(
    model_path=model_path,
    dtype='bf16',
    height=1024,
    width=1024
)
# 1. Image Generation
messages = [
    {"type": "text", "value": "A young woman wearing a straw hat, standing in a golden wheat field."}
]
results = chat_model.generate_image(messages)
results.images[0].save("result.png")
# 2. Image Generation with Thinking (CoT)
# This enables the MLLM's CoT reasoning for generation
results_think = chat_model.generate_image(messages, think=True)
print(f"cot & rewrite prompt:\n{results_think.prompt_cot}")
results_think.images[0].save("result_think.png")
# 3. Image Understanding
messages_und = [
    {"type": "image", "value": "images/teaser.png"},
    {"type": "text", "value": "Describe this image"}
]
response = chat_model.generate_text(messages_und)
print(response)
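The examples above pass messages as a list of `{"type": ..., "value": ...}` dictionaries. As a convenience, a small helper can assemble that list. This is only a sketch: `build_messages` is a hypothetical function, not part of ThinkGen, and it assumes the image entry precedes the text entry as in the image-understanding example.

```python
from typing import Dict, List, Optional


def build_messages(text: str, image_path: Optional[str] = None) -> List[Dict[str, str]]:
    """Build a message list in the {"type", "value"} format used above."""
    messages: List[Dict[str, str]] = []
    if image_path is not None:
        # Image entry goes first, mirroring the image-understanding example.
        messages.append({"type": "image", "value": image_path})
    messages.append({"type": "text", "value": text})
    return messages


text_only = build_messages("A red bicycle leaning against a brick wall.")
with_image = build_messages("Describe this image", image_path="images/teaser.png")
print(text_only)
print(with_image)
```

The resulting lists could then be passed to `generate_image` or `generate_text` in the same way as the hand-written ones.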
Acknowledgments
This work builds upon the following great open-source projects:
- OmniGen2: https://github.com/VectorSpaceLab/OmniGen2
- Qwen3VL: https://github.com/QwenLM/Qwen3-VL
- EasyR1: https://github.com/hiyouga/EasyR1
- Flow-GRPO: https://github.com/yifan123/flow_grpo
Citation
@article{jiao2025thinkgen,
title={ThinkGen: Generalized Thinking for Visual Generation},
author={Jiao, Siyu and Lin, Yiheng and Zhong, Yujie and She, Qi and Zhou, Wei and Lan, Xiaohan and Huang, Zilong and Yu, Fei and Yu, Yingchen and Zhao, Yunqing and Zhao, Yao and Wei, Yunchao},
journal={arXiv preprint arXiv:2512.23568},
year={2025}
}
License
This work is licensed under the Apache 2.0 license.