KevinNg99 committed on
Commit c455378
1 Parent(s): 2d86132

Add README.

Files changed (2):
  1. README.md +257 -0
  2. checkpoints-download.md +74 -0
README.md ADDED
[δΈ­ζ–‡ι˜…θ―»](./README_CN.md)

<p align="center">
  <img src="./assets/logo.png" height=100>
</p>

<div align="center">

# HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation

</div>

<div align="center">
  <a href=https://github.com/Tencent-Hunyuan/HunyuanImage-2.1 target="_blank"><img src=https://img.shields.io/badge/Code-black.svg?logo=github height=22px></a>
  <a href="https://hunyuan.tencent.com" target="_blank"><img src="https://img.shields.io/badge/Demo%20Page-blue" height="22px"></a>
  <a href=https://huggingface.co/tencent/HunyuanImage-2.1 target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Models-d96902.svg height=22px></a>
  <a href="#" target="_blank"><img src="https://img.shields.io/badge/Report-Coming%20Soon-blue" height="22px"></a>
  <a href=https://hunyuan-promptenhancer.github.io/ target="_blank"><img src=https://img.shields.io/badge/PromptEnhancer-bb8a2e.svg?logo=github height=22px></a>
  <a href=https://x.com/TencentHunyuan target="_blank"><img src=https://img.shields.io/badge/Hunyuan-black.svg?logo=x height=22px></a>
</div>

-----

This repo contains PyTorch model definitions, pretrained weights, and inference/sampling code for HunyuanImage-2.1. You can find more visualizations on our [project page](https://hunyuan.tencent.com).

## πŸ”₯πŸ”₯πŸ”₯ Latest Updates

- September 8, 2025: πŸš€ Released inference code and model weights for HunyuanImage-2.1.

## πŸŽ₯ Demo

<div align="center">
  <img src="./assets/show_cases.png" width=100% alt="HunyuanImage 2.1 Demo">
</div>

## Contents
- [HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation](#hunyuanimage-21-an-efficient-diffusion-model-for-high-resolution-2k-text-to-image-generation)
- [πŸ”₯πŸ”₯πŸ”₯ Latest Updates](#-latest-updates)
- [πŸŽ₯ Demo](#-demo)
- [Contents](#contents)
- [**Abstract**](#abstract)
- [**HunyuanImage-2.1 Overall Pipeline**](#hunyuanimage-21-overall-pipeline)
- [πŸŽ‰ **HunyuanImage-2.1 Key Features**](#-hunyuanimage-21-key-features)
- [Prompt Enhanced Demo](#prompt-enhanced-demo)
- [πŸ“ˆ Comparisons](#-comparisons)
- [πŸ“œ System Requirements](#-system-requirements)
- [πŸ› οΈ Dependencies and Installation](#️-dependencies-and-installation)
- [🧱 Download Pretrained Models](#-download-pretrained-models)
- [πŸ”‘ Usage](#-usage)
- [πŸ”— BibTeX](#-bibtex)
- [Acknowledgements](#acknowledgements)
- [Github Star History](#github-star-history)

---
<!-- - [🧩 Community Contributions](#-community-contributions) -->
## Abstract
We present HunyuanImage-2.1, a highly efficient text-to-image model capable of generating 2K (2048 Γ— 2048) resolution images. Leveraging an extensive dataset and structured captions produced with multiple expert models, we significantly enhance text-image alignment. The model employs a highly expressive VAE with a 32 Γ— 32 spatial compression ratio, substantially reducing computational cost.

Our architecture consists of two stages:
1. Base Text-to-Image Model: The first stage is a text-to-image model that utilizes two text encoders: a multimodal large language model (MLLM) to improve image-text alignment, and a multi-language, character-aware encoder to enhance text rendering across various languages. This stage features a single- and dual-stream diffusion transformer with 17 billion parameters. To optimize aesthetics and structural coherence, we apply reinforcement learning from human feedback (RLHF).
2. Refiner Model: The second stage introduces a refiner model that further enhances image quality and clarity while minimizing artifacts.

Additionally, we developed the PromptEnhancer module to further boost model performance, and employed meanflow distillation for efficient inference. HunyuanImage-2.1 demonstrates robust semantic alignment and cross-scenario generalization, leading to improved consistency between text and image, enhanced control of scene details, character poses, and expressions, and the ability to generate multiple objects with distinct descriptions.

## HunyuanImage-2.1 Overall Pipeline

### Training Data and Caption

Structured captions provide hierarchical semantic information at short, medium, long, and extra-long levels, significantly enhancing the model's responsiveness to complex semantics. To address the shortcomings of general VLM captioners on dense text and world-knowledge descriptions, we introduce an OCR agent and IP RAG, while a bidirectional verification strategy ensures caption accuracy.

### Text-to-Image Model Architecture

<p align="center">
  <img src="./assets/framework_overall.png" width=100% alt="HunyuanImage 2.1 Architecture">
</p>

Core Components:
* High-Compression VAE with REPA Training Acceleration:
  * A VAE with a 32Γ— spatial compression rate drastically reduces the number of input tokens for the DiT model. By aligning its feature space with DINOv2 features, we facilitate the training of high-compression VAEs. As a result, our model generates 2K images with the same token length (and thus similar inference time) that other models require for 1K images, achieving superior inference efficiency (see the arithmetic sketch after this list).
  * A multi-bucket, multi-resolution REPA loss aligns DiT features with a high-dimensional semantic feature space, accelerating model convergence.
* Dual Text Encoder:
  * A vision-language multimodal encoder is employed to better understand scene descriptions, character actions, and detailed requirements.
  * A multilingual ByT5 text encoder is introduced to specialize in text generation and multilingual expression.
* Network: A single- and dual-stream diffusion transformer with 17 billion parameters.
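The efficiency claim in the first bullet is straightforward arithmetic on spatial compression. Below is an illustrative sketch; the 1K baseline of an 8Γ— VAE with 2Γ—2 patchification is a common DiT configuration we assume for comparison, and the patchify factor of 1 for HunyuanImage-2.1 is likewise an assumption.

```python
def latent_tokens(resolution: int, vae_factor: int, patch: int = 1) -> int:
    """Token count for a square image after VAE downsampling and patchification."""
    side = resolution // (vae_factor * patch)
    return side * side

# HunyuanImage-2.1: 2048px image through a 32x VAE (patchify factor assumed to be 1).
print(latent_tokens(2048, 32))          # 4096 tokens
# A typical 1K DiT baseline: 1024px image, 8x VAE, 2x2 patchify (assumed configuration).
print(latent_tokens(1024, 8, patch=2))  # 4096 tokens -- the same sequence length
```

Since attention cost grows quadratically with sequence length, holding the token count at 4096 is what keeps 2K generation near the cost of a 1K model.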
### Reinforcement Learning from Human Feedback
Two-stage post-training with reinforcement learning: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are applied sequentially in two post-training stages. We introduce a Reward Distribution Alignment algorithm, which incorporates high-quality images as selected samples to ensure stable and improved reinforcement learning outcomes.
### Rewriting Model
<p align="center">
  <img src="./assets/framework_prompt_rewrite.png" width=90% alt="PromptEnhancer Framework">
</p>

* PromptEnhancer is the first systematic, industrial-grade rewriting model. SFT training structurally rewrites user text instructions to enrich visual expression, while GRPO training employs a fine-grained semantic AlignEvaluator reward model to substantially improve the semantics of images generated from rewritten text. The AlignEvaluator covers 6 major categories and 24 fine-grained assessment points. PromptEnhancer supports both Chinese and English rewriting and demonstrates general applicability in enhancing semantics for both open-source and proprietary text-to-image models.
### Model Distillation
We propose a novel distillation method based on meanflow that addresses the key challenges of instability and inefficiency inherent in standard meanflow training. This approach enables high-quality image generation with only a few sampling steps. To our knowledge, this is the first successful application of meanflow to an industrial-scale model.
## πŸŽ‰ HunyuanImage-2.1 Key Features

- **High-Quality Generation**: Efficiently produces ultra-high-definition (2K) images with cinematic composition.
- **Multilingual Support**: Provides native support for both Chinese and English prompts.
- **Advanced Architecture**: Built on a multi-modal, single- and dual-stream combined DiT (Diffusion Transformer) backbone.
- **Glyph-Aware Processing**: Utilizes ByT5's text rendering capabilities for improved text generation accuracy.
- **Flexible Aspect Ratios**: Supports a variety of image aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3); see the illustrative sketch after this list.
- **Prompt Enhancement**: Automatically rewrites prompts to improve descriptive accuracy and visual quality.
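As a rough illustration of what these ratios mean at the 2K budget, the helper below computes 32-aligned dimensions with approximately a 2048Γ—2048 pixel area for a given ratio. The numbers are illustrative arithmetic only; the pipeline defines its own supported resolution buckets, which may differ.

```python
def dims_for_ratio(rw: int, rh: int, area: int = 2048 * 2048, align: int = 32) -> tuple[int, int]:
    """Approximate `align`-aligned (width, height) of ~`area` pixels at aspect ratio rw:rh."""
    w = (area * rw / rh) ** 0.5
    h = w * rh / rw
    return round(w / align) * align, round(h / align) * align

for ratio in [(1, 1), (16, 9), (9, 16), (4, 3), (3, 4), (3, 2), (2, 3)]:
    print(ratio, dims_for_ratio(*ratio))  # e.g. (16, 9) -> (2720, 1536); illustrative only
```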

## Prompt Enhanced Demo
To improve the quality and detail of generated images, we use a prompt rewriting model. This model automatically enhances user-provided text prompts by adding detailed and descriptive information.
<p align="center">
  <img src="./assets/reprompt.png" width=100% alt="Prompt Enhancement Demo">
</p>
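In the released pipeline, this rewriting step is toggled per call with the `use_reprompt` argument (the same parameter used in the full example under Usage below); a minimal sketch:

```python
from hyimage.diffusion.pipelines.hunyuanimage_pipeline import HunyuanImagePipeline

pipe = HunyuanImagePipeline.from_pretrained(model_name="hunyuanimage-v2.1", torch_dtype="bf16")
pipe = pipe.to("cuda")

# use_reprompt=True routes the raw prompt through the rewriting model before generation.
image = pipe(prompt="a corgi surfing at sunset", width=2048, height=2048, use_reprompt=True)
image.save("reprompted.png")
```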
## πŸ“ˆ Comparisons

### SSAE Evaluation
SSAE (Structured Semantic Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3500 key points across 12 categories, then used MLLMs to automatically score the generated images by comparing their visual content against these key points. Mean Image Accuracy first averages the key-point scores within each image and then averages across images, while Global Accuracy averages over all key points pooled across images.
<p align="center">
<table>
<thead>
<tr>
<th rowspan="2">Model</th> <th rowspan="2">Open Source</th> <th rowspan="2">Mean Image Accuracy</th> <th rowspan="2">Global Accuracy</th> <th colspan="4" style="text-align: center;">Primary Subject</th> <th colspan="3" style="text-align: center;">Secondary Subject</th> <th colspan="2" style="text-align: center;">Scene</th> <th colspan="3" style="text-align: center;">Other</th>
</tr>
<tr>
<th>Noun</th> <th>Key Attributes</th> <th>Other Attributes</th> <th>Action</th> <th>Noun</th> <th>Attributes</th> <th>Action</th> <th>Noun</th> <th>Attributes</th> <th>Shot</th> <th>Style</th> <th>Composition</th>
</tr>
</thead>
<tbody>
<tr>
<td>FLUX-dev</td> <td>βœ…</td> <td>0.7122</td> <td>0.6995</td> <td>0.7965</td> <td>0.7824</td> <td>0.5993</td> <td>0.5777</td> <td>0.7950</td> <td>0.6826</td> <td>0.6923</td> <td>0.8453</td> <td>0.8094</td> <td>0.6452</td> <td>0.7096</td> <td>0.6190</td>
</tr>
<tr>
<td>Seedream-3.0</td> <td>❌</td> <td>0.8827</td> <td>0.8792</td> <td>0.9490</td> <td>0.9311</td> <td>0.8242</td> <td>0.8177</td> <td>0.9747</td> <td>0.9103</td> <td>0.8400</td> <td>0.9489</td> <td>0.8848</td> <td>0.7582</td> <td>0.8726</td> <td>0.7619</td>
</tr>
<tr>
<td>Qwen-Image</td> <td>βœ…</td> <td>0.8854</td> <td>0.8828</td> <td>0.9502</td> <td>0.9231</td> <td>0.8351</td> <td>0.8161</td> <td>0.9938</td> <td>0.9043</td> <td>0.8846</td> <td>0.9613</td> <td>0.8978</td> <td>0.7634</td> <td>0.8548</td> <td>0.8095</td>
</tr>
<tr>
<td>GPT-Image</td> <td>❌</td> <td>0.8952</td> <td>0.8929</td> <td>0.9448</td> <td>0.9289</td> <td>0.8655</td> <td>0.8445</td> <td>0.9494</td> <td>0.9283</td> <td>0.8800</td> <td>0.9432</td> <td>0.9017</td> <td>0.7253</td> <td>0.8582</td> <td>0.7143</td>
</tr>
<tr>
<td><strong>HunyuanImage 2.1</strong></td> <td>βœ…</td> <td><strong>0.8888</strong></td> <td><strong>0.8832</strong></td> <td>0.9339</td> <td>0.9341</td> <td>0.8363</td> <td>0.8342</td> <td>0.9627</td> <td>0.8870</td> <td>0.9615</td> <td>0.9448</td> <td>0.9254</td> <td>0.7527</td> <td>0.8689</td> <td>0.7619</td>
</tr>
</tbody>
</table>
</p>

The SSAE results show that HunyuanImage 2.1 achieves the best semantic alignment among open-source models and comes very close to the closed-source commercial model GPT-Image.
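The two aggregate columns differ only in the order of averaging. A minimal sketch of that distinction, using made-up key-point scores:

```python
import numpy as np

# Hypothetical binary key-point scores for three generated images (1 = key point satisfied).
per_image_scores = [np.array([1, 1, 0]), np.array([1, 0, 0, 1]), np.array([1, 1])]

# Mean Image Accuracy: average within each image first, then across images.
mean_image_acc = np.mean([s.mean() for s in per_image_scores])   # ~0.7222
# Global Accuracy: pool all key points and average once; images with more key points weigh more.
global_acc = np.concatenate(per_image_scores).mean()             # ~0.6667
print(f"mean image accuracy: {mean_image_acc:.4f}, global accuracy: {global_acc:.4f}")
```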
### GSB Evaluation

<p align="center">
  <img src="./assets/gsb.png" width=70% alt="Human Evaluation with Other Models">
</p>

We adopted the GSB evaluation method commonly used to assess the relative performance of two models from an overall image-perception perspective. In total, we used 1000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we ran inference only once per prompt, avoiding any cherry-picking of results, and kept the default settings for all baseline models. The evaluation was performed by more than 100 professional evaluators.

From the results, HunyuanImage 2.1 achieved a relative win rate of -1.36% against the closed-source Seedream 3.0 and +2.89% against the open-source Qwen-Image. This demonstrates that HunyuanImage 2.1, as an open-source model, has reached an image generation quality comparable to closed-source commercial models (Seedream 3.0), while holding an advantage over comparable open-source models (Qwen-Image). This validates the technical advancement and practical value of HunyuanImage 2.1 in text-to-image generation.
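A note on the numbers: GSB tallies per-prompt preferences as Good / Same / Bad for one model against another, and the relative win rate quoted above is conventionally computed as (Good βˆ’ Bad) / Total. That convention is our assumption here; a sketch with hypothetical tallies:

```python
def relative_win_rate(good: int, same: int, bad: int) -> float:
    """Relative win rate in percent: (Good - Bad) / total votes."""
    return 100.0 * (good - bad) / (good + same + bad)

# Hypothetical tallies over 1000 prompts, chosen only to illustrate the formula.
print(f"{relative_win_rate(300, 429, 271):+.2f}%")  # +2.90%
```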
## πŸ“œ System Requirements

**Hardware and OS Requirements:**
- NVIDIA GPU with CUDA support.
- **Minimum:** 59 GB GPU memory for 2048x2048 image generation (batch size = 1).
- Supported operating system: Linux.

> **Note:** The memory requirements above are measured with model CPU offloading enabled. If your GPU has sufficient memory, you may disable offloading for improved inference speed.
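A quick way to check whether your GPU clears the 59 GB bar before launching a job (PyTorch is already a dependency):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB total memory")
    if total_gb < 59:
        print("Below the 59 GB minimum for 2048x2048 generation; keep CPU offloading enabled.")
else:
    print("No CUDA device detected.")
```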
## πŸ› οΈ Dependencies and Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/Tencent-Hunyuan/HunyuanImage-2.1.git
   cd HunyuanImage-2.1
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   pip install flash-attn==2.7.3 --no-build-isolation
   ```
## 🧱 Download Pretrained Models

Details on downloading the pretrained models are provided [here](ckpts/checkpoints-download.md).
## πŸ”‘ Usage

```python
import torch
from hyimage.diffusion.pipelines.hunyuanimage_pipeline import HunyuanImagePipeline

# Supported model_name values: hunyuanimage-v2.1, hunyuanimage-v2.1-distilled
model_name = "hunyuanimage-v2.1-distilled"
pipe = HunyuanImagePipeline.from_pretrained(model_name=model_name, torch_dtype='bf16')
pipe = pipe.to("cuda")

prompt = "A cute, cartoon-style anthropomorphic penguin plush toy with fluffy fur, standing in a painting studio, wearing a red knitted scarf and a red beret with the word β€œTencent” on it, holding a paintbrush with a focused expression as it paints an oil painting of the Mona Lisa, rendered in a photorealistic photographic style."
image = pipe(
    prompt=prompt,
    width=2048,
    height=2048,
    use_reprompt=True,  # Enable prompt enhancement (rewriting)
    use_refiner=True,   # Enable the refiner for better quality
    # Use 8 steps for the distilled model (faster inference)
    # and 50 steps for the non-distilled model (better quality).
    num_inference_steps=8 if "distilled" in model_name else 50,
    guidance_scale=3.25,
    shift=4,
    seed=649151,
)

image.save("generated_image.png")
```
## πŸ”— BibTeX

If you find this project useful for your research and applications, please cite as:

```BibTeX
@misc{HunyuanImage-2.1,
  title={HunyuanImage 2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation},
  author={Tencent Hunyuan Team},
  year={2025},
  howpublished={\url{https://github.com/Tencent-Hunyuan/HunyuanImage-2.1}},
}
```
## Acknowledgements

We would like to thank the following open-source projects and communities for their contributions to open research and exploration: [Qwen](https://huggingface.co/Qwen), [FLUX](https://github.com/black-forest-labs/flux), [diffusers](https://github.com/huggingface/diffusers), and [HuggingFace](https://huggingface.co).

## Github Star History
<a href="https://star-history.com/#Tencent-Hunyuan/HunyuanImage-2.1&Date">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=Tencent-Hunyuan/HunyuanImage-2.1&type=Date&theme=dark" />
    <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=Tencent-Hunyuan/HunyuanImage-2.1&type=Date" />
    <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=Tencent-Hunyuan/HunyuanImage-2.1&type=Date" />
  </picture>
</a>
checkpoints-download.md ADDED
# Download the Pretrained Checkpoints

First, make sure you have installed the Hugging Face CLI and the ModelScope CLI:

```bash
pip install -U "huggingface_hub[cli]"
pip install modelscope
```
### Download the Pretrained DiT and VAE Checkpoints

```bash
hf download tencent/HunyuanImage-2.1 --local-dir ./ckpts
```
### Download the Text Encoders

HunyuanImage-2.1 uses an MLLM and a byT5 model as its text encoders.

* **MLLM**

  HunyuanImage can be integrated with different MLLMs, including HunyuanMLLM and other open-source MLLM models.

  At this stage, we have not yet released the latest HunyuanMLLM. We recommend that community users use an open-source alternative, such as Qwen2.5-VL-7B-Instruct from the Qwen team, which can be downloaded with the following command:

  ```bash
  hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/text_encoder/llm
  ```
* **ByT5 encoder**

  We use [Glyph-SDXL-v2](https://modelscope.cn/models/AI-ModelScope/Glyph-SDXL-v2) as our [byT5](https://github.com/google-research/byt5) encoder, which can be downloaded with the following commands:

  ```bash
  hf download google/byt5-small --local-dir ./ckpts/text_encoder/byt5-small
  modelscope download --model AI-ModelScope/Glyph-SDXL-v2 --local_dir ./ckpts/text_encoder/Glyph-SDXL-v2
  ```

  You can also manually download the checkpoints from [here](https://modelscope.cn/models/AI-ModelScope/Glyph-SDXL-v2/files) and place them in the text_encoder folder like so:

  ```
  ckpts
  β”œβ”€β”€ text_encoder
  β”‚   β”œβ”€β”€ Glyph-SDXL-v2
  β”‚   β”‚   β”œβ”€β”€ assets
  β”‚   β”‚   β”‚   β”œβ”€β”€ color_idx.json
  β”‚   β”‚   β”‚   β”œβ”€β”€ multilingual_10-lang_idx.json
  β”‚   β”‚   β”‚   └── ...
  β”‚   β”‚   └── checkpoints
  β”‚   β”‚       β”œβ”€β”€ byt5_model.pt
  β”‚   β”‚       └── ...
  β”‚   └── ...
  └── ...
  ```
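If you prefer Python over the CLI, the Hugging Face downloads above can also be scripted with `huggingface_hub.snapshot_download` (the Glyph-SDXL-v2 weights come from ModelScope, so keep using the `modelscope` command above for those); a sketch:

```python
from huggingface_hub import snapshot_download

# DiT and VAE weights.
snapshot_download(repo_id="tencent/HunyuanImage-2.1", local_dir="./ckpts")
# Open-source MLLM text encoder.
snapshot_download(repo_id="Qwen/Qwen2.5-VL-7B-Instruct", local_dir="./ckpts/text_encoder/llm")
# Base byT5 weights.
snapshot_download(repo_id="google/byt5-small", local_dir="./ckpts/text_encoder/byt5-small")
```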
<details>

<summary>πŸ’‘ Tips for using hf/huggingface-cli (network issues)</summary>

##### 1. Using HF-Mirror

If you encounter slow download speeds in China, you can try a mirror to speed up the download process:

```shell
HF_ENDPOINT=https://hf-mirror.com hf download tencent/HunyuanImage-2.1 --local-dir ./ckpts
```

##### 2. Resume Download

The `hf`/`huggingface-cli` downloader supports resuming. If the download is interrupted, simply rerun the download command to resume.

Note: If an error like `No such file or directory: 'ckpts/.huggingface/.gitignore.lock'` occurs during the download, you can ignore it and rerun the download command.

</details>