Tencent-Hunyuan committed on
Commit
942fd1e
1 Parent(s): 787a186

Update README.md

  1. README.md +168 -85
README.md CHANGED
@@ -1,45 +1,69 @@
1
  <!-- ## **HunyuanDiT** -->
2
- <!-- [[Technical Report]()] &emsp; [[Project Page]()] &emsp; [[Model Card]()] <br>
3
 
4
- [[🤗 Demo (Realistic)]()] &emsp; -->
5
  <p align="center">
6
- <img src="./asset/logo.png" height=30>
7
  </p>
8
 
9
- <div align="center" style="font-size: 30px;font-weight: bold;">Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding</div>
10
-
11
- <!-- <div align="center">
12
- <a href="https://github.com/Tencent/HunyuanDiT"><img src="https://img.shields.io/static/v1?label=Hunyuan-DiT Code&message=Github&color=blue&logo=github-pages"></a> &ensp;
13
- <a href="https://dit.hunyuan.tencent.com"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a> &ensp;
14
- <a href="https://arxiv.org/abs/"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:HunYuan-DiT&color=red&logo=arxiv"></a> &ensp;
15
- <a href="https://arxiv.org/abs/2403.08857"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:DialogGen&color=red&logo=arxiv"></a> &ensp;
16
- <a href="https://huggingface.co/Tencent-Hunyuan/Hunyuan-DiT"><img src="https://img.shields.io/static/v1?label=Hunyuan-DiT&message=HuggingFace&color=yellow"></a> &ensp;
17
-
18
- </div> -->
19
-
20
-
21
- <!-- ## Contents
22
- * [Dependencies and Installation](#-Dependencies-and-Installation)
23
- * [Inference](#-Inference)
24
- * [Download Models](#-download-models)
25
-
26
- * [Acknowledgement](#acknowledgements)
27
- * [Citation](#bibtex) -->
28
-
29
- # **Abstract**
30
 
31
  We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully designed the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-round multi-modal dialogue with users, generating and refining images according to the context.
32
  Through our carefully designed holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.
33
 
34
 
35
- # **Hunyuan-DiT Key Features**
36
- ## **Chinese-English Bilingual DiT Architecture**
37
- We propose HunyuanDiT, a text-to-image generation model based on Diffusion transformer with fine-grained understanding of Chinese and English. In order to build Hunyuan DiT, we carefully designed the Transformer structure, text encoder and positional encoding. We also built a complete data pipeline from scratch to update and evaluate data to help model optimization iterations. To achieve fine-grained text understanding, we train a multi-modal large language model to optimize text descriptions of images. Ultimately, Hunyuan DiT is able to conduct multiple rounds of dialogue with users, generating and improving images based on context.
38
  <p align="center">
39
  <img src="./asset/framework.png" height=500>
40
  </p>
41
 
42
- ## **Multi-turn Text2Image Generation**
43
  Understanding natural language instructions and performing multi-turn interaction with users are important for a
44
  text-to-image system. It can help build a dynamic and iterative creation process that brings the user’s idea into reality
45
  step by step. In this section, we will detail how we empower Hunyuan-DiT with the ability to perform multi-round
@@ -49,89 +73,116 @@ and output the new text prompt for image generation.
49
  <img src="./asset/mllm.png" height=300>
50
  </p>
51
 
52
- ## **Comparisons**
53
  To comprehensively compare the generation capabilities of Hunyuan-DiT and other models, we constructed a test set evaluated along four dimensions: Text-Image Consistency, Excluding AI Artifacts, Subject Clarity, and Aesthetics. More than 50 professional evaluators performed the evaluation.
54
 
55
  <p align="center">
56
  <table>
57
  <thead>
58
  <tr>
59
- <th rowspan="2">Type</th> <th rowspan="2">Model</th> <th>Text-Image Consistency (%)</th> <th>Excluding AI Artifacts (%)</th> <th>Subject Clarity (%)</th> <th rowspan="2">Aesthetics (%)</th> <th rowspan="2">Overall (%)</th>
60
  </tr>
61
  </thead>
62
  <tbody>
63
  <tr>
64
- <td>SDXL</td> <td>64.3</td> <td>60.6</td> <td>91.1</td> <td>76.3</td> <td>42.7</td>
65
  </tr>
66
-
67
  <tr>
68
- <td>Playground 2.5</td> <td>71.9</td> <td>70.8</td> <td>94.9</td> <td>83.3</td> <td>54.3</td>
69
  </tr>
70
- <tr>
71
- <td>SD 3</td> <td>77.1</td> <td>69.3</td> <td>94.6</td> <td>82.5</td> <td>56.7</td>
72
  </tr>
73
- <tr style="font-weight: bold; background-color: #f2f2f2;"> <td>Hunyuan-DiT</td> <td>74.2</td> <td>74.3</td> <td>95.4</td> <td>86.6</td> <td>59.0</td> </tr>
74
 
75
  <tr>
76
- <td>MidJourney v6</td> <td>73.5</td> <td>80.2</td> <td>93.5</td> <td>87.2</td> <td>63.3</td>
 
77
  </tr>
 
 
 
78
  <tr>
79
- <td>DALL-E 3</td> <td>83.9</td> <td>80.3</td> <td>96.5</td> <td>89.4</td> <td>71.0</td>
 
 
 
80
  </tr>
81
  </table>
82
  </p>
83
 
84
- ## **Visualization**
85
 
86
  * **Chinese Elements**
87
  <p align="center">
88
- <img src="./asset/chinese elements understanding.png" height=280>
89
  </p>
90
 
91
  * **Long Text Input**
92
 
93
 
94
  <p align="center">
95
- <img src="./asset/long text understanding.png" height=900>
96
- <figcaption>Comparison between Hunyuan-DiT and other text-to-image models. The image with the highest resolution on the far left is the result of Hunyuan-Dit. The others, from top left to bottom right, are as follows: Dalle3, Midjourney v6, SD3, Playground 2.5, PixArt, SDXL, Baidu Yige, WanXiang.
97
  </p>
98
 
99
  * **Multi-turn Text2Image Generation**
100
- <p align="center">
101
- <a href="https://prc-videoframe-pub-1258344703.cos.ap-guangzhou.myqcloud.com/ad_creative_engine/projectpage/1deab38689342431e63606e01e16961c.mov">
102
- <img src="./asset/cover.png" alt="Watch the video" height="800">
103
- </a>
104
- </p>
105
 
106
- # **Dependencies and Installation**
107
- Ensure your machine is equipped with a GPU having over 20GB of memory.
108
 
109
  Begin by cloning the repository:
110
  ```bash
111
  git clone https://github.com/tencent/HunyuanDiT
112
  cd HunyuanDiT
113
  ```
114
- We provide an `environment.yml` file for setting up a Conda environment.
115
 
116
-
117
- Installation instructions for Conda are available [here](https://docs.anaconda.com/free/miniconda/index.html).
118
 
119
  ```bash
120
- # Prepare conda environment
121
  conda env create -f environment.yml
122
 
123
- # Activate the environment
124
  conda activate HunyuanDiT
125
 
126
- # Install pip dependencies
127
  python -m pip install -r requirements.txt
128
 
129
- # Install flash attention v2 (for acceleration, requires CUDA 11.6 or above)
130
  python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.1.2.post3
131
  ```
132
 
133
- # **Download Models**
134
- To download the model, first install the huggingface-cli. Installation instructions are available [here](https://huggingface.co/docs/huggingface_hub/guides/cli):
 
 
 
 
 
 
135
 
136
  ```bash
137
  # Create a directory named 'ckpts' where the model will be saved, fulfilling the prerequisites for running the demo.
@@ -140,54 +191,86 @@ mkdir ckpts
140
  # The download time may vary from 10 minutes to 1 hour depending on network conditions.
141
  huggingface-cli download Tencent-Hunyuan/HunyuanDiT --local-dir ./ckpts
142
  ```
143
- <!-- For more information about the model, visit the Hugging Face repository [here](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT). -->
 
 
144
 
145
 
146
  All models will be automatically downloaded. For more information about the model, visit the Hugging Face repository [here](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT).
147
 
148
- | Model | #Params | url|
149
- |:-----------------|:--------|:--------------|
150
- |mT5 | xxB | [mT5](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/mt5)|
151
- | CLIP | xxB | [CLIP](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/clip_text_encoder)|
152
- | DialogGen | 7B | [DialogGen](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/dialoggen)|
153
- | sdxl-vae-fp16-fix | xxB | [sdxl-vae-fp16-fix](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/sdxl-vae-fp16-fix)|
154
- | Hunyuan-DiT | xxB | [Hunyuan-DiT](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/model)|
155
 
156
 
 
157
 
158
- # **Inference**
159
- ```bash
160
- # prompt-enhancement + text2image, torch mode
161
- python sample_t2i.py --prompt "渔舟唱晚"
162
 
163
- # close prompt enhancement, torch mode
164
- python sample_t2i.py --prompt "渔舟唱晚" --no-enhance
165
 
166
- # close prompt enhancement, flash attention mode
167
- python sample_t2i.py --infer-mode fa --prompt "渔舟唱晚"
168
  ```
169
- more example prompts can be found in [example_prompts.txt](example_prompts.txt)
170
 
171
- Note: 20G GPU memory is used for sampling in single GPU
 
 
 
 
 
 
172
 
 
 
173
 
174
- <!-- # **To-Do List**
 
175
 
176
- - [x] Inference code
177
- - [ ] Provide Tensorrt engine -->
 
178
 
 
179
 
 
180
181
 
182
 
183
- # **BibTeX**
184
  If you find Hunyuan-DiT useful for your research and applications, please cite using this BibTeX:
185
 
186
  ```BibTeX
187
- @inproceedings{,
188
- title={},
189
- author={},
190
- booktitle={},
191
- year={2024}
192
  }
193
  ```
 
1
  <!-- ## **HunyuanDiT** -->
 
2
 
 
3
  <p align="center">
4
+ <img src="./asset/logo.png" height=100>
5
  </p>
6
 
7
+ # Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
8
+
9
+ -----
10
+
11
+ This repo contains PyTorch model definitions, pre-trained weights and inference/sampling code for our paper exploring Hunyuan-DiT. You can find more visualizations on our [project page](https://dit.hunyuan.tencent.com/).
12
+
13
+ > [**Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding**](https://github.com/Tencent/HunyuanDiT/blob/main/asset/Hunyuan_DiT_Tech_Report_05140553.pdf) <br>
14
+ > Zhimin Li*, Jianwei Zhang*, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, JianChen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, Qinglin Lu‡
15
+ > <br>Tencent Hunyuan<br>
16
+
17
+ > [**DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation**](https://hunyuan-dialoggen.github.io/)<br>
18
+ > Minbin Huang*, Yanxin Long*, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu&#8224;, Wei Liu
19
+ > <br>Chinese University of Hong Kong, Tencent Hunyuan, Shenzhen Campus of Sun Yat-sen University<br>
20
+
21
+ ---
22
+
23
+ ## 📑 Open-source Plan
24
+
25
+ - Hunyuan-DiT (Text-to-Image Model)
26
+ - [x] Inference ✅
27
+ - [x] Checkpoints ✅
28
+ - [ ] Distillation Version (Coming soon ⏩️)
29
+ - [ ] TensorRT Version (Coming soon ⏩️)
30
+ - [ ] Training (Coming later ⏩️)
31
+ - DialogGen (Prompt Enhancement Model)
32
+ - [x] Inference ✅
33
+ - [x] Web Demo (Gradio) ✅
34
+ - [x] CLI Demo ✅
35
+
36
+ ## Contents
37
+ - [Hunyuan-DiT](#hunyuan-dit-a-powerful-multi-resolution-diffusion-transformer-with-fine-grained-chinese-understanding)
38
+ - [Abstract](#abstract)
39
+ - [🎉 Hunyuan-DiT Key Features](#hunyuan-dit-key-features)
40
+ - [Chinese-English Bilingual DiT Architecture](#chinese-english-bilingual-dit-architecture)
41
+ - [Multi-turn Text2Image Generation](#multi-turn-text2image-generation)
42
+ - [📈 Comparisons](#comparisons)
43
+ - [🎥 Visualization](#visualization)
44
+ - [📜 Requirements](#requirements)
45
+ - [🛠 Dependencies and Installation](#dependencies-and-installation)
46
+ - [🧱 Download Pretrained Models](#download-pretrained-models)
47
+ - [🔑 Inference](#inference)
48
+ - [Using Gradio](#using-gradio)
49
+ - [Using Command Line](#using-command-line)
50
+ - [More Configurations](#more-configurations)
51
+ - [🔗 BibTeX](#bibtex)
52
+
53
+ ## **Abstract**
54
 
55
  We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully designed the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-round multi-modal dialogue with users, generating and refining images according to the context.
56
  Through our carefully designed holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.
57
 
58
 
59
+ ## 🎉 **Hunyuan-DiT Key Features**
60
+ ### **Chinese-English Bilingual DiT Architecture**
61
+ Hunyuan-DiT is a diffusion model that operates in latent space, as depicted in the figure below. Following the Latent Diffusion Model, we use a pre-trained Variational Autoencoder (VAE) to compress images into a low-dimensional latent space and train a diffusion model to learn the data distribution in that space. The diffusion model is parameterized by a transformer. To encode text prompts, we combine a pre-trained bilingual (English and Chinese) CLIP encoder with a multilingual T5 encoder.
62
  <p align="center">
63
  <img src="./asset/framework.png" height=500>
64
  </p>
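As a rough illustration of the latent-space compression described above, the sketch below computes the latent tensor shape for a given image size. It assumes an SDXL-style VAE (this README lists `sdxl-vae-fp16-fix`) that shrinks each spatial side by a factor of 8 and produces 4 latent channels. Both numbers are assumptions about that VAE family rather than values stated here, and `latent_shape` is a hypothetical helper, not part of the repository.

```python
def latent_shape(height: int, width: int, factor: int = 8, channels: int = 4) -> tuple:
    """Return (channels, latent_height, latent_width) for an image of the given size.

    `factor` and `channels` are assumed values for an SDXL-style VAE.
    """
    if height % factor or width % factor:
        raise ValueError(f"image sides must be multiples of {factor}")
    return (channels, height // factor, width // factor)


# The diffusion transformer would then operate on tensors of this shape
# instead of on full-resolution RGB pixels.
print(latent_shape(1024, 1024))
print(latent_shape(1280, 768))
```

Under these assumptions a 1024x1024 image has 64 times fewer spatial positions in latent space, which is what makes training a transformer on it tractable.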
65
 
66
+ ### Multi-turn Text2Image Generation
67
  Understanding natural language instructions and performing multi-turn interaction with users are important for a
68
  text-to-image system. It can help build a dynamic and iterative creation process that brings the user’s idea into reality
69
  step by step. In this section, we will detail how we empower Hunyuan-DiT with the ability to perform multi-round
 
73
  <img src="./asset/mllm.png" height=300>
74
  </p>
75
 
76
+ ## Comparisons
77
  To comprehensively compare the generation capabilities of Hunyuan-DiT and other models, we constructed a test set evaluated along four dimensions: Text-Image Consistency, Excluding AI Artifacts, Subject Clarity, and Aesthetics. More than 50 professional evaluators performed the evaluation.
78
 
79
  <p align="center">
80
  <table>
81
  <thead>
82
  <tr>
83
+ <th rowspan="2">Model</th> <th rowspan="2">Open Source</th> <th>Text-Image Consistency (%)</th> <th>Excluding AI Artifacts (%)</th> <th>Subject Clarity (%)</th> <th rowspan="2">Aesthetics (%)</th> <th rowspan="2">Overall (%)</th>
84
  </tr>
85
  </thead>
86
  <tbody>
87
  <tr>
88
+ <td>SDXL</td> <td>&#10004;</td> <td>64.3</td> <td>60.6</td> <td>91.1</td> <td>76.3</td> <td>42.7</td>
89
  </tr>
 
90
  <tr>
91
+ <td>PixArt-α</td> <td>&#10004;</td> <td>68.3</td> <td>60.9</td> <td>93.2</td> <td>77.5</td> <td>45.5</td>
92
  </tr>
93
+ <tr>
94
+ <td>Playground 2.5</td> <td>&#10004;</td> <td>71.9</td> <td>70.8</td> <td>94.9</td> <td>83.3</td> <td>54.3</td>
95
  </tr>
 
96
 
97
  <tr>
98
+ <td>SD 3</td> <td>&#10008;</td> <td>77.1</td> <td>69.3</td> <td>94.6</td> <td>82.5</td> <td>56.7</td>
99
+
100
  </tr>
101
+ <tr style="font-weight: bold; background-color: #f2f2f2;">
102
+ <td>Hunyuan-DiT</td> <td>&#10004;</td> <td>74.2</td> <td>74.3</td> <td>95.4</td> <td>86.6</td> <td>59.0</td>
103
+ </tr>
104
  <tr>
105
+ <td>MidJourney v6</td> <td>&#10008;</td> <td>73.5</td> <td>80.2</td> <td>93.5</td> <td>87.2</td> <td>63.3</td>
106
+ </tr>
107
+ <tr>
108
+ <td>DALL-E 3</td> <td>&#10008;</td> <td>83.9</td> <td>80.3</td> <td>96.5</td> <td>89.4</td> <td>71.0</td>
109
  </tr>
110
  </table>
111
  </p>
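The table can be checked against the abstract's claim about open-source models with a few lines of Python. The sketch below copies the "Open Source" and "Overall (%)" columns from the table; the function names are illustrative only.

```python
# (is_open_source, overall score in %) copied from the comparison table.
RESULTS = {
    "SDXL": (True, 42.7),
    "PixArt-α": (True, 45.5),
    "Playground 2.5": (True, 54.3),
    "SD 3": (False, 56.7),
    "Hunyuan-DiT": (True, 59.0),
    "MidJourney v6": (False, 63.3),
    "DALL-E 3": (False, 71.0),
}


def best_open_source(results):
    """Return the open-source model with the highest overall score."""
    open_scores = {name: score for name, (is_open, score) in results.items() if is_open}
    return max(open_scores, key=open_scores.get)


def overall_ranking(results):
    """Return all model names sorted from highest to lowest overall score."""
    return sorted(results, key=lambda name: results[name][1], reverse=True)


print(best_open_source(RESULTS))
print(overall_ranking(RESULTS))
```

On these numbers Hunyuan-DiT has the highest overall score among the models marked open source, while the closed-source MidJourney v6 and DALL-E 3 rank above it.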
112
 
113
+ ## 🎥 Visualization
114
 
115
  * **Chinese Elements**
116
  <p align="center">
117
+ <img src="./asset/chinese elements understanding.png" height=220>
118
  </p>
119
 
120
  * **Long Text Input**
121
 
122
 
123
  <p align="center">
124
+ <img src="./asset/long text understanding.png" height=310>
 
125
  </p>
126
 
127
  * **Multi-turn Text2Image Generation**
 
 
 
 
 
128
 
129
+ https://github.com/yestinl/MDINO/assets/27557933/084ac599-73ce-4be9-9ba9-1b69354f64f8
130
+
131
+ ---
132
+
133
+ ## 📜 Requirements
134
+
135
+ This repo consists of DialogGen (a prompt enhancement model) and Hunyuan-DiT (a text-to-image model).
136
+
137
+ The following table shows the requirements for running the models (The TensorRT version will be updated soon):
138
+
139
+ | Model | TensorRT | Batch Size | GPU Memory | GPU |
140
+ |:------------------------:|:--------:|:----------:|:----------:|:---------:|
141
+ | DialogGen + Hunyuan-DiT | ✘ | 1 | 32G | V100/A100 |
142
+ | Hunyuan-DiT | ✘ | 1 | 11G | V100/A100 |
143
+
144
+ <!-- | DialogGen + Hunyuan-DiT | ✔ | 1 | ? | A100 |
145
+ | Hunyuan-DiT | ✔ | 1 | ? | A100 | -->
146
+
147
+ * An NVIDIA GPU with CUDA support is required.
148
+ * We have tested V100 and A100 GPUs.
149
+ * **Minimum**: The minimum GPU memory required is 11GB.
150
+ * **Recommended**: We recommend using a GPU with 32GB of memory for better generation quality.
151
+ * Tested operating system: Linux
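The memory figures above suggest a simple rule for choosing launch flags. The following is a minimal sketch, assuming the 11GB and 32GB thresholds from the requirements table are the only constraint; `recommend_flags` is a hypothetical helper, and the only flag it emits is the `--no-enhance` option documented in the Inference section.

```python
MIN_T2I_GB = 11    # Hunyuan-DiT alone (from the requirements table)
MIN_FULL_GB = 32   # DialogGen + Hunyuan-DiT (from the requirements table)


def recommend_flags(gpu_memory_gb: float) -> list:
    """Suggest extra command-line flags for a GPU with the given memory in GB."""
    if gpu_memory_gb < MIN_T2I_GB:
        raise ValueError(
            f"{gpu_memory_gb}GB is below the {MIN_T2I_GB}GB minimum for Hunyuan-DiT"
        )
    if gpu_memory_gb < MIN_FULL_GB:
        # Not enough room for DialogGen, so skip prompt enhancement.
        return ["--no-enhance"]
    return []


print(recommend_flags(16))
print(recommend_flags(40))
```

Actual memory use depends on image size and batch size, so treat the thresholds as a starting point rather than a guarantee.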
152
+
153
+ ## 🛠️ Dependencies and Installation
154
 
155
  Begin by cloning the repository:
156
  ```bash
157
  git clone https://github.com/tencent/HunyuanDiT
158
  cd HunyuanDiT
159
  ```
 
160
 
161
+ We provide an `environment.yml` file for setting up a Conda environment.
162
+ Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).
163
 
164
  ```bash
165
+ # 1. Prepare conda environment
166
  conda env create -f environment.yml
167
 
168
+ # 2. Activate the environment
169
  conda activate HunyuanDiT
170
 
171
+ # 3. Install pip dependencies
172
  python -m pip install -r requirements.txt
173
 
174
+ # 4. (Optional) Install flash attention v2 for acceleration (requires CUDA 11.6 or above)
175
  python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.1.2.post3
176
  ```
177
 
178
+ ## 🧱 Download Pretrained Models
179
+ To download the model, first install the huggingface-cli. (Detailed instructions are available [here](https://huggingface.co/docs/huggingface_hub/guides/cli).)
180
+
181
+ ```bash
182
+ python -m pip install "huggingface_hub[cli]"
183
+ ```
184
+
185
+ Then download the model using the following commands:
186
 
187
  ```bash
188
  # Create a directory named 'ckpts' where the model will be saved, fulfilling the prerequisites for running the demo.
 
191
  # The download time may vary from 10 minutes to 1 hour depending on network conditions.
192
  huggingface-cli download Tencent-Hunyuan/HunyuanDiT --local-dir ./ckpts
193
  ```
194
+ Note: If an error like `No such file or directory: 'ckpts/.huggingface/.gitignore.lock'` occurs during the download, you can ignore it and retry by running `huggingface-cli download Tencent-Hunyuan/HunyuanDiT --local-dir ./ckpts` again.
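The retry advice can be automated. Below is a minimal sketch that reruns the documented download command until it exits successfully. The attempt count and delay are arbitrary assumptions, `download_with_retry` is a hypothetical helper, and the `run` parameter exists only so the logic can be exercised without touching the network.

```python
import subprocess
import time

# The same command shown in the bash block above.
DOWNLOAD_CMD = [
    "huggingface-cli", "download", "Tencent-Hunyuan/HunyuanDiT",
    "--local-dir", "./ckpts",
]


def download_with_retry(run=subprocess.run, attempts=3, delay_s=5.0):
    """Run the download command until it exits with code 0.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, attempts + 1):
        if run(DOWNLOAD_CMD).returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)  # brief pause before retrying
    raise RuntimeError(f"download failed after {attempts} attempts")


# Usage (performs the real download):
#   download_with_retry()
```

Rerunning the command should be inexpensive in practice, since the Hugging Face CLI is expected to skip files it has already fetched.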
195
+
196
+ For more information about the model, visit the Hugging Face repository [here](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT).
197
 
198
 
199
  All models will be automatically downloaded.
200
 
201
+ | Model | #Params | Download URL |
202
+ |:------------------:|:-------:|:-------------------------------------------------------------------------------------------------------:|
203
+ | mT5 | 1.6B | [mT5](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/mt5) |
204
+ | CLIP | 350M | [CLIP](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/clip_text_encoder) |
205
+ | DialogGen | 7.0B | [DialogGen](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/dialoggen) |
206
+ | sdxl-vae-fp16-fix | 83M | [sdxl-vae-fp16-fix](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/sdxl-vae-fp16-fix) |
207
+ | Hunyuan-DiT | 1.5B | [Hunyuan-DiT](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/model) |
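To relate the parameter counts in this table to the GPU memory requirements, the sketch below sums them and estimates the weight footprint. It assumes 2 bytes per parameter (fp16), which this README does not state, and the listed counts are rounded, so the result is only a ballpark figure that excludes activations and other runtime overhead.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "mT5": 1600,
    "CLIP": 350,
    "DialogGen": 7000,
    "sdxl-vae-fp16-fix": 83,
    "Hunyuan-DiT": 1500,
}


def weights_gb(params_m: int, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB, assuming fp16 (2 bytes per parameter)."""
    return params_m * 1e6 * bytes_per_param / 1e9


total_m = sum(PARAMS_M.values())
t2i_only_m = total_m - PARAMS_M["DialogGen"]  # text-to-image without prompt enhancement

print(f"text-to-image only: {t2i_only_m}M params, ~{weights_gb(t2i_only_m):.1f} GB")
print(f"with DialogGen:     {total_m}M params, ~{weights_gb(total_m):.1f} GB")
```

Under that assumption the text-to-image stack needs roughly 7 GB for weights and the full stack roughly 21 GB, which is at least consistent with the 11G and 32G figures in the Requirements section.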
208
 
209
 
210
+ ## 🔑 Inference
211
 
212
+ ### Using Gradio
 
 
 
213
 
214
+ Make sure you have activated the conda environment before running the following command.
 
215
 
216
+ ```shell
217
+ # By default, we start a Chinese UI.
218
+ python app/hydit_app.py
219
+
220
+ # Using Flash Attention for acceleration.
221
+ python app/hydit_app.py --infer-mode fa
222
+
223
+ # You can disable the enhancement model if the GPU memory is insufficient.
224
+ # The enhancement will be unavailable until you restart the app without the `--no-enhance` flag.
225
+ python app/hydit_app.py --no-enhance
226
+
227
+ # Start with English UI
228
+ python app/hydit_app.py --lang en
229
  ```
 
230
 
231
+ ### Using Command Line
232
+
233
+ We provide three modes for a quick start:
234
+
235
+ ```bash
236
+ # Prompt Enhancement + Text-to-Image. Torch mode
237
+ python sample_t2i.py --prompt "渔舟唱晚"
238
 
239
+ # Only Text-to-Image. Torch mode
240
+ python sample_t2i.py --prompt "渔舟唱晚" --no-enhance
241
 
242
+ # Only Text-to-Image. Flash Attention mode
243
+ python sample_t2i.py --infer-mode fa --prompt "渔舟唱晚" --no-enhance
244
 
245
+ # Generate an image with other image sizes.
246
+ python sample_t2i.py --prompt "渔舟唱晚" --image-size 1280 768
247
+ ```
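When generating many images from a script, it can help to assemble these command lines programmatically. The sketch below builds the argument list for `sample_t2i.py` using only the flags shown above; `build_command` is a hypothetical helper, not something shipped in the repository.

```python
def build_command(prompt, enhance=True, infer_mode="torch", image_size=None):
    """Return an argv list for sample_t2i.py using the documented flags."""
    if infer_mode not in ("torch", "fa"):
        raise ValueError("infer_mode must be 'torch' or 'fa'")
    cmd = ["python", "sample_t2i.py", "--prompt", prompt]
    if not enhance:
        cmd.append("--no-enhance")  # skip the DialogGen prompt enhancement
    if infer_mode != "torch":
        cmd += ["--infer-mode", infer_mode]
    if image_size is not None:
        first, second = image_size  # passed through in the order given
        cmd += ["--image-size", str(first), str(second)]
    return cmd


print(build_command("渔舟唱晚"))
print(build_command("渔舟唱晚", enhance=False, infer_mode="fa", image_size=(1280, 768)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the conda environment activated.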
248
 
249
+ ### More Configurations
250
 
251
+ We list some more useful configurations for easy usage:
252
 
253
+ | Argument | Default | Description |
254
+ |:---------------:|:---------:|:---------------------------------------------------:|
255
+ | `--prompt` | None | The text prompt for image generation |
256
+ | `--image-size` | 1024 1024 | The size of the generated image |
257
+ | `--seed` | 42 | The random seed for generating images |
258
+ | `--infer-steps` | 100 | The number of steps for sampling |
259
+ | `--negative` | - | The negative prompt for image generation |
260
+ | `--infer-mode` | torch | The inference mode (torch or fa) |
261
+ | `--sampler` | ddpm | The diffusion sampler (ddpm, ddim, or dpmms) |
262
+ | `--no-enhance` | False | Disable the prompt enhancement model |
263
+ | `--model-root` | ckpts | The root directory of the model checkpoints |
264
+ | `--load-key` | ema | Load the student model or EMA model (ema or module) |
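For readers who want to see how such options fit together, here is a minimal `argparse` sketch that mirrors the arguments and defaults in the table. It is an illustration only and not the parser used by `sample_t2i.py`. In particular, treating the `-` default of `--negative` as `None` and restricting choices to the listed values are assumptions.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    p = argparse.ArgumentParser(description="Sketch of the options listed above")
    p.add_argument("--prompt", type=str, default=None)
    p.add_argument("--image-size", type=int, nargs=2, default=[1024, 1024])
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--infer-steps", type=int, default=100)
    p.add_argument("--negative", type=str, default=None)  # assumed: '-' means unset
    p.add_argument("--infer-mode", choices=["torch", "fa"], default="torch")
    p.add_argument("--sampler", choices=["ddpm", "ddim", "dpmms"], default="ddpm")
    p.add_argument("--no-enhance", action="store_true")  # defaults to False
    p.add_argument("--model-root", type=str, default="ckpts")
    p.add_argument("--load-key", choices=["ema", "module"], default="ema")
    return p


args = make_parser().parse_args(
    ["--prompt", "渔舟唱晚", "--image-size", "1280", "768", "--no-enhance"]
)
print(args.image_size, args.no_enhance, args.infer_steps, args.sampler)
```

Note that `--image-size` takes two space-separated integers, matching the `--image-size 1280 768` example in the command-line section.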
265
 
266
 
267
+ ## 🔗 BibTeX
268
  If you find Hunyuan-DiT useful for your research and applications, please cite using this BibTeX:
269
 
270
  ```BibTeX
271
+ @misc{hunyuandit,
272
+ title={Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding},
273
+ author={Zhimin Li and Jianwei Zhang and Qin Lin and Jiangfeng Xiong and Yanxin Long and Xinchi Deng and Yingfang Zhang and Xingchao Liu and Minbin Huang and Zedong Xiao and Dayou Chen and Jiajun He and Jiahao Li and Wenyue Li and Chen Zhang and Rongwei Quan and Jianxiang Lu and Jiabin Huang and Xiaoyan Yuan and Xiaoxiao Zheng and Yixuan Li and Jihong Zhang and Chao Zhang and Meng Chen and Jie Liu and Zheng Fang and Weiyan Wang and Jinbao Xue and Yangyu Tao and JianChen Zhu and Kai Liu and Sihuan Lin and Yifu Sun and Yun Li and Dongdong Wang and Zhichao Hu and Xiao Xiao and Yan Chen and Yuhong Liu and Wei Liu and Di Wang and Yong Yang and Jie Jiang and Qinglin Lu},
274
+ year={2024},
 
275
  }
276
  ```