Abstract
Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher-frequency component with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling self-condition sharing between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.
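For readers who want the architectural idea in code, here is a minimal PyTorch sketch of the decoupled design, assuming a heavy condition encoder that produces a self-condition representation and a light velocity decoder guided by it. The class name `DDTSketch`, the block counts, and the additive conditioning are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the decoupled encoder-decoder design.
# All names here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class DDTSketch(nn.Module):
    def __init__(self, dim=1152, n_enc=22, n_dec=6, n_heads=16):
        super().__init__()
        # Condition encoder: extracts the low-frequency semantic
        # self-condition from the noisy latent plus timestep/class info.
        self.encoder = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_enc)
        )
        # Velocity decoder: predicts the denoising velocity, guided by
        # the self-condition (injected additively here for simplicity;
        # the paper conditions its transformer blocks differently).
        self.decoder = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_dec)
        )

    def forward(self, x_t, cond, z=None):
        # z is a cached self-condition; recompute it only when needed.
        if z is None:
            h = x_t + cond  # cond: broadcastable timestep/class embedding
            for blk in self.encoder:
                h = blk(h)
            z = h
        v = x_t
        for blk in self.decoder:
            v = blk(v + z)
        return v, z

# Usage: run the heavy encoder once, reuse z at an adjacent step.
model = DDTSketch()
x_t = torch.randn(2, 256, 1152)       # noisy latent tokens
cond = torch.randn(2, 1, 1152)        # timestep + class embedding
v, z = model(x_t, cond)               # full encoder + decoder pass
v_next, _ = model(x_t, cond, z=z)     # decoder-only pass, shared z
```

Because the self-condition `z` varies little between adjacent denoising steps, caching it and running only the light decoder is what yields the inference speedup described in the abstract.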
Community
DDT: Decoupled Diffusion Transformer
- Decouples the diffusion transformer into a (heavy) encoder and (light) decoder architecture.
- The 675M model achieves 1.26 FID on ImageNet 256×256 and 1.28 FID on ImageNet 512×512.
TL;DR: Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher-frequency component with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. Our DDT-XL/2 (22En6De) achieves a new state-of-the-art performance of 1.31 FID within only 256 training epochs (nearly 4× faster training convergence compared to previous diffusion transformers). Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling self-condition sharing between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.
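The statistical dynamic programming step can be viewed as a minimum-cost segmentation problem: given T denoising steps and a budget of K encoder evaluations, choose at which steps to recompute the self-condition so that the accumulated reuse penalty is minimal. The sketch below is a generic DP under that framing; the penalty table `cost[i][j]` and the function name `plan_sharing` are assumptions, not the paper's exact formulation.

```python
import math

def plan_sharing(cost, T, K):
    """Pick K of T denoising steps at which to recompute the encoder's
    self-condition, minimizing the total reuse penalty.

    cost[i][j] (0 <= i < j <= T): assumed penalty for reusing the
    self-condition computed at step i for all steps in [i, j).
    """
    INF = math.inf
    # dp[k][j]: minimal penalty covering steps 0..j-1 with k encoder runs.
    dp = [[INF] * (T + 1) for _ in range(K + 1)]
    parent = [[-1] * (T + 1) for _ in range(K + 1)]
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, T + 1):
            for i in range(j):  # the k-th encoder run happens at step i
                cand = dp[k - 1][i] + cost[i][j]
                if cand < dp[k][j]:
                    dp[k][j] = cand
                    parent[k][j] = i
    # Backtrack the steps at which to recompute the self-condition.
    steps, j = [], T
    for k in range(K, 0, -1):
        steps.append(parent[k][j])
        j = parent[k][j]
    return sorted(steps), dp[K][T]

# Toy example: 4 steps, budget of 2 encoder runs, penalty grows with span.
T, K = 4, 2
cost = [[float(j - i - 1) for j in range(T + 1)] for i in range(T + 1)]
print(plan_sharing(cost, T, K))  # -> ([0, 1], 2.0)
```

In practice, `cost[i][j]` would be estimated statistically offline, for example from how much the self-condition drifts between timesteps i and j; any such metric plugs into the same dynamic program.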
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation (2025)
- LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding (2025)
- OminiControl2: Efficient Conditioning for Diffusion Transformers (2025)
- FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute (2025)
- Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation (2025)
- Personalize Anything for Free with Diffusion Transformer (2025)
- Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping (2025)
Hi Authors of DDT,
Congrats on the amazing work!
The idea of decoupling semantic information from high-frequency information is very interesting!
I'd like to share a similar work from my team on visual tokenizers, semanticist, which also points out the semantic-spectrum coupling phenomenon.
Our work decouples semantic learning from frequency decoding in a visual tokenizer, enabling flexible visual tokenization.
Thanks again for sharing your work!