Papers
arxiv:2504.05741

DDT: Decoupled Diffusion Transformer

Published on Apr 8
· Submitted by wangsssssss on Apr 10
#1 Paper of the day
Authors:
,
,

Abstract

Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new \color{ddtD}ecoupled \color{ddtD}iffusion \color{ddtT}ransformer~(\color{ddtDDT}), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256times256, Our DDT-XL/2 achieves a new state-of-the-art performance of {1.31 FID}~(nearly 4times faster training convergence compared to previous diffusion transformers). For ImageNet 512times512, Our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.

Community

Paper author Paper submitter

DDT: Decoupled Diffusion Transformer

  • Decouple diffusion transformer into (heavy)encoder-(light)decoder arch.
  • 675M model achieves 1.26FID on ImageNet256 and 1.28 FID on ImageNet512.

TL;DR: Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer(DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. Our DDT-XL/2 (22En6De) achieves a new state-of-the-art performance of 1.31 FID within only 256 training epochs(nearly
faster training convergence compared to previous diffusion transformers). Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Hi Authors of DDT,

Congrats on the amazing work!
The idea for decoupling semantic information from high frequency information is very interesting!
I'd like to share a similar work from my team on visual tokenizers that also points out the semantic-spectrum coupling phenomenon, semanticist
Our work decouples the semantic learning and frequency decoding in a visual tokenizer and enables flexible visual tokenization.

Thanks again for sharing your work!

Your need to confirm your account before you can post a new comment.

Sign up or log in to comment

Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.05741 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 7