Abstract
Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher-frequency component with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling self-condition sharing between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.
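For readers who want the architectural idea in code, here is a minimal PyTorch sketch of the decoupled design, assuming a heavy condition encoder that produces a self-condition representation and a light velocity decoder guided by it. The class name `DDTSketch`, the block counts, and the additive conditioning are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the decoupled encoder-decoder design.
# All names here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class DDTSketch(nn.Module):
    def __init__(self, dim=1152, n_enc=22, n_dec=6, n_heads=16):
        super().__init__()
        # Condition encoder: extracts the low-frequency semantic
        # self-condition from the noisy latent plus timestep/class info.
        self.encoder = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_enc)
        )
        # Velocity decoder: predicts the denoising velocity, guided by
        # the self-condition (injected additively here for simplicity;
        # the paper conditions its transformer blocks differently).
        self.decoder = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_dec)
        )

    def forward(self, x_t, cond, z=None):
        # z is a cached self-condition; recompute it only when needed.
        if z is None:
            h = x_t + cond  # cond: broadcastable timestep/class embedding
            for blk in self.encoder:
                h = blk(h)
            z = h
        v = x_t
        for blk in self.decoder:
            v = blk(v + z)
        return v, z

# Usage: run the heavy encoder once, reuse z at an adjacent step.
model = DDTSketch()
x_t = torch.randn(2, 256, 1152)       # noisy latent tokens
cond = torch.randn(2, 1, 1152)        # timestep + class embedding
v, z = model(x_t, cond)               # full encoder + decoder pass
v_next, _ = model(x_t, cond, z=z)     # decoder-only pass, shared z
```

Because the self-condition `z` varies little between adjacent denoising steps, caching it and running only the light decoder is what yields the inference speedup described in the abstract.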
Community
DDT: Decoupled Diffusion Transformer
- Decouples the diffusion transformer into a (heavy) encoder and (light) decoder architecture.
- The 675M model achieves 1.26 FID on ImageNet 256×256 and 1.28 FID on ImageNet 512×512.
TL;DR: Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher-frequency component with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. Our DDT-XL/2 (22En6De) achieves a new state-of-the-art performance of 1.31 FID within only 256 training epochs (nearly 4× faster training convergence compared to previous diffusion transformers). Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling self-condition sharing between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.
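The statistical dynamic programming step can be viewed as a minimum-cost segmentation problem: given T denoising steps and a budget of K encoder evaluations, choose at which steps to recompute the self-condition so that the accumulated reuse penalty is minimal. The sketch below is a generic DP under that framing; the penalty table `cost[i][j]` and the function name `plan_sharing` are assumptions, not the paper's exact formulation.

```python
import math

def plan_sharing(cost, T, K):
    """Pick K of T denoising steps at which to recompute the encoder's
    self-condition, minimizing the total reuse penalty.

    cost[i][j] (0 <= i < j <= T): assumed penalty for reusing the
    self-condition computed at step i for all steps in [i, j).
    """
    INF = math.inf
    # dp[k][j]: minimal penalty covering steps 0..j-1 with k encoder runs.
    dp = [[INF] * (T + 1) for _ in range(K + 1)]
    parent = [[-1] * (T + 1) for _ in range(K + 1)]
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, T + 1):
            for i in range(j):  # the k-th encoder run happens at step i
                cand = dp[k - 1][i] + cost[i][j]
                if cand < dp[k][j]:
                    dp[k][j] = cand
                    parent[k][j] = i
    # Backtrack the steps at which to recompute the self-condition.
    steps, j = [], T
    for k in range(K, 0, -1):
        steps.append(parent[k][j])
        j = parent[k][j]
    return sorted(steps), dp[K][T]

# Toy example: 4 steps, budget of 2 encoder runs, penalty grows with span.
T, K = 4, 2
cost = [[float(j - i - 1) for j in range(T + 1)] for i in range(T + 1)]
print(plan_sharing(cost, T, K))  # -> ([0, 1], 2.0)
```

In practice, `cost[i][j]` would be estimated statistically offline, for example from how much the self-condition drifts between timesteps i and j; any such metric plugs into the same dynamic program.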
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation (2025)
- LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding (2025)
- OminiControl2: Efficient Conditioning for Diffusion Transformers (2025)
- FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute (2025)
- Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation (2025)
- Personalize Anything for Free with Diffusion Transformer (2025)
- Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping (2025)
Hi Authors of DDT,
Congrats on the amazing work!
The idea of decoupling semantic information from high-frequency information is very interesting!
I'd like to share a similar work from my team on visual tokenizers, semanticist, which also points out the semantic-spectrum coupling phenomenon.
Our work decouples semantic learning from frequency decoding in a visual tokenizer, enabling flexible visual tokenization.
Thanks again for sharing your work!