Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
Abstract
This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first adapt them to T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that the self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart that enables ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.
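To make the guidance-disabling idea concrete, here is a minimal, hypothetical sketch (not the authors' released code) of a scale-wise sampling loop in which classifier-free guidance is applied only at the coarse scales and skipped at the high-resolution ones. All names (`sample_scalewise`, `cfg_cutoff`, the model call signature) are illustrative assumptions rather than the paper's actual interface.

```python
# Hypothetical sketch of scale-wise sampling with classifier-free guidance (CFG)
# applied only at coarse scales; not the authors' implementation.
import torch

def sample_scalewise(model, cond, uncond, scales, cfg_scale=6.0, cfg_cutoff=0.5):
    """Generate token maps scale by scale, skipping CFG at fine (high-res) scales."""
    token_maps = []                                   # accumulated maps, coarse to fine
    for i, res in enumerate(scales):
        use_cfg = i < int(cfg_cutoff * len(scales))   # guidance only at early scales
        logits_cond = model(token_maps, cond, res)    # conditional forward pass
        if use_cfg:
            logits_uncond = model(token_maps, uncond, res)
            logits = logits_uncond + cfg_scale * (logits_cond - logits_uncond)
        else:
            logits = logits_cond                      # saves one forward pass per fine scale
        token_maps.append(torch.distributions.Categorical(logits=logits).sample())
    return token_maps

# Usage with a stand-in model that returns random logits over a 4096-entry codebook:
dummy_model = lambda maps, c, res: torch.randn(res * res, 4096)
maps = sample_scalewise(dummy_model, cond=None, uncond=None, scales=[1, 2, 4, 8, 16])
```

Since the high-resolution scales contain most of the tokens, skipping the unconditional forward pass there is presumably where the bulk of the reported ~20% acceleration comes from.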
Community
Switti: a scale-wise transformer for fast text-to-image generation that outperforms existing visual AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer (2024)
- Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient (2024)
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation (2024)
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation (2024)
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (2024)
- ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis (2024)
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis (2024)