PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Published on Sep 30, 2023
· Featured in Daily Papers on Oct 3, 2023


The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
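The cost comparison in the abstract can be checked with a few lines of arithmetic. The snippet below is only a worked check of the quoted figures; the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract (A100 GPU days and USD).
pixart_gpu_days = 675
sd15_gpu_days = 6_250
pixart_cost_usd = 26_000
sd15_cost_usd = 320_000

# Fraction of Stable Diffusion v1.5's training time used by PIXART-alpha.
time_fraction = pixart_gpu_days / sd15_gpu_days

# Absolute savings and cost fraction implied by the two dollar amounts.
savings_usd = sd15_cost_usd - pixart_cost_usd
cost_fraction = pixart_cost_usd / sd15_cost_usd

print(f"training time fraction: {time_fraction:.1%}")
print(f"savings: ${savings_usd:,}")
print(f"training cost fraction: {cost_fraction:.1%}")
```

The time fraction matches the stated 10.8%, and the difference between the two dollar figures is $294,000, which the abstract rounds to "nearly $300,000".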


Disclaimer: 💥 AI-generated summary:


The paper introduces PIXART-α, a Transformer-based text-to-image diffusion model that achieves near state-of-the-art image generation quality while significantly reducing training costs and CO2 emissions compared to other models.

The key contributions are: 1) Training strategy decomposition into pixel dependency learning, text-image alignment, and aesthetic enhancement stages; 2) An efficient T2I Transformer architecture incorporating cross-attention and optimized normalization; 3) Using an auto-labeling pipeline with LLaVA to create a high-information-density text-image dataset.


The model is based on Diffusion Transformer (DiT) with additional cross-attention modules to inject text conditions.
Training is divided into 3 main stages:
Stage 1: Learn pixel distributions using a class-condition model pretrained on ImageNet.
Stage 2: Learn text-image alignment using high-information captions labeled by LLaVA.
Stage 3: Enhance image aesthetics using high-quality datasets.
An auto-labeling pipeline with LLaVA is used to create dense, precise captions for the SAM dataset.
Efficiency optimizations like shared normalization parameters (adaLN-single) are incorporated.
Training uses AdamW optimizer with learning rate 2e-5, batch size 64-178, on 64 V100 GPUs.
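A rough parameter count suggests why sharing normalization parameters can matter. The sketch below assumes a DiT-style design in which every block owns a linear layer that maps the conditioning embedding to six modulation vectors, and assumes adaLN-single replaces those layers with one shared layer plus a small learnable table per block. The hidden size of 1152 and depth of 28 are assumed values in the style of DiT-XL/2, and the function names are hypothetical; this is not the authors' code.

```python
def adaln_params(hidden: int, depth: int) -> int:
    """Modulation parameters when every block has its own projection."""
    # One Linear(hidden -> 6 * hidden) with bias per block (assumed design).
    per_block = hidden * 6 * hidden + 6 * hidden
    return depth * per_block


def adaln_single_params(hidden: int, depth: int) -> int:
    """Modulation parameters when one projection is shared by all blocks."""
    # A single shared Linear(hidden -> 6 * hidden) with bias ...
    shared = hidden * 6 * hidden + 6 * hidden
    # ... plus a learnable 6 x hidden table in each block (assumed design).
    per_block_tables = depth * 6 * hidden
    return shared + per_block_tables


HIDDEN, DEPTH = 1152, 28  # assumed DiT-XL/2-style configuration

per_block_total = adaln_params(HIDDEN, DEPTH)
shared_total = adaln_single_params(HIDDEN, DEPTH)

print(f"per-block adaLN:  {per_block_total:,} parameters")
print(f"adaLN-single:     {shared_total:,} parameters")
print(f"difference:       {per_block_total - shared_total:,} parameters")
```

Under these assumptions the per-block variant spends roughly 223M parameters on modulation alone, while the shared variant needs about 8M.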


Decomposing the training strategy into distinct stages (pixel, alignment, aesthetic) significantly improves efficiency.
Using auto-labeled, high-information captions is crucial for fast text-image alignment learning.
Compatibility with pretrained class-condition model weights provides a useful initialization.
Architectural optimizations like cross-attention modules and shared normalization parameters improve efficiency.
The model achieves near state-of-the-art quality at a fraction of the usual training cost: 10.8% of Stable Diffusion v1.5's training time and about 1% of RAPHAEL's training cost.
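To illustrate how cross-attention can inject text conditions into image tokens, here is a minimal sketch in plain Python. It assumes a single attention head and omits the learned query, key, and value projections as well as the residual connection a real Transformer block would have. The function names are hypothetical and this is not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, text_tokens):
    """Each image token (query) attends over all text tokens (keys/values).

    Simplification: text tokens serve as both keys and values, and no
    learned projections are applied.
    """
    dim = len(text_tokens[0])
    outputs = []
    for query in image_tokens:
        # Scaled dot-product score between this image token and each text token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the text tokens, one output vector per image token.
        attended = [
            sum(w * value[j] for w, value in zip(weights, text_tokens))
            for j in range(dim)
        ]
        outputs.append(attended)
    return outputs


image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, -0.5]]
result = cross_attention(image_tokens, text_tokens)
print(result)
```

With a single text token, every image token receives that token's vector unchanged, because the softmax over one score is always 1.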


PIXART-α achieves image generation quality competitive with state-of-the-art models while using only 10.8% of Stable Diffusion v1.5's training time and cutting CO2 emissions by 90%.

We need the code…



@yujincheng08 Hey, is there a chance of this paper's code/weights becoming public?

Hi, the code is released in
And the project page is

The model is amazing

I made a full tutorial

I also opened feature requests on Automatic1111 SD Web UI, Kohya Trainer scripts, and OneTrainer

We really need more details about how to train it

My tutorial and auto installers cover Windows and RunPod / Linux

It supports 8-bit text encoder loading and a CPU offload feature

This model is definitely better than SDXL

PIXART-α: First Open Source Rival to Midjourney - Better Than Stable Diffusion SDXL - Full Tutorial

