PixArt-α

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis is Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li.

The abstract from the paper is:

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α’s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5’s training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

You can find the original codebase at PixArt-alpha/PixArt-alpha and all the available checkpoints at PixArt-alpha.

Some notes about this pipeline:

It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as DiT.
It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found here.
It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

Inference with under 8GB GPU VRAM

Run the PixArtAlphaPipeline with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let’s walk through a full-fledged example.

First, install the bitsandbytes library:

pip install -U bitsandbytes

Then load the text encoder in 8-bit:

from transformers import T5EncoderModel
from diffusers import PixArtAlphaPipeline
import torch

text_encoder = T5EncoderModel.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    subfolder="text_encoder",
    load_in_8bit=True,
    device_map="auto",

)
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    text_encoder=text_encoder,
    transformer=None,
    device_map="auto"
)

Now, use the pipe to encode a prompt:

with torch.no_grad():
    prompt = "cute cat"
    prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt)

Since text embeddings have been computed, remove the text_encoder and pipe from the memory, and free up some GPU VRAM:

import gc

def flush():
    gc.collect()
    torch.cuda.empty_cache()

del text_encoder
del pipe
flush()

Then compute the latents with the prompt embeddings as inputs:

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    text_encoder=None,
    torch_dtype=torch.float16,
).to("cuda")

latents = pipe(
    negative_prompt=None,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
    num_images_per_prompt=1,
    output_type="latent",
).images

del pipe.transformer
flush()

Notice that while initializing pipe, you’re setting text_encoder to None so that it’s not loaded.

Once the latents are computed, pass it off to the VAE to decode into a real image:

with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
image = pipe.image_processor.postprocess(image, output_type="pil")[0]
image.save("cat.png")

By deleting components you aren’t using and flushing the GPU VRAM, you should be able to run PixArtAlphaPipeline with under 8GB GPU VRAM.

If you want a report of your memory-usage, run this script.

Text embeddings computed in 8-bit can impact the quality of the generated images because of the information loss in the representation space caused by the reduced precision. It’s recommended to compare the outputs with and without 8-bit.

While loading the text_encoder, you set load_in_8bit to True. You could also specify load_in_4bit to bring your memory requirements down even further to under 7GB.

PixArtAlphaPipeline

class diffusers.PixArtAlphaPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel vae: AutoencoderKL transformer: PixArtTransformer2DModel scheduler: DPMSolverMultistepScheduler )

Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder (T5EncoderModel) — Frozen text-encoder. PixArt-Alpha uses T5, specifically the t5-v1_1-xxl variant.
tokenizer (T5Tokenizer) — Tokenizer of class T5Tokenizer.
transformer (PixArtTransformer2DModel) — A text conditioned PixArtTransformer2DModel to denoise the encoded image latents. Initially published as Transformer2DModel in the config, but the mismatch can be ignored.
scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded image latents.

Pipeline for text-to-image generation using PixArt-Alpha.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

call

< source >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: str = '' num_inference_steps: int = 20 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 4.5 num_images_per_prompt: typing.Optional[int] = 1 height: typing.Optional[int] = None width: typing.Optional[int] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 clean_caption: bool = True use_resolution_binning: bool = True max_sequence_length: int = 120 **kwargs ) → ImagePipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds. instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
guidance_scale (float, optional, defaults to 4.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2. of Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. Higher guidance scale encourages to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
height (int, optional, defaults to self.unet.config.sample_size) — The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size) — The width in pixels of the generated image.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to schedulers.DDIMScheduler, will be ignored for others.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For PixArt-Alpha this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
output_type (str, optional, defaults to "pil") — The output format of the generate image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion.IFPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
use_resolution_binning (bool defaults to True) — If set to True, the requested height and width are first mapped to the closest resolutions using ASPECT_RATIO_1024_BIN. After the produced latents are decoded into images, they are resized back to the requested resolution. Useful for generating non-square images.
max_sequence_length (int defaults to 120) — Maximum sequence length to use with the prompt.

Returns

ImagePipelineOutput or tuple

If return_dict is True, ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import PixArtAlphaPipeline

>>> # You can replace the checkpoint id with "PixArt-alpha/PixArt-XL-2-512x512" too.
>>> pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)
>>> # Enable memory optimizations.
>>> pipe.enable_model_cpu_offload()

>>> prompt = "A small cactus with a happy face in the Sahara desert."
>>> image = pipe(prompt).images[0]

encode_prompt

< source >

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: str = '' num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None clean_caption: bool = False max_sequence_length: int = 120 **kwargs )

Parameters

prompt (str or List[str], optional) — prompt to be encoded
negative_prompt (str or List[str], optional) — The prompt not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For PixArt-Alpha, this should be "".
do_classifier_free_guidance (bool, optional, defaults to True) — whether to use classifier free guidance or not
num_images_per_prompt (int, optional, defaults to 1) — number of images that should be generated per prompt
device — (torch.device, optional): torch device to place the resulting embeddings on
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For PixArt-Alpha, it’s should be the embeddings of the "" string.
clean_caption (bool, defaults to False) — If True, the function will preprocess and clean the provided caption before encoding.
max_sequence_length (int, defaults to 120) — Maximum sequence length to use for the prompt.

Encodes the prompt into text encoder hidden states.

Update on GitHub

Diffusers

PixArt-α

Inference with under 8GB GPU VRAM

PixArtAlphaPipeline

class diffusers.PixArtAlphaPipeline

__call__

encode_prompt

call