PixArt
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis is by Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li.
The abstract from the paper is:
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
You can find the original codebase at PixArt-alpha/PixArt-alpha and all the available checkpoints at PixArt-alpha.
Some notes about this pipeline:
- It uses a Transformer backbone (instead of a UNet) for denoising, so its architecture is similar to DiT.
- It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
- It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found here.
- It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them.
PixArtAlphaPipeline
class diffusers.PixArtAlphaPipeline
( tokenizer: T5Tokenizer, text_encoder: T5EncoderModel, vae: AutoencoderKL, transformer: Transformer2DModel, scheduler: DPMSolverMultistepScheduler )
Parameters
- vae (AutoencoderKL): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
- text_encoder (T5EncoderModel): Frozen text encoder. PixArt-Alpha uses T5, specifically the t5-v1_1-xxl variant.
- tokenizer (T5Tokenizer): Tokenizer of class T5Tokenizer.
- transformer (Transformer2DModel): A text-conditioned Transformer2DModel to denoise the encoded image latents.
- scheduler (SchedulerMixin): A scheduler to be used in combination with transformer to denoise the encoded image latents.
Pipeline for text-to-image generation using PixArt-Alpha.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
__call__
( prompt: typing.Union[str, typing.List[str]] = None, negative_prompt: str = '', num_inference_steps: int = 20, timesteps: typing.List[int] = None, guidance_scale: float = 4.5, num_images_per_prompt: typing.Optional[int] = 1, height: typing.Optional[int] = None, width: typing.Optional[int] = None, eta: float = 0.0, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None, callback_steps: int = 1, clean_caption: bool = True, mask_feature: bool = True ) → ImagePipelineOutput or tuple
Parameters
- prompt (str or List[str], optional): The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
- negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- num_inference_steps (int, optional, defaults to 20): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- timesteps (List[int], optional): Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order.
- guidance_scale (float, optional, defaults to 4.5): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
- num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
- height (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image.
- width (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image.
- eta (float, optional, defaults to 0.0): Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
- generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
- latents (torch.FloatTensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
- prompt_embeds (torch.FloatTensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional): Pre-generated negative text embeddings. For PixArt-Alpha this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL (PIL.Image.Image) and np.array.
- return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.
- callback (Callable, optional): A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
- callback_steps (int, optional, defaults to 1): The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
- clean_caption (bool, optional, defaults to True): Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
- mask_feature (bool, defaults to True): If set to True, the text embeddings will be masked.
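As a rough illustration of what guidance_scale controls, the sketch below applies the usual classifier-free guidance combination to two toy noise predictions. It is a minimal sketch on plain Python lists, not the pipeline's implementation; the helper name apply_guidance and the toy values are assumptions made for illustration.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_text: List[float],
    guidance_scale: float,
) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions.

    Hypothetical helper: computes uncond + w * (text - uncond) element-wise,
    which is the standard classifier-free guidance combination.
    """
    if len(noise_uncond) != len(noise_text):
        raise ValueError("both predictions must have the same length")
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# Toy predictions (made-up numbers, only to show the arithmetic).
uncond = [0.0, 1.0, -2.0]
text = [1.0, 1.0, 0.0]

no_extra_guidance = apply_guidance(uncond, text, 1.0)  # same as the text prediction
default_guidance = apply_guidance(uncond, text, 4.5)   # pushed further toward the text
print(no_extra_guidance, default_guidance)
```

With a scale of 1 the combination reduces to the text-conditioned prediction, which is consistent with guidance only taking effect when guidance_scale > 1.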
Returns
ImagePipelineOutput or tuple
If return_dict is True, ImagePipelineOutput is returned; otherwise a tuple is returned where the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
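Because the return type depends on return_dict, calling code may want to handle both shapes. The following is a minimal sketch under that assumption; first_image is a hypothetical helper, and the stand-in objects below only mimic the documented shapes (an object with an images attribute, or a tuple whose first element is the image list).

```python
from types import SimpleNamespace
from typing import Any


def first_image(result: Any) -> Any:
    """Return the first generated image from either documented return shape.

    Hypothetical helper: accepts an output object exposing `.images`
    (return_dict=True) or a plain tuple (return_dict=False).
    """
    images = result[0] if isinstance(result, tuple) else result.images
    return images[0]


# Stand-ins for the two return shapes; strings take the place of real images.
as_object = SimpleNamespace(images=["img-a", "img-b"])
as_tuple = (["img-a", "img-b"],)

print(first_image(as_object), first_image(as_tuple))
```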
Examples:
>>> import torch
>>> from diffusers import PixArtAlphaPipeline
>>> # You can replace the checkpoint id with "PixArt-alpha/PixArt-XL-2-512x512" too.
>>> pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)
>>> # Enable memory optimizations.
>>> pipe.enable_model_cpu_offload()
>>> prompt = "A small cactus with a happy face in the Sahara desert."
>>> image = pipe(prompt).images[0]
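The callback and callback_steps parameters can be used to observe progress during generation. The sketch below shows a callback with the documented signature callback(step, timestep, latents) and exercises it with a hypothetical stand-in loop, so it runs without loading a model. The names make_recorder and run_fake_loop, and the exact rule used to decide when the callback fires, are assumptions for illustration rather than the pipeline's own code.

```python
from typing import Any, Callable, List, Optional, Tuple

Callback = Callable[[int, int, Any], None]


def make_recorder(log: List[Tuple[int, int]]) -> Callback:
    """Build a callback that records (step, timestep) pairs into `log`."""

    def callback(step: int, timestep: int, latents: Any) -> None:
        # `latents` would be a tensor in a real run; it is ignored here.
        log.append((step, timestep))

    return callback


def run_fake_loop(
    timesteps: List[int],
    callback: Optional[Callback] = None,
    callback_steps: int = 1,
) -> None:
    """Hypothetical stand-in for a denoising loop, used only to show call frequency."""
    latents = None  # placeholder for the latents tensor
    for i, t in enumerate(timesteps):
        # Assumed rule: fire the callback on every `callback_steps`-th iteration.
        if callback is not None and i % callback_steps == 0:
            callback(i, t, latents)


log: List[Tuple[int, int]] = []
# Timesteps are given in descending order, as the parameter description requires.
run_fake_loop([900, 600, 300, 0], make_recorder(log), callback_steps=2)
print(log)
```

With a loaded pipeline, the same callback could presumably be passed as pipe(prompt, callback=make_recorder(log), callback_steps=2).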
encode_prompt
( prompt: typing.Union[str, typing.List[str]], do_classifier_free_guidance: bool = True, negative_prompt: str = '', num_images_per_prompt: int = 1, device: typing.Optional[torch.device] = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None, clean_caption: bool = False, mask_feature: bool = True )
Parameters
- prompt (str or List[str], optional): Prompt to be encoded.
- negative_prompt (str or List[str], optional): The prompt not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For PixArt-Alpha, this should be "".
- do_classifier_free_guidance (bool, optional, defaults to True): Whether to use classifier-free guidance or not.
- num_images_per_prompt (int, optional, defaults to 1): Number of images that should be generated per prompt.
- device (torch.device, optional): Torch device to place the resulting embeddings on.
- prompt_embeds (torch.FloatTensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional): Pre-generated negative text embeddings. For PixArt-Alpha, it should be the embeddings of the "" string.
- clean_caption (bool, defaults to False): If True, the function will preprocess and clean the provided caption before encoding.
- mask_feature (bool, defaults to True): If True, the function will mask the text embeddings.
Encodes the prompt into text encoder hidden states.
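The __call__ documentation notes that caption cleaning requires beautifulsoup4 and ftfy. A caller who wants to know in advance whether cleaning is possible could check for those packages before setting clean_caption. The sketch below does this with the standard library only; can_clean_caption and resolve_clean_caption are hypothetical helpers, and this says nothing about how the library itself performs the check.

```python
from importlib.util import find_spec
from typing import Callable, Optional

# Import names of the two optional packages (beautifulsoup4 is imported as bs4).
CLEANING_MODULES = ("bs4", "ftfy")

Finder = Callable[[str], Optional[object]]


def can_clean_caption(finder: Finder = find_spec) -> bool:
    """Return True if every package needed for caption cleaning is importable.

    `finder` is injectable so the check can be exercised without touching
    the real environment.
    """
    return all(finder(name) is not None for name in CLEANING_MODULES)


def resolve_clean_caption(requested: bool, finder: Finder = find_spec) -> bool:
    """Only keep clean_caption=True when the dependencies are available."""
    return requested and can_clean_caption(finder)


print("clean_caption usable:", can_clean_caption())
```

The resolved value could then be passed as the clean_caption argument of encode_prompt or __call__.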