Diffusers documentation

SanaPipeline

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

SanaPipeline

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.

The abstract from the paper is:

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by lawrence-cj and chenjy2003. The original codebase can be found here. The original weights can be found under hf.co/Efficient-Large-Model.

Available models:

Model Recommended dtype
Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers torch.bfloat16
Efficient-Large-Model/Sana_1600M_1024px_diffusers torch.float16
Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers torch.float16
Efficient-Large-Model/Sana_1600M_512px_diffusers torch.float16
Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers torch.float16
Efficient-Large-Model/Sana_600M_1024px_diffusers torch.float16
Efficient-Large-Model/Sana_600M_512px_diffusers torch.float16

Refer to this collection for more information.

Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in torch.bfloat16 or torch.float32 for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.

Make sure to pass the variant argument for downloaded checkpoints to use lower disk space. Set it to "fp16" for models with recommended dtype as torch.float16, and "bf16" for models with recommended dtype as torch.bfloat16. By default, torch.float32 weights are downloaded, which use twice the amount of disk storage. Additionally, torch.float32 weights can be downcasted on-the-fly by specifying the torch_dtype argument. Read about it in the docs.

SanaPipeline

class diffusers.SanaPipeline

< >

( tokenizer: AutoTokenizer text_encoder: AutoModelForCausalLM vae: AutoencoderDC transformer: SanaTransformer2DModel scheduler: DPMSolverMultistepScheduler )

Pipeline for text-to-image generation using Sana.

__call__

< >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: str = '' num_inference_steps: int = 20 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 4.5 num_images_per_prompt: typing.Optional[int] = 1 height: int = 1024 width: int = 1024 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True clean_caption: bool = True use_resolution_binning: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 300 complex_human_instruction: typing.List[str] = ["Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:", '- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.', '- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.', 'Here are examples of how to transform or refine prompts:', '- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.', '- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.', 'Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:', 'User Prompt: '] ) SanaPipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds. instead.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • num_inference_steps (int, optional, defaults to 20) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
  • sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
  • guidance_scale (float, optional, defaults to 4.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2. of Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. Higher guidance scale encourages to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • height (int, optional, defaults to self.unet.config.sample_size) — The height in pixels of the generated image.
  • width (int, optional, defaults to self.unet.config.sample_size) — The width in pixels of the generated image.
  • eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler, will be ignored for others.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will ge generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • output_type (str, optional, defaults to "pil") — The output format of the generate image. Choose between PIL: PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion.IFPipelineOutput instead of a plain tuple.
  • attention_kwargs — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
  • use_resolution_binning (bool defaults to True) — If set to True, the requested height and width are first mapped to the closest resolutions using ASPECT_RATIO_1024_BIN. After the produced latents are decoded into images, they are resized back to the requested resolution. Useful for generating non-square images.
  • callback_on_step_end (Callable, optional) — A function that calls at the end of each denoising steps during the inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • max_sequence_length (int defaults to 300) — Maximum sequence length to use with the prompt.
  • complex_human_instruction (List[str], optional) — Instructions for complex human attention: https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.

Returns

SanaPipelineOutput or tuple

If return_dict is True, SanaPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import SanaPipeline

>>> pipe = SanaPipeline.from_pretrained(
...     "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", torch_dtype=torch.float32
... )
>>> pipe.to("cuda")
>>> pipe.text_encoder.to(torch.bfloat16)
>>> pipe.transformer = pipe.transformer.to(torch.bfloat16)

>>> image = pipe(prompt='a cyberpunk cat with a neon sign that says "Sana"')[0]
>>> image[0].save("output.png")

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: str = '' num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None clean_caption: bool = False max_sequence_length: int = 300 complex_human_instruction: typing.Optional[typing.List[str]] = None lora_scale: typing.Optional[float] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For PixArt-Alpha, this should be "".
  • do_classifier_free_guidance (bool, optional, defaults to True) — whether to use classifier free guidance or not
  • num_images_per_prompt (int, optional, defaults to 1) — number of images that should be generated per prompt
  • device — (torch.device, optional): torch device to place the resulting embeddings on
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For Sana, it’s should be the embeddings of the "" string.
  • clean_caption (bool, defaults to False) — If True, the function will preprocess and clean the provided caption before encoding.
  • max_sequence_length (int, defaults to 300) — Maximum sequence length to use for the prompt.
  • complex_human_instruction (list[str], defaults to complex_human_instruction) — If complex_human_instruction is not empty, the function will use the complex Human instruction for the prompt.

Encodes the prompt into text encoder hidden states.

SanaPAGPipeline

class diffusers.SanaPAGPipeline

< >

( tokenizer: AutoTokenizer text_encoder: AutoModelForCausalLM vae: AutoencoderDC transformer: SanaTransformer2DModel scheduler: FlowMatchEulerDiscreteScheduler pag_applied_layers: typing.Union[str, typing.List[str]] = 'transformer_blocks.0' )

Pipeline for text-to-image generation using Sana. This pipeline supports the use of Perturbed Attention Guidance (PAG).

__call__

< >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: str = '' num_inference_steps: int = 20 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 4.5 num_images_per_prompt: typing.Optional[int] = 1 height: int = 1024 width: int = 1024 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True clean_caption: bool = True use_resolution_binning: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 300 complex_human_instruction: typing.List[str] = ["Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:", '- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.', '- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.', 'Here are examples of how to transform or refine prompts:', '- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.', '- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.', 'Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:', 'User Prompt: '] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) ImagePipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds. instead.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • num_inference_steps (int, optional, defaults to 20) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
  • sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
  • guidance_scale (float, optional, defaults to 4.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2. of Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. Higher guidance scale encourages to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • height (int, optional, defaults to self.unet.config.sample_size) — The height in pixels of the generated image.
  • width (int, optional, defaults to self.unet.config.sample_size) — The width in pixels of the generated image.
  • eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler, will be ignored for others.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will ge generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • output_type (str, optional, defaults to "pil") — The output format of the generate image. Choose between PIL: PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion.IFPipelineOutput instead of a plain tuple.
  • clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
  • use_resolution_binning (bool defaults to True) — If set to True, the requested height and width are first mapped to the closest resolutions using ASPECT_RATIO_1024_BIN. After the produced latents are decoded into images, they are resized back to the requested resolution. Useful for generating non-square images.
  • callback_on_step_end (Callable, optional) — A function that calls at the end of each denoising steps during the inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • max_sequence_length (int defaults to 300) — Maximum sequence length to use with the prompt.
  • complex_human_instruction (List[str], optional) — Instructions for complex human attention: https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.
  • pag_scale (float, optional, defaults to 3.0) — The scale factor for the perturbed attention guidance. If it is set to 0.0, the perturbed attention guidance will not be used.
  • pag_adaptive_scale (float, optional, defaults to 0.0) — The adaptive scale factor for the perturbed attention guidance. If it is set to 0.0, pag_scale is used.

Returns

ImagePipelineOutput or tuple

If return_dict is True, ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import SanaPAGPipeline

>>> pipe = SanaPAGPipeline.from_pretrained(
...     "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
...     pag_applied_layers=["transformer_blocks.8"],
...     torch_dtype=torch.float32,
... )
>>> pipe.to("cuda")
>>> pipe.text_encoder.to(torch.bfloat16)
>>> pipe.transformer = pipe.transformer.to(torch.bfloat16)

>>> image = pipe(prompt='a cyberpunk cat with a neon sign that says "Sana"')[0]
>>> image[0].save("output.png")

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: str = '' num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None clean_caption: bool = False max_sequence_length: int = 300 complex_human_instruction: typing.Optional[typing.List[str]] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For PixArt-Alpha, this should be "".
  • do_classifier_free_guidance (bool, optional, defaults to True) — whether to use classifier free guidance or not
  • num_images_per_prompt (int, optional, defaults to 1) — number of images that should be generated per prompt
  • device — (torch.device, optional): torch device to place the resulting embeddings on
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For Sana, it’s should be the embeddings of the "" string.
  • clean_caption (bool, defaults to False) — If True, the function will preprocess and clean the provided caption before encoding.
  • max_sequence_length (int, defaults to 300) — Maximum sequence length to use for the prompt.
  • complex_human_instruction (list[str], defaults to complex_human_instruction) — If complex_human_instruction is not empty, the function will use the complex Human instruction for the prompt.

Encodes the prompt into text encoder hidden states.

SanaPipelineOutput

class diffusers.pipelines.sana.pipeline_output.SanaPipelineOutput

< >

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

Parameters

  • images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or numpy array of shape (batch_size, height, width, num_channels). PIL images or numpy array present the denoised images of the diffusion pipeline.

Output class for Sana pipelines.

< > Update on GitHub