---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- qwenimage-layered
- text-to-image
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

- **Pipeline Type:** `QwenImageLayeredAutoBlocks`
- **Description:** Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.
## Example Usage
[TODO]
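Pending an official example, below is a minimal sketch of how a modular Diffusers pipeline like this one is typically loaded and run. The repository id, input image path, device, and parameter values are placeholders, and the exact loading and call signature for this specific pipeline may differ.

```python
import torch
from PIL import Image
from diffusers import ModularPipeline

# Placeholder repo id; replace with this model's Hub id.
pipeline = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)
pipeline.load_components(torch_dtype=torch.bfloat16)
pipeline.to("cuda")

# Run the layered denoising workflow; `layers` controls how many
# image layers are extracted (see Input/Output Specification below).
images = pipeline(
    image=Image.open("input.png"),  # placeholder input image
    prompt="a cat on a windowsill",
    resolution=640,
    layers=4,
    num_inference_steps=50,
    output="images",
)
```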
## Pipeline Architecture

This modular pipeline is composed of the following blocks:
- **text_encoder** (`QwenImageLayeredTextEncoderStep`): QwenImage-Layered text encoder step that encodes the text prompt; generates a prompt from the image if none is provided.
  - **resize** (`QwenImageLayeredResizeStep`): Image resize step that resizes the image to a target area (defined by the user's `resolution` parameter) while maintaining the aspect ratio.
  - **get_image_prompt** (`QwenImageLayeredGetImagePromptStep`): Auto-caption step that generates a text prompt from the input image if none is provided.
  - **encode** (`QwenImageTextEncoderStep`): Text encoder step that generates text embeddings to guide the image generation.
- **vae_encoder** (`QwenImageLayeredVaeEncoderStep`): VAE encoder step that encodes the image inputs into their latent representations.
  - **resize** (`QwenImageLayeredResizeStep`): Image resize step that resizes the image to a target area (defined by the user's `resolution` parameter) while maintaining the aspect ratio.
  - **preprocess** (`QwenImageEditProcessImagesInputStep`): Image preprocessing step; images need to be resized first.
  - **encode** (`QwenImageVaeEncoderStep`): VAE encoder step that converts `processed_image` into latent representations `image_latents`.
  - **permute** (`QwenImageLayeredPermuteLatentsStep`): Permutes image latents from (B, C, 1, H, W) to (B, 1, C, H, W) for layered packing.
- **denoise** (`QwenImageLayeredCoreDenoiseStep`): Core denoising workflow for the QwenImage-Layered img2img task.
  - **input** (`QwenImageLayeredInputStep`): Input step that prepares the inputs for the layered denoising step.
  - **prepare_latents** (`QwenImageLayeredPrepareLatentsStep`): Prepares the initial random noise (B, layers+1, C, H, W) for the generation process.
  - **set_timesteps** (`QwenImageLayeredSetTimestepsStep`): Sets the timesteps for QwenImage-Layered with a custom `mu` calculation based on `image_latents`.
  - **prepare_rope_inputs** (`QwenImageLayeredRoPEInputsStep`): Prepares the RoPE inputs for the denoising process; should be placed after the `prepare_latents` step.
  - **denoise** (`QwenImageLayeredDenoiseStep`): Denoise step that iteratively denoises the latents.
  - **after_denoise** (`QwenImageLayeredAfterDenoiseStep`): Unpacks latents from (B, seq, C*4) to (B, C, layers+1, H, W) after denoising.
- **decode** (`QwenImageLayeredDecoderStep`): Decodes unpacked latents (B, C, layers+1, H, W) into layer images.
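The permute, pack, and unpack shape transforms listed above can be sketched in NumPy. The 2×2 spatial patchification below is an assumption inferred from the packed shape (B, seq, C*4); the actual `QwenImageLayeredPachifier` may differ in detail.

```python
import numpy as np

# Toy dimensions; the real pipeline uses the VAE's latent channels/sizes.
B, C, H, W = 1, 16, 8, 8
layers = 4
F = layers + 1  # frames: the composite image plus `layers` layers

# permute step: (B, C, 1, H, W) -> (B, 1, C, H, W)
image_latents = np.random.randn(B, C, 1, H, W).astype(np.float32)
permuted = image_latents.transpose(0, 2, 1, 3, 4)
assert permuted.shape == (B, 1, C, H, W)

# Packing sketch (assumed 2x2 patchification):
# (B, F, C, H, W) -> (B, seq, C*4) with seq = F * (H//2) * (W//2)
latents = np.random.randn(B, F, C, H, W).astype(np.float32)
packed = (
    latents.reshape(B, F, C, H // 2, 2, W // 2, 2)
    .transpose(0, 1, 3, 5, 2, 4, 6)
    .reshape(B, F * (H // 2) * (W // 2), C * 4)
)

# after_denoise unpacks back: (B, seq, C*4) -> (B, C, F, H, W)
unpacked = (
    packed.reshape(B, F, H // 2, W // 2, C, 2, 2)
    .transpose(0, 4, 1, 2, 5, 3, 6)
    .reshape(B, C, F, H, W)
)
assert unpacked.shape == (B, C, F, H, W)
# Round trip: unpacking recovers the original latents (up to the permute).
assert np.allclose(unpacked.transpose(0, 2, 1, 3, 4), latents)
```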
## Model Components

- **image_resize_processor** (`VaeImageProcessor`)
- **text_encoder** (`Qwen2_5_VLForConditionalGeneration`)
- **processor** (`Qwen2VLProcessor`)
- **tokenizer** (`Qwen2Tokenizer`): The tokenizer to use
- **guider** (`ClassifierFreeGuidance`)
- **image_processor** (`VaeImageProcessor`)
- **vae** (`AutoencoderKLQwenImage`)
- **pachifier** (`QwenImageLayeredPachifier`)
- **scheduler** (`FlowMatchEulerDiscreteScheduler`)
- **transformer** (`QwenImageTransformer2DModel`)

## Input/Output Specification
**Inputs:**

- `image` (`Image | list`): Reference image(s) for denoising. Can be a single image or a list of images.
- `resolution` (`int`, optional, defaults to `640`): The target area to resize the image to; can be 1024 or 640.
- `prompt` (`str`, optional): The prompt or prompts to guide image generation.
- `use_en_prompt` (`bool`, optional, defaults to `False`): Whether to use the English prompt template.
- `negative_prompt` (`str`, optional): The prompt or prompts not to guide the image generation.
- `max_sequence_length` (`int`, optional, defaults to `1024`): Maximum sequence length for prompt encoding.
- `generator` (`Generator`, optional): Torch generator for deterministic generation.
- `num_images_per_prompt` (`int`, optional, defaults to `1`): The number of images to generate per prompt.
- `latents` (`Tensor`, optional): Pre-generated noisy latents for image generation.
- `layers` (`int`, optional, defaults to `4`): Number of layers to extract from the image.
- `num_inference_steps` (`int`, optional, defaults to `50`): The number of denoising steps.
- `sigmas` (`list`, optional): Custom sigmas for the denoising process.
- `attention_kwargs` (`dict`, optional): Additional kwargs for attention processors.
- `**denoiser_input_fields` (optional): Conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `output_type` (`str`, optional, defaults to `pil`): Output format: `'pil'`, `'np'`, `'pt'`.
**Outputs:**

- `images` (`list`): Generated images.
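The `resolution` input defines a target *area* rather than fixed dimensions: the resize step scales the image so its area approximates `resolution * resolution` while keeping the aspect ratio. A plausible sketch of that computation follows; the rounding of each side to a multiple of 32 is an assumption for illustration, not the pipeline's exact rule.

```python
import math

def resize_to_area(width: int, height: int, resolution: int = 640, multiple: int = 32):
    """Compute output dimensions with the same aspect ratio whose area
    approximates resolution**2, with sides rounded to `multiple` (assumed)."""
    target_area = resolution * resolution
    scale = math.sqrt(target_area / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

# A 1920x1080 input scaled toward a 640x640 area keeps its 16:9 ratio.
print(resize_to_area(1920, 1080))  # -> (864, 480)
```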