HF Diffusers Deconstruct Core99

Code Flow Reference: a step-by-step cheat sheet for understanding how a Stable Diffusion pipeline is built from raw diffusers components.

Every tutorial below maps 1-to-1 to a script in this repo and follows the exact execution order of the code. Read top-to-bottom and you read the pipeline.

Tutorials

| # | Script | Technique | The "hack" lives in |
|---|---|---|---|
| 01 | blended_loop.py | Blended Latent Diffusion: mask-guided regional editing | Latents (spatial blend per step) |
| 02 | concept_erasure.py | Concept Erasure: prevent the model from hallucinating banned concepts | Noise prediction (extra negative-CFG terms) |
| 03 | inversion_implemention.py | DDIM Inversion: recover the noise that (approximately) reconstructs a real image | The scheduler (running it backwards) |
| 04 | prompt_to_prompt_attention.py | Prompt-to-Prompt: edit an image by swapping cross-attention maps | The UNet (custom AttnProcessor hook) |

Shared Pipeline at a glance

__init__  →  encode_to_latent  →  build_text_embeddings  →  denoising_loop  →  decode_from_latent

| Component | Role | Class |
|---|---|---|
| VAE | Image ↔ latent (4×64×64) | AutoencoderKL |
| Tokenizer | Text → token IDs | CLIPTokenizer |
| Text Encoder | Token IDs → embeddings [B, 77, 768] | CLIPTextModel |
| UNet | Predict noise at timestep t | UNet2DConditionModel |
| Scheduler | Manage noise schedule + sampling | DPMSolverMultistepScheduler |

🔑 Two invariants used by every script in this repo:

  1. VAE scaling: multiply latents by 0.18215 after encode, divide by 0.18215 before decode.
  2. Mask/Latent grid: 512 / 8 = 64, so masks and noise live on a 64×64 grid.
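
As a quick sanity check on these two invariants, the sketch below works through the arithmetic in plain Python, with no tensors involved. The helper names (latent_grid, scale_after_encode, unscale_before_decode) are illustrative and are not functions from this repo; the downsampling factor of 8 is taken from the 512 / 8 = 64 relation above.

```python
VAE_SCALING_FACTOR = 0.18215   # constant quoted in this document for SD v1.x
VAE_DOWNSAMPLE = 8             # 512 / 8 = 64, as stated above


def latent_grid(image_size: int) -> int:
    """Side length of the latent grid for a square image (hypothetical helper)."""
    if image_size % VAE_DOWNSAMPLE != 0:
        raise ValueError("image size must be divisible by 8")
    return image_size // VAE_DOWNSAMPLE


def scale_after_encode(raw_latent: float) -> float:
    """Applied right after VAE encode."""
    return raw_latent * VAE_SCALING_FACTOR


def unscale_before_decode(scaled_latent: float) -> float:
    """Applied right before VAE decode; inverse of scale_after_encode."""
    return scaled_latent / VAE_SCALING_FACTOR


grid = latent_grid(512)
roundtrip = unscale_before_decode(scale_after_encode(3.0))
print(grid, roundtrip)
```

Forgetting either half of the scaling pair leaves the latent off by a factor of roughly 5.5 (1 / 0.18215), which is why the two operations are treated as a contract rather than a detail.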

Tutorial 01 - Blended Latent Diffusion (blended_loop.py)

Step 1 - Initialization (__init__)

Logic. We do not use StableDiffusionPipeline. Each weight set is loaded individually via from_pretrained(...) (with subfolder=... for the Stable Diffusion checkpoints), moved to the GPU, and cast to float16, which halves the memory taken by the weights and is usually noticeably faster on recent GPUs. eval() disables dropout. The scheduler chosen here is DPM-Solver++ with Karras sigmas, which typically reaches good quality in fewer steps than DDIM.

model_name = "CompVis/stable-diffusion-v1-4"
text_model_name = "openai/clip-vit-large-patch14"

self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.dtype  = torch.float16 if self.device == "cuda" else torch.float32

# Load components individually: no high-level pipeline wrapper
self.autoencoder  = diffusers.AutoencoderKL.from_pretrained(model_name, subfolder="vae")
self.text_encoder = CLIPTextModel.from_pretrained(text_model_name)
self.tokenizer    = CLIPTokenizer.from_pretrained(text_model_name)
self.unet         = diffusers.UNet2DConditionModel.from_pretrained(model_name, subfolder="unet")

self.scheduler = diffusers.DPMSolverMultistepScheduler.from_pretrained(
    model_name, subfolder="scheduler",
    algorithm_type="dpmsolver++", use_karras_sigmas=True,
)

# Cast + move + freeze
self.autoencoder.to(device=self.device, dtype=self.dtype).eval()
self.text_encoder.to(device=self.device, dtype=self.dtype).eval()
self.unet.to(device=self.device, dtype=self.dtype).eval()

🔑 Cheat sheet: Whatever pipeline you build, it is always these 5 components + their dtype/device contract.

Step 2 - Latent Encoding (encode_to_latent & preprocess_mask)

Logic. Stable Diffusion does not denoise pixels; it denoises a 4×64×64 latent. The VAE encoder shrinks 512×512×3 → 4×64×64, and outputs are scaled by the magic constant 0.18215 so that latents have roughly unit variance (this is what the UNet was trained on). The mask must be downsampled to the same 64×64 latent grid and broadcast across the 4 channels.

def encode_to_latent(self, init_image):
    preprocess = transforms.Compose([
        transforms.Resize((512, 512)),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),   # → range [-1, 1]
    ])
    input_tensor = preprocess(init_image).unsqueeze(0).to(self.device, dtype=self.dtype)

    with torch.no_grad():
        latents = self.autoencoder.encode(input_tensor).latent_dist.sample()

    return latents * 0.18215      # 🔑 VAE scaling factor: DO NOT FORGET

def preprocess_mask(self, mask_image):
    # Latent grid is 512 / 8 = 64; convert("L") guarantees a single-channel mask
    mask = mask_image.convert("L").resize((64, 64), resample=PIL.Image.NEAREST)
    mask = transforms.ToTensor()(mask).to(self.device, dtype=self.dtype)   # [1, 64, 64]
    return mask.unsqueeze(0)      # [1, 1, 64, 64], broadcasts over 4 latent channels

🔑 Cheat sheet:

  • NEAREST resampling, never BILINEAR, for masks (interpolation would leak grey edges).
  • unsqueeze(0) gives shape [1, 1, 64, 64] → broadcasts cleanly against [1, 4, 64, 64] latents.
  • * 0.18215 on encode, / 0.18215 on decode.
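
To make the "grey edges" point concrete, here is a dependency-free toy comparison on a 4×4 binary mask. The two helpers are hypothetical stand-ins: downsample_nearest picks one source pixel per output cell, and downsample_average uses a box average as a simple proxy for smooth interpolation such as BILINEAR. Neither reproduces PIL's exact sampling positions.

```python
def downsample_nearest(mask, factor):
    """Keep one source pixel per output cell (nearest-neighbour style pick)."""
    return [[mask[r * factor][c * factor] for c in range(len(mask[0]) // factor)]
            for r in range(len(mask) // factor)]


def downsample_average(mask, factor):
    """Average each factor x factor cell (stand-in for smooth interpolation)."""
    out = []
    for r in range(len(mask) // factor):
        row = []
        for c in range(len(mask[0]) // factor):
            cell = [mask[r * factor + i][c * factor + j]
                    for i in range(factor) for j in range(factor)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


# 4x4 binary mask whose edge cuts through the middle of the left-hand 2x2 cells
mask = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]

nearest = downsample_nearest(mask, 2)    # values stay in {0, 1}
averaged = downsample_average(mask, 2)   # cells on the edge become 0.5 ("grey")
print(nearest)
print(averaged)
```

A 0.5 in the mask would mix half of the edited latent with half of the background latent, which is exactly the soft seam the hard mask is meant to avoid.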

Step 3 - Text & Noise Prep (generate_noise_from_prompt)

Logic. Classifier-Free Guidance (CFG) needs two noise predictions per step: one with the prompt, one with an empty prompt. The trick is to get both from a single batched UNet call by concatenating the embeddings along the batch axis → shape [2, 77, 768].

# 1. Conditional (positive) prompt
text_inputs = self.tokenizer(prompt, padding="max_length",
                             max_length=self.tokenizer.model_max_length, return_tensors="pt")
text_embeddings = self.text_encoder(text_inputs.input_ids.to(self.device)).last_hidden_state

# 2. Unconditional (empty) prompt: drives negative guidance
uncond_inputs = self.tokenizer("", padding="max_length",
                               max_length=self.tokenizer.model_max_length, return_tensors="pt")
uncond_embeddings = self.text_encoder(uncond_inputs.input_ids.to(self.device)).last_hidden_state

# 3. Stack them → one UNet call handles both branches in parallel
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])   # [2, 77, 768]

# 4. Base Gaussian noise, same shape as the latent
init_noise = torch.randn(latent_shape, device=self.device, dtype=self.dtype)

🔑 Cheat sheet: Order matters → [uncond, cond]. You'll chunk(2) in the same order during the loop.

Step 4 - The Core Hack: Denoising Loop (blend_latent_with_mask)

Logic. At every timestep we re-noise the clean original latent up to the current t and use it as the background. The masked foreground is the latent actively being denoised. Blending the two at every step is what keeps the unmasked region faithful to the original (up to the VAE's own reconstruction error).

4a. Slice the schedule via strength

self.scheduler.set_timesteps(num_inference_steps, device=self.device)

init_timestep_idx = int(num_inference_steps * (1 - strength))
timesteps = self.scheduler.timesteps[init_timestep_idx:]

if hasattr(self.scheduler, "set_begin_index"):
    self.scheduler.set_begin_index(init_timestep_idx)
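
To see what the strength slice does to the step count, the following sketch repeats the index arithmetic from 4a in plain Python. The function name steps_after_strength is made up for illustration; only the formula int(num_inference_steps * (1 - strength)) comes from the code above.

```python
def steps_after_strength(num_inference_steps: int, strength: float):
    """Return (start_index, steps_actually_run) for the slicing rule above."""
    start_index = int(num_inference_steps * (1 - strength))
    return start_index, num_inference_steps - start_index


for strength in (1.0, 0.95, 0.5, 0.0):
    start, run = steps_after_strength(25, strength)
    print(f"strength={strength}: skip {start} steps, run {run}")
```

Note the edge case at strength = 0.0: the slice is empty, so a later timesteps[0] lookup would raise an IndexError rather than return the input unchanged.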

4b. Initialize foreground noise

# 🔑 FIX: wrap scalar timestep as a 1-D tensor or DPM++ throws IndexError
start_t = timesteps[0].item()
start_timestep_tensor = torch.tensor([start_t], device=self.device, dtype=torch.long)

latents_fg = self.scheduler.add_noise(latent_init, init_noise, start_timestep_tensor)

# Background noise vector: sampled ONCE, reused every step for trajectory consistency
fresh_bg_noise = torch.randn_like(latent_init)

4c. The loop: re-noise BG, blend, predict, CFG, step

for idx, t in enumerate(timesteps):
    t_tensor = torch.tensor([t.item()], device=self.device, dtype=torch.long)

    # (A) Re-noise the ORIGINAL clean latent up to current t → background
    latents_bg = self.scheduler.add_noise(latent_init, fresh_bg_noise, t_tensor)

    # (B) 🔑 BLENDED LATENT DIFFUSION: the entire idea in one line
    latents_fg = mask_tensor * latents_fg + (1.0 - mask_tensor) * latents_bg

    # (C) Duplicate input for CFG → batch becomes [2, 4, 64, 64]
    latent_model_input = torch.cat([latents_fg] * 2)
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

    # (D) Predict noise (uncond + cond in one forward pass)
    with torch.no_grad():
        noise_pred = self.unet(latent_model_input, t,
                               encoder_hidden_states=text_embeddings).sample

    # (E) 🔑 CFG math: extrapolate AWAY from uncond TOWARDS cond
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # (F) One step closer to t=0
    latents_fg = self.scheduler.step(noise_pred, t, latents_fg).prev_sample

# Final pin at t=0 → unmasked latents are exactly the original latents (decoded pixels are close, not bit-identical: the VAE is lossy)
latents_fg = mask_tensor * latents_fg + (1.0 - mask_tensor) * latent_init

🔑 Cheat sheet: the three equations that define this whole script:

| | Equation |
|---|---|
| Blend | latent = mask · fg + (1 − mask) · bg |
| CFG | noise = noise_uncond + scale · (noise_cond − noise_uncond) |
| Step | fg ← scheduler.step(noise, t, fg).prev_sample |
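
The blend and CFG equations can be checked by hand on tiny vectors. The sketch below uses flat Python lists instead of [B, 4, 64, 64] tensors; blend and cfg are illustrative helper names, and the numbers are arbitrary values chosen so the results are easy to verify.

```python
def blend(mask, fg, bg):
    """latent = mask * fg + (1 - mask) * bg, element-wise."""
    return [m * f + (1.0 - m) * b for m, f, b in zip(mask, fg, bg)]


def cfg(noise_uncond, noise_cond, scale):
    """noise = uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


mask = [1.0, 1.0, 0.0, 0.0]          # 1 = edit region, 0 = preserved
fg   = [0.5, -0.5, 9.0, 9.0]         # latent being denoised
bg   = [7.0, 7.0, 0.25, -0.25]       # re-noised original

blended = blend(mask, fg, bg)        # fg where mask = 1, bg where mask = 0
guided  = cfg([0.0, 1.0], [1.0, 1.0], scale=2.0)
print(blended, guided)
```

With scale = 1.0 the CFG formula collapses to the conditional prediction; values above 1.0 extrapolate past it, which is where the stronger prompt adherence comes from.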

Step 5 - Decoding (decode_from_latent)

Logic. Reverse Step 2. Undo the 0.18215 scale, run the VAE decoder, rescale [-1, 1] → [0, 1], then re-arrange tensor axes to (H, W, C) for PIL.Image.fromarray.

def decode_from_latent(self, blended_latent):
    latents = blended_latent / 0.18215      # 🔑 Undo VAE scaling

    with torch.no_grad():
        image_tensor = self.autoencoder.decode(latents).sample

    image_tensor = (image_tensor / 2 + 0.5).clamp(0, 1)         # [-1,1] → [0,1]
    image_tensor = image_tensor.cpu().permute(0, 2, 3, 1).float().numpy()
    image_numpy  = (image_tensor * 255).astype("uint8")[0]
    return PIL.Image.fromarray(image_numpy)

🔑 Cheat sheet: permute(0, 2, 3, 1) = (B, C, H, W) → (B, H, W, C) before PIL.
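
The post-processing arithmetic is easy to get subtly wrong, so here is a small pure-Python sketch of the two operations for a single image with the batch dimension dropped. to_uint8 and chw_to_hwc are hypothetical helpers operating on floats and nested lists; they only mirror what the tensor code above does.

```python
def to_uint8(value: float) -> int:
    """Map a decoder output in [-1, 1] to an 8-bit pixel value."""
    unit = min(max(value / 2 + 0.5, 0.0), 1.0)   # same as (x / 2 + 0.5).clamp(0, 1)
    return int(unit * 255)                       # truncates, as astype("uint8") does for values in [0, 255]


def chw_to_hwc(image):
    """Re-arrange a nested list from (C, H, W) to (H, W, C), like permute(1, 2, 0)."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    return [[[image[c][h][w] for c in range(channels)]
             for w in range(width)] for h in range(height)]


pixels = [to_uint8(v) for v in (-1.0, 0.0, 1.0, 3.0)]
hwc = chw_to_hwc([[[1, 2]], [[3, 4]], [[5, 6]]])   # C=3, H=1, W=2
print(pixels, hwc)
```

The clamp matters: decoder outputs can overshoot [-1, 1], and without it the uint8 cast would wrap around instead of saturating.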

Step 6 - Execution (main)

blended = BlendedLatentDiffusion()
output = blended.blended_latent_diffusion(
    init_image=PIL.Image.open("input.jpg"),
    mask_image=PIL.Image.open("mask.png"),      # white = edit region
    prompt="fluffy white clouds in a bright blue sky, highly detailed",
    num_inference_steps=25,
    strength=0.95,         # Near-full overwrite of masked area
    guidance_scale=12.0,   # Strong prompt adherence
)
output.save("output_image.jpg")

Knob-tuning cheat sheet (blended_loop)

| Parameter | Range | Effect |
|---|---|---|
| num_inference_steps | 20–50 | More = higher quality, slower. DPM++ converges fast; 25 is a sweet spot. |
| strength | 0.0–1.0 | How far back in the noise schedule we start. 1.0 = pure noise inside mask. |
| guidance_scale | 1.0–15.0 | CFG weight. 7.5 standard. Higher = more prompt-faithful, more saturated. |
| mask (white) | binary | Region that will be regenerated. Black = preserved. |

Tutorial 02 - Concept Erasure (concept_erasure.py)

The premise. Standard CFG adds one positive pull toward the prompt. Concept Erasure adds N negative pulls, one per unwanted concept, so the model actively avoids hallucinating each of them. This is how you stop a "forest road at night" from sprouting streetlights and headlights it was never asked for.

Pipeline overview (parallel to Tutorial 01; the novel logic is in Step 3 and Step 5):

__init__  →  encode_image  →  generate_noise_from_prompts (N+2 embeddings)
                                                │
                                                ▼
                            Multi-Negative-Guidance loop (1 prompt − N erasures)
                                                │
                                                ▼
                                       decode_from_latent

Step 1 - Initialization (__init__)

Same component contract as Tutorial 01: VAE, CLIP tokenizer + text encoder, UNet, DPM++ Karras scheduler, float16 + eval() on CUDA. Nothing new at this layer; the technique is implemented entirely in how we batch text and combine noise predictions.

self.autoencoder = diffusers.AutoencoderKL.from_pretrained(self.model_name, subfolder="vae")
self.tokenizer   = CLIPTokenizer.from_pretrained(self.text_model_name)
self.text_model  = CLIPTextModel.from_pretrained(self.text_model_name)
self.unet        = diffusers.UNet2DConditionModel.from_pretrained(self.model_name, subfolder="unet")
self.scheduler   = diffusers.DPMSolverMultistepScheduler.from_pretrained(
    self.model_name, subfolder="scheduler",
    algorithm_type="dpmsolver++", use_karras_sigmas=True,
)

Step 2 - Latent Encoding (encode_image)

Identical to encode_to_latent from Tutorial 01: resize → [-1, 1] normalize → VAE encode → multiply by 0.18215. No mask in this technique; the entire image is up for revision.

def encode_image(self, image):
    preprocess = transforms.Compose([
        transforms.Resize((512, 512)),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ])
    image  = preprocess(image).unsqueeze(0).to(self.device, dtype=self.dtype)
    latent = self.autoencoder.encode(image).latent_dist.sample()
    return latent * 0.18215

Step 3 - Batched Text Embeddings (generate_noise_from_prompts)

🔑 This is where the technique starts. We stack [uncond, cond, erase_1, erase_2, …, erase_N] along the batch axis so a single UNet forward pass yields all noise predictions in parallel.

text_input_embeddings = self.encode_text(text)              # [1, 77, 768]: positive prompt
uncond                = self.encode_text("")                # [1, 77, 768]: empty prompt

erasure_list  = [self.encode_text(p) for p in erasure_prompt]
erasure_input = torch.cat(erasure_list, dim=0)              # [N, 77, 768]: banned concepts

# Final batch: [uncond, cond, erase_1, ..., erase_N] → shape [2 + N, 77, 768]
text_embeddings = torch.cat([uncond, text_input_embeddings, erasure_input], dim=0)

noise = torch.randn(latent_shape, device=self.device, dtype=self.dtype)

🔑 Cheat sheet: the order [uncond, cond, *erase] is a contract. You'll chunk(2 + N) in the loop and access [0], [1], [2:] in exactly that order.

Step 4 - Setup Timesteps & Inject Noise

Same strength-based slicing idea as Tutorial 01: only the last int(num_inference_steps * strength) timesteps of the schedule are run. unsqueeze(0) makes the start timestep a 1-D tensor so add_noise doesn't trip on a 0-D scalar with DPM++.

self.scheduler.set_timesteps(num_inference_steps, device=self.device)
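init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start       = num_inference_steps - init_timestep              # explicit start index; a [-0:] tail slice would return the full schedule
timesteps     = self.scheduler.timesteps[t_start:]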

start_timestep = timesteps[0].unsqueeze(0)                       # 🔑 1-D tensor
latents        = self.scheduler.add_noise(init_latent, noise, start_timestep)
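
One slicing detail is worth knowing when keeping "the last k timesteps": a tail slice written with a negative index behaves unexpectedly when k is 0. The toy below uses a plain list as a stand-in for scheduler.timesteps; tail_slice and tail_slice_safe are illustrative names, not functions from the script.

```python
def tail_slice(timesteps, keep):
    """Tail slice as written with a negative index."""
    return timesteps[-keep:]


def tail_slice_safe(timesteps, keep):
    """Same intent, expressed with an explicit start index."""
    start = max(len(timesteps) - keep, 0)
    return timesteps[start:]


schedule = [900, 700, 500, 300, 100]   # toy descending schedule

print(tail_slice(schedule, 2), tail_slice_safe(schedule, 2))   # identical
print(tail_slice(schedule, 0), tail_slice_safe(schedule, 0))   # they differ
```

With keep = 0 the negative-index form returns the whole schedule (because -0 is 0), so a very small strength would silently run a full denoise. The explicit start index returns an empty list instead, which fails loudly at the first timesteps[0] lookup.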

Step 5 - The Core Hack: Multi-Negative Guidance Loop

Every step the UNet runs on a batched input of 2 + N copies of the same latent, each paired with a different text context. We split predictions into uncond, cond, and erase_1…N, then add the prompt direction and subtract each erasure direction.

total_chunks = 2 + len(erasure_prompt)

for t in timesteps:
    # (A) Broadcast latents to the (2+N) batch so they pair with each text context
    latent_model_input = torch.cat([latents] * total_chunks)
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

    # (B) ONE UNet call → all (2+N) noise predictions in parallel
    noise_pred = self.unet(latent_model_input, t,
                           encoder_hidden_states=text_embeddings).sample

    # (C) Split predictions in the SAME order they were stacked
    all_preds         = noise_pred.chunk(total_chunks)
    noise_pred_uncond = all_preds[0]
    noise_pred_cond   = all_preds[1]
    noise_pred_erase  = all_preds[2:]        # tuple of N tensors

    # (D) 🔑 Standard CFG: pull TOWARD the prompt
    guided_noise = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)

    # (E) 🔑 Concept Erasure: push AWAY from each banned concept
    for noise_pred_e in noise_pred_erase:
        guided_noise -= erase_scale * (noise_pred_e - noise_pred_uncond)

    # (F) Step down to the next noise level
    latents = self.scheduler.step(guided_noise, t, latents).prev_sample

🔑 Cheat sheet: the equations that define this technique:

| | Equation |
|---|---|
| Standard CFG | noise = u + s · (c − u) |
| Erasure term (per banned concept eᵢ) | noise −= wᵢ · (eᵢ − u) |
| Combined | noise = u + s·(c − u) − Σᵢ wᵢ·(eᵢ − u) |

Geometric intuition. Each (x − u) is a direction vector in noise-prediction space pointing from "neutral" to that concept. CFG adds the prompt direction; erasure subtracts each banned direction. erase_scale is the magnitude of repulsion per banned concept (the loop above uses one shared value, so every wᵢ equals erase_scale).
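
The combined equation can be verified on two-dimensional toy vectors, where each axis plays the role of one "direction" in noise-prediction space. multi_negative_guidance is an illustrative name, and the inputs are arbitrary values rather than real UNet outputs.

```python
def multi_negative_guidance(uncond, cond, erasures, guidance_scale, erase_scale):
    """noise = u + s*(c - u) - sum_i w*(e_i - u), element-wise over flat lists."""
    guided = [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]
    for erase in erasures:
        guided = [g - erase_scale * (e - u) for g, e, u in zip(guided, erase, uncond)]
    return guided


u  = [0.0, 0.0]
c  = [1.0, 0.0]            # prompt direction points along axis 0
e1 = [0.0, 1.0]            # banned concept points along axis 1
e2 = [0.0, 0.5]            # a second, weaker banned direction

noise = multi_negative_guidance(u, c, [e1, e2], guidance_scale=2.0, erase_scale=4.0)
cfg_only = multi_negative_guidance(u, c, [], guidance_scale=2.0, erase_scale=4.0)
print(noise, cfg_only)
```

Because the erasure terms add up, the total repulsion grows with the number of erasure prompts. Several overlapping prompts (for example many kinds of lights) effectively multiply erase_scale along their shared direction.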

Step 6 - Decoding (decode_from_latent)

Identical contract to Tutorial 01. Note that this version squeeze(0)s the batch dim before permute(1, 2, 0), instead of permuting and indexing [0]:

def decode_from_latent(self, latent):
    image = self.autoencoder.decode(latent / 0.18215).sample        # 🔑 Undo VAE scaling
    image = (image / 2 + 0.5).clamp(0, 1)                           # [-1,1] → [0,1]
    image = image.cpu().squeeze(0).permute(1, 2, 0).float().numpy() # (C,H,W) → (H,W,C)
    image = (image * 255).astype("uint8")
    return PIL.Image.fromarray(image)

Step 7 - Execution (main)

ce = ConceptErasure()
init_image = PIL.Image.open("scene_erasure.png").convert("RGB")

prompt         = "A road at night in the forest"
erasure_prompt = ["Streetlights", "Headlights", "Tail lights", "Lamps", "Artificial lights"]

result = ce.concept_erasure(
    init_image=init_image,
    prompt=prompt,
    erasure_prompt=erasure_prompt,
    num_inference_steps=50,
    strength=0.3,           # Light denoise: preserve overall scene
    guidance_scale=7.5,     # Normal CFG strength
    erase_scale=10.0,       # 🔑 Aggressive repulsion from banned concepts
)
result.save("output_concept_erased.jpg")

Knob-tuning cheat sheet (concept_erasure)

| Parameter | Range | Effect |
|---|---|---|
| strength | 0.0–1.0 | Re-denoise depth. 0.3 keeps structure intact while suppressing concepts. High strength may destroy scene composition. |
| guidance_scale | 1.0–15.0 | Strength of the positive prompt pull. |
| erase_scale | 1.0–15.0 | 🔑 The new knob. Strength of the negative pull per erasure prompt. Higher = stronger erasure but more artifacts. |
| erasure_prompt | list[str] | One concept per entry. Each adds +1 to the UNet batch size at every step. |

Tutorial 03 - DDIM Inversion (inversion_implemention.py)

The premise. Standard generation goes noise → image. DDIM Inversion runs the same denoising network in reverse, walking timesteps 0 → T instead of T → 0, to recover the noise that would, when denoised, approximately reproduce a given real image. That noise becomes a deterministic handle you can later re-denoise with a different prompt → the foundation of many real-image editing techniques.

Pipeline overview. Two loops, one UNet, two schedulers:

                ┌──────────────────────────────────┐
   real image ─►│  Inversion loop  (0 → T)         │ ─►  inverted noise  x_T
                │  DDIMInverseScheduler            │
                └──────────────────────────────────┘
                                                            │
                                                            ▼
                ┌──────────────────────────────────┐
                │  Sampling loop   (T → 0)         │ ─►  reconstructed image
                │  DDIMScheduler                   │
                └──────────────────────────────────┘

Step 1 - Initialization

Logic. Two schedulers, one UNet. DDIMInverseScheduler is built via .from_config() of the forward DDIMScheduler so they share the exact same α-bar schedule. This symmetry is what makes the inversion (approximately) reversible.

self.unet = UNet2DConditionModel.from_pretrained(self.model_name, subfolder="unet")
self.vae  = AutoencoderKL.from_pretrained(self.model_name, subfolder="vae")

# 1. Standard scheduler for generation (T → 0)
self.noise_scheduler   = DDIMScheduler.from_pretrained(self.model_name, subfolder="scheduler")
# 2. 🔑 Inverse scheduler for inversion math (0 → T): same config = same α-schedule
self.inverse_scheduler = DDIMInverseScheduler.from_config(self.noise_scheduler.config)

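🔑 Cheat sheet: This tutorial sticks to DDIM because its deterministic update has a matching DDIMInverseScheduler in diffusers. Stochastic or ancestral samplers inject fresh noise at every step and cannot round-trip, and multistep solvers such as DPM++ need their own inverse formulation (diffusers also ships DPMSolverMultistepInverseScheduler). Do not mix scheduler families between the two loops.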

Step 2 - Latent Encoding (get_latent_image)

Logic. Two important differences from Tutorial 01's encoder, both critical for a faithful round-trip:

init_latents = self.vae.encode(image_tensor).latent_dist.mode()        # 🔑 .mode(), not .sample()
init_latents = init_latents * self.vae.config.scaling_factor           # 🔑 from config, not hardcoded

🔑 Cheat sheet:

  • .mode() returns the deterministic mean of the VAE's posterior: no random draw, so the round-trip is reproducible. .sample() would inject noise on encode and break inversion.
  • vae.config.scaling_factor == 0.18215 for SD-v1.4, but reading from config is robust across model variants.
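
The difference between .mode() and .sample() can be illustrated with a toy diagonal Gaussian. ToyPosterior below is a hypothetical stand-in written for this example; it is not the diffusers distribution class, it only mimics the two behaviours described above.

```python
import random


class ToyPosterior:
    """Diagonal Gaussian with a fixed mean and std (illustrative only)."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def mode(self):
        # The mode of a Gaussian is its mean: no randomness involved.
        return list(self.mean)

    def sample(self, rng):
        # mean + std * eps with eps ~ N(0, 1): differs on every draw.
        return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(self.mean, self.std)]


posterior = ToyPosterior(mean=[0.2, -1.0], std=[0.1, 0.3])
rng = random.Random(0)

mode_a, mode_b = posterior.mode(), posterior.mode()
sample_a, sample_b = posterior.sample(rng), posterior.sample(rng)
print(mode_a == mode_b, sample_a == sample_b)
```

For inversion, the starting latent has to be the same every time the image is encoded; otherwise the "same" image maps to a different trajectory on each run.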

Step 3 - Text Embeddings (get_text_embeddings)

Logic. Just one embedding: no uncond, no CFG. Pure DDIM inversion is deterministic and runs a single text context. (Variants like Null-Text Inversion re-introduce CFG and optimize the uncond embedding; that is a separate technique for a separate tutorial.)

text_embeddings = self.text_encoder(**inputs).last_hidden_state        # [1, 77, 768], no uncond

Step 4 - The Forward Loop: Inversion (ddim_invers)

Logic. Walk timesteps 0 → T. At each step, predict noise with the UNet, then ask the inverse scheduler to push the latent one step further into noise.

self.inverse_scheduler.set_timesteps(num_inference_steps, device=self.device)
timesteps = self.inverse_scheduler.timesteps         # 🔑 ascending: 0 → ~999

latents = init_latents.clone()
for idx, t in enumerate(timesteps):
    noise_pred = self.unet(latents, t, encoder_hidden_states=text_embeddings).sample

    # 🔑 .step() on the INVERSE scheduler walks FORWARD in time.
    #    The field is still called .prev_sample but it's now the NEXT (noisier) state.
    latents = self.inverse_scheduler.step(noise_pred, t, latents).prev_sample

🔑 Cheat sheet (leaky abstraction watch): DDIMInverseScheduler.step().prev_sample is misnamed. For the inverse scheduler it means "next-step output". The diffusers API reuses the field name; only the direction of travel reverses.

Step 5 - The Reverse Loop: Sampling (ddim_sampling)

Logic. Identical control flow, but with the forward scheduler. Start from the inverted noise (or any noise of the right shape) and walk T → 0 to recover an image.

self.noise_scheduler.set_timesteps(num_inference_steps, device=self.device)
timesteps = self.noise_scheduler.timesteps           # descending: ~999 → 0

latents = inverted_latents.clone()
for idx, t in enumerate(timesteps):
    noise_pred = self.unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents    = self.noise_scheduler.step(noise_pred, t, latents).prev_sample

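🔑 Cheat sheet: Same UNet, same prompt, opposite scheduler. The pair (invert → sample) should reconstruct the input up to a small error: part floating-point drift, part the approximation built into inversion (each step uses the noise predicted at the current latent to move to the next, noisier one). If the reconstruction is visibly off → debug your encode (.mode()?), your scheduler (DDIM-only?), or your prompt (must match the source).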
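
Why the two loops undo each other can be seen on a single scalar. The sketch below applies the deterministic DDIM update (the eta = 0 form) in both directions, with a constant eps standing in for the UNet's noise prediction and a made-up list of α-bar values. ddim_move is an illustrative helper, not the diffusers scheduler; with a real UNet the prediction changes from step to step, which is where the residual reconstruction error comes from.

```python
import math


def ddim_move(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) between two cumulative-alpha levels."""
    x0_pred = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    return math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * eps


alphas = [0.99, 0.9, 0.6, 0.3, 0.05]   # toy alpha-bar values, clean -> noisy
eps = 0.7                               # constant stand-in for the UNet output

x_start = 1.25
x = x_start
for a_from, a_to in zip(alphas[:-1], alphas[1:]):           # inversion: towards noise
    x = ddim_move(x, eps, a_from, a_to)
x_inverted = x

reversed_alphas = alphas[::-1]
for a_from, a_to in zip(reversed_alphas[:-1], reversed_alphas[1:]):   # sampling: back to clean
    x = ddim_move(x, eps, a_from, a_to)

print(x_inverted, x)
```

The same update formula serves both directions; only the order in which the α-bar levels are visited changes, which mirrors the "same config, opposite scheduler" setup above.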

Step 6 - Decoding (vae_decoder)

Standard VAE round-trip: divide by the scaling factor (0.18215 for SD v1.4), decode, rescale, permute, return PIL.

def vae_decoder(self, latents):
    latents = latents / self.vae.config.scaling_factor   # same factor as Step 2, read from config
    image   = self.vae.decode(latents).sample
    image   = (image / 2 + 0.5).clamp(0, 1)
    image   = image.cpu().permute(0, 2, 3, 1).float().numpy()
    return PIL.Image.fromarray((image[0] * 255).astype("uint8"))

Step 7 - Execution (main)

Invert real image → optionally snapshot intermediate latents → re-sample → save.

pipeline = InversionImplementationDDIM()

# 1. Inversion: real image → noise
inverted_noise, inversion_visuals = pipeline.ddim_invers(
    num_inference_steps=50,
    init_image="Road_in_Norway.jpg",
    prompt="a photo of a road in norway",
    visual_steps=[0, 1, 2],                # capture early-step latents for debugging
)

# 2. Sampling: noise → reconstructed image
reconstructed_image, sampling_visuals = pipeline.ddim_sampling(
    num_inference_steps=50,
    inverted_latents=inverted_noise,
    prompt="a photo of a road in norway",
    visual_steps=[0, 1, 2],
)
reconstructed_image.save("reconstructed_final.jpg")

🔑 Cheat sheet: The reconstruction quality is your inversion's report card. If the round-trip image differs visibly from the input → check (1) .mode() on encode, (2) DDIM schedulers on both sides, (3) same prompt + same step count both directions.

Editing flow

Inversion alone reconstructs. To edit, change the prompt during the sampling call:

inverted, _ = pipeline.ddim_invers(num_inference_steps=50, init_image="road.jpg", prompt="a photo of a road in norway")
edited,   _ = pipeline.ddim_sampling(num_inference_steps=50, inverted_latents=inverted, prompt="a photo of a snowy road in norway")

Knob-tuning cheat sheet (inversion)

| Parameter | Range | Effect |
|---|---|---|
| num_inference_steps | 50–200 | More steps = more faithful round-trip. Per-step error is smaller but compounds across more steps, so it is a tradeoff. |
| prompt | str | Must describe the source image accurately. A mismatched prompt biases the inverted noise and degrades reconstruction. |
| visual_steps | list[int] | Indices to capture for debugging the inversion trajectory. |

Tutorial 04 - Prompt-to-Prompt Attention Injection (prompt_to_prompt_attention.py)

The premise. Cross-attention maps inside the UNet encode which spatial locations attend to which words of the prompt. If you save every cross-attention map from a source run and then overwrite them on a target run with a slightly different prompt, keeping the random seed identical, the spatial layout of the source carries over while the target prompt repaints the content within that layout.

No retraining. No extra forward passes beyond the source run itself. One tensor patched at the right place in the UNet's forward pass.

Pipeline overview. Two runs, same noise seed, two different attention processors:

seed=42  →  Source prompt  ─►  SaveCrossAttnProcessor    ─►  source image + saved_maps[]
                                                                              │
                                                                              ▼
seed=42  →  Target prompt  ─►  InjectCrossAttnProcessor(saved_maps)  ─►  target image
                                  (overrides cross-attn probs)

Step 0 - The architectural prerequisite

What is an attention processor? Every transformer block in the diffusers UNet routes its attention through a swappable AttnProcessor. The default one does standard QKV math. By writing your own and registering it via unet.set_attn_processor(...), you get a hook inside every attention computation and can either observe Q/K/V/probs or mutate them.

Cross-attention vs Self-attention. Inside __call__:

is_cross_attention = encoder_hidden_states is not None

  • Self-attention → encoder_hidden_states is None → image attending to itself, controls texture/coherence.
  • Cross-attention → encoder_hidden_states is CLIP text embeddings → image attending to text, controls what is where. This is the only one this script touches.

Step 1 - The Save Processor

Logic. Run the full standard attention computation, but right after computing attention_probs, snapshot the cross-attention probability matrix into a list.

class SaveCrossAttnProcessor:
    def __init__(self):
        self.attention_maps = []

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None, **kwargs):
        # Standard QKV math (Q from image, K/V from text for cross-attn)
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))

        is_cross_attention = encoder_hidden_states is not None
        if not is_cross_attention:
            encoder_hidden_states = hidden_states

        key   = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
        value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))

        # Attention probabilities: shape [B·heads, seq_image, seq_text] for cross-attn
        attention_probs = attn.get_attention_scores(query, key, attention_mask)

        # 🔑 The SAVE: only cross-attention, detached copy
        if is_cross_attention:
            self.attention_maps.append(attention_probs.detach().clone())

        # Standard tail: weighted sum + output projection
        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)
        return hidden_states

🔑 Cheat sheet:

  • Cross-attention probs shape: [B·heads, seq_image, seq_text] (e.g. [16, 4096, 77] for batch 2 × 8 heads, 64×64 = 4096 image tokens, 77 CLIP tokens).
  • .detach().clone() → detach from autograd; clone so the saved map owns its own storage, independent of the tensor the UNet keeps working with.
  • The order of saved maps is: outer = timestep, inner = layer-by-layer in UNet forward order. The inject processor consumes them in exactly the same order.
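
For readers who want to see what one of these probability matrices contains, here is a tiny pure-Python version of softmax(Q·Kᵀ / √d) with three "image" tokens and two "text" tokens. attention_probs is an illustrative function written for this example; the real computation lives in attn.get_attention_scores and runs per head on tensors.

```python
import math


def attention_probs(queries, keys):
    """softmax(Q K^T / sqrt(d)) row by row; returns [len(queries)][len(keys)]."""
    dim = len(queries[0])
    probs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)                              # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


image_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 image tokens (rows)
text_keys     = [[2.0, 0.0], [0.0, 2.0]]               # 2 text tokens (columns)

probs = attention_probs(image_queries, text_keys)
print(probs)
```

Each row is one spatial location and sums to 1 across the text tokens; each column, read down the rows, is the spatial map for one word. Overwriting this matrix therefore fixes where each word acts without fixing what the word is.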

Step 2 - The Inject Processor

Logic. Identical QKV computation, but right before the weighted sum, replace the freshly computed attention_probs with the corresponding saved map.

class InjectCrossAttnProcessor:
    def __init__(self, saved_maps, injection_ratio=0.8):
        self.saved_maps      = saved_maps
        self.injection_ratio = injection_ratio
        self.step            = 0

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None, **kwargs):
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))

        is_cross_attention = encoder_hidden_states is not None
        if not is_cross_attention:
            encoder_hidden_states = hidden_states

        key   = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
        value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))

        attention_probs = attn.get_attention_scores(query, key, attention_mask)

        # 🔑 The OVERRIDE: swap target's probs for the source's saved probs
        if is_cross_attention:
            if self.step < len(self.saved_maps):
                attention_probs = self.saved_maps[self.step]
            self.step += 1

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)
        return hidden_states

🔑 Note on injection_ratio. In the current implementation, the field is stored but unused: every call for which a saved map exists gets overridden. Note also that self.step counts cross-attention calls (timesteps × cross-attention layers), not timesteps. To get the canonical P2P behavior (inject only in the first N% of steps so the model can refine texture freely at the end), change the condition to:

total = len(self.saved_maps)
if self.step < int(self.injection_ratio * total):
    attention_probs = self.saved_maps[self.step]
self.step += 1

Early-step attention controls layout; late-step attention refines texture. Limiting injection to early steps preserves the source's geometry while letting the target prompt repaint detail.
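
Because the counter advances once per cross-attention call, it helps to confirm that a cutoff computed on the flat list of saved maps really lands on a timestep boundary. The simulation below reproduces only the counter logic of the patched condition above; InjectionCounter and the sizes (10 timesteps, 4 layers) are toy assumptions, not the real UNet layer count.

```python
class InjectionCounter:
    """Mimics only the counter logic of the patched inject processor."""

    def __init__(self, num_saved_maps, injection_ratio):
        self.cutoff = int(injection_ratio * num_saved_maps)
        self.step = 0

    def on_cross_attention_call(self):
        """Return True if this call would use the saved map."""
        inject = self.step < self.cutoff
        self.step += 1
        return inject


num_timesteps, layers_per_step = 10, 4          # toy sizes, not SD's real layer count
counter = InjectionCounter(num_timesteps * layers_per_step, injection_ratio=0.5)

injected_per_timestep = []
for _ in range(num_timesteps):
    hits = sum(counter.on_cross_attention_call() for _ in range(layers_per_step))
    injected_per_timestep.append(hits)
print(injected_per_timestep)
```

Since the maps are stored timestep-major, a ratio on the flat list maps onto the same ratio of timesteps. When the cutoff does not fall on a multiple of the layer count, one timestep ends up only partially injected, which is usually harmless but worth knowing when debugging.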

Step 3 - Execution (main)

Logic. Two pipeline calls, same locked seed. Between them, swap the processor on the UNet.

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)

prompt_source = "A driving dashcam view of a sunny road in Norway"
prompt_target = "A driving dashcam view of a snowy road in Norway"

# ─── Source run ─────────────────────────────────────────
generator      = torch.manual_seed(42)                       # 🔑 LOCK SEED
save_processor = SaveCrossAttnProcessor()
pipe.unet.set_attn_processor(save_processor)                 # 🔑 inject the SAVE hook
source_image   = pipe(prompt_source, generator=generator, num_inference_steps=50).images[0]

# ─── Target run ─────────────────────────────────────────
generator        = torch.manual_seed(42)                     # 🔑 SAME SEED → same initial noise
inject_processor = InjectCrossAttnProcessor(saved_maps=save_processor.attention_maps)
pipe.unet.set_attn_processor(inject_processor)               # 🔑 swap to INJECT hook
target_image     = pipe(prompt_target, generator=generator, num_inference_steps=50).images[0]

🔑 Why the seed lock matters. P2P relies on the initial latent noise being identical between runs. Different noise → different geometry from step 1 → the saved cross-attention maps no longer correspond to anything in the target run's spatial layout.

🔑 Why the target prompt should be a minimal edit. P2P only carries over spatial layout. If prompt_target is structurally very different from prompt_source (changing nouns, verbs, and composition at once), the injected maps will fight the target prompt and you'll get artifacts. Word-level swaps and adjective changes → clean results.

Knob-tuning cheat sheet (prompt-to-prompt)

| Parameter | Range | Effect |
|---|---|---|
| seed | int | Must match between source and target runs. Different seed = broken layout transfer. |
| injection_ratio | 0.0–1.0 | Fraction of steps to inject. Lower = looser layout, more target-texture freedom. Currently dead code; patch as shown above. |
| prompt_target | str | Should be a minimal edit of prompt_source (one or two word swaps). |
| num_inference_steps | 20–50 | Must match between runs so saved_maps indexing lines up. |

Where to extend this

  • Word-level swap maps: instead of dumping every map, weight specific source-word columns ("sunny") onto specific target-word columns ("snowy") of the probs.
  • Map reweighting: scale specific text-token columns up or down to amplify/suppress concepts without swapping prompts.
  • Layer-selective injection: only inject at certain UNet resolutions (low-res down-blocks for global layout, high-res up-blocks for detail).

Comparison at a glance

| | Blended Diffusion | Concept Erasure | DDIM Inversion | Prompt-to-Prompt |
|---|---|---|---|---|
| Hack lives in | Latents | Noise prediction | Scheduler direction | UNet attention |
| Constraint domain | Spatial (mask) | Semantic (text vectors) | Temporal (time-reversal) | Architectural (attention maps) |
| Batch size during loop | 2 (uncond + cond) | 2 + N (+ N erase) | 1 (no CFG) | 2 (uncond + cond), in each of the two runs |
| Extra forward passes | 0 | 0 (batched) | 0 | +1 full generation (source run) |
| What it enables | Localized regional edits | Suppressing hallucinated concepts | Real-image editing & reconstruction | Semantic edits preserving layout |
| Key equation / mechanic | latent = mask·fg + (1−mask)·bg | noise = u + s·(c−u) − Σᵢ wᵢ·(eᵢ−u) | x_{t+1} = inverse_step(x_t, ε_θ) | attention_probs ← saved_maps[step] |
| Determinism | Stochastic (noise sample) | Stochastic | Deterministic | Stochastic, but seed-locked |

Requirements

  • Python ≥ 3.10
  • torch, diffusers, transformers, torchvision, pillow, numpy
  • CUDA GPU recommended (CPU works in float32 but is slow)

Why this repo exists

Most tutorials wrap StableDiffusionPipeline and simply call pipe(prompt). This repo does the opposite: the scripts rebuild each capability from its raw building blocks (Tutorial 04 instead hooks directly into the pipeline's UNet) so that the loop, the scheduler, the CFG math, and the VAE contract are all visible and editable. If you can read these scripts end-to-end, you are well placed to modify most diffusion pipelines.

License

Educational reference. Model weights (CompVis/stable-diffusion-v1-4, runwayml/stable-diffusion-v1-5, openai/clip-vit-large-patch14) follow their respective licenses on the Hugging Face Hub.
