# HF Diffusers Deconstruct Core99

> **Code Flow Reference** — a step-by-step cheat sheet for understanding how a Stable Diffusion pipeline is built from raw `diffusers` components.
>
> Every tutorial below maps 1-to-1 to a script in this repo and follows the **exact execution order** of the code. Read top-to-bottom and you read the pipeline.

## Tutorials

| # | Script | Technique | The "hack" lives in |
|---|---|---|---|
| 01 | [`blended_loop.py`](./blended_loop.py) | **Blended Latent Diffusion** — mask-guided regional editing | Latents (spatial blend per step) |
| 02 | [`concept_erasure.py`](./concept_erasure.py) | **Concept Erasure** — prevent the model from hallucinating banned concepts | Noise prediction (extra negative-CFG terms) |
| 03 | [`inversion_implemention.py`](./inversion_implemention.py) | **DDIM Inversion** — recover a noise latent that (approximately) reconstructs a real image | The scheduler (running it backwards) |
| 04 | [`prompt_to_prompt_attention.py`](./prompt_to_prompt_attention.py) | **Prompt-to-Prompt** — edit an image by swapping cross-attention maps | The UNet (custom `AttnProcessor` hook) |

---

## Shared Pipeline at a glance

```
__init__ → encode_to_latent → build_text_embeddings → denoising_loop → decode_from_latent
```

| Component | Role | Class |
|---|---|---|
| VAE | Image ↔ latent (4×64×64) | `AutoencoderKL` |
| Tokenizer | Text → token IDs | `CLIPTokenizer` |
| Text Encoder | Token IDs → embeddings `[B, 77, 768]` | `CLIPTextModel` |
| UNet | Predict noise at timestep `t` | `UNet2DConditionModel` |
| Scheduler | Manage noise schedule + sampling | `DPMSolverMultistepScheduler` (Tutorials 01 and 02), `DDIMScheduler` + `DDIMInverseScheduler` (Tutorial 03) |

🔑 **Two invariants used by every script in this repo:**

1. **VAE scaling:** multiply latents by `0.18215` after encode, divide by `0.18215` before decode.
2. **Mask/Latent grid:** `512 / 8 = 64`, so masks and noise live on a `64×64` grid.
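
The two invariants reduce to a little arithmetic. A minimal pure-Python sketch follows; the helper names (`latent_shape`, `scale_after_encode`, `unscale_before_decode`) are illustrative and do not come from the repo's scripts, and the constants are the SD v1.x values quoted above.

```python
# Illustrative helpers (hypothetical names, not part of the repo's scripts).
VAE_SCALE = 0.18215        # scaling factor quoted above for SD v1.x
VAE_DOWNSAMPLE = 8         # the VAE shrinks each spatial side by 8
LATENT_CHANNELS = 4


def latent_shape(height, width, batch=1):
    """Return the (B, C, H/8, W/8) latent shape for a pixel-space image size."""
    if height % VAE_DOWNSAMPLE or width % VAE_DOWNSAMPLE:
        raise ValueError("height and width must be multiples of 8")
    return (batch, LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE)


def scale_after_encode(value):
    """Apply the VAE scaling to a raw encoder output value."""
    return value * VAE_SCALE


def unscale_before_decode(value):
    """Undo the VAE scaling right before decoding."""
    return value / VAE_SCALE


print(latent_shape(512, 512))  # (1, 4, 64, 64)
```

A mask resized to the last two entries of that shape lines up cell-for-cell with the latent, which is the property the blending step in Tutorial 01 relies on.
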

---

# Tutorial 01 — Blended Latent Diffusion (`blended_loop.py`)

## Step 1 — Initialization (`__init__`)

**Logic.** We do *not* use `StableDiffusionPipeline`. Each component is loaded individually via `from_pretrained(...)` (with `subfolder=...` for the parts that live inside the Stable Diffusion repo), moved to the GPU when one is available, and cast to `float16` there for ~2× speed and ~50% VRAM savings relative to `float32`. On CPU the code falls back to `float32`. `eval()` disables dropout. The scheduler chosen here is **DPM-Solver++** with Karras sigmas — it converges in far fewer steps than DDIM.

```python
# Imports used by this excerpt; the lines below are the body of __init__.
import torch
import diffusers
from transformers import CLIPTextModel, CLIPTokenizer

model_name = "CompVis/stable-diffusion-v1-4"
text_model_name = "openai/clip-vit-large-patch14"

self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.dtype = torch.float16 if self.device == "cuda" else torch.float32

# Load components individually — no high-level pipeline wrapper
self.autoencoder = diffusers.AutoencoderKL.from_pretrained(model_name, subfolder="vae")
self.text_encoder = CLIPTextModel.from_pretrained(text_model_name)
self.tokenizer = CLIPTokenizer.from_pretrained(text_model_name)
self.unet = diffusers.UNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
self.scheduler = diffusers.DPMSolverMultistepScheduler.from_pretrained(
    model_name,
    subfolder="scheduler",
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)

# Cast + move + freeze
self.autoencoder.to(device=self.device, dtype=self.dtype).eval()
self.text_encoder.to(device=self.device, dtype=self.dtype).eval()
self.unet.to(device=self.device, dtype=self.dtype).eval()
```

🔑 **Cheat sheet:** *Whatever pipeline you build, it is always these 5 components + their dtype/device contract.*

## Step 2 — Latent Encoding (`encode_to_latent` & `preprocess_mask`)

**Logic.** Stable Diffusion does not denoise pixels — it denoises a `4×64×64` latent. The VAE encoder shrinks `512×512×3 → 4×64×64`, and outputs are scaled by the magic constant **`0.18215`** so that latents have roughly unit variance (this is what the UNet was trained on). The mask must be downsampled to the same `64×64` latent grid and broadcast across the 4 channels.
```python
def encode_to_latent(self, init_image):
    preprocess = transforms.Compose([
        transforms.Resize((512, 512)),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),  # → range [-1, 1]
    ])
    input_tensor = preprocess(init_image).unsqueeze(0).to(self.device, dtype=self.dtype)
    with torch.no_grad():
        latents = self.autoencoder.encode(input_tensor).latent_dist.sample()
    return latents * 0.18215  # 🔑 VAE scaling factor — DO NOT FORGET
```

```python
def preprocess_mask(self, mask_image):
    # Latent grid is 512 / 8 = 64
    # convert("L") guards against RGB/RGBA masks, which would yield 3 or 4 channels
    mask = mask_image.convert("L").resize((64, 64), resample=PIL.Image.NEAREST)
    mask = transforms.ToTensor()(mask).to(self.device, dtype=self.dtype)  # [1, 64, 64]
    return mask.unsqueeze(0)  # [1, 1, 64, 64] — broadcasts over 4 latent channels
```

🔑 **Cheat sheet:**

- `NEAREST` resampling — never `BILINEAR` for masks (it would leak grey edges).
- `unsqueeze(0)` gives shape `[1, 1, 64, 64]` → broadcasts cleanly against `[1, 4, 64, 64]` latents.
- `* 0.18215` on encode, `/ 0.18215` on decode.

## Step 3 — Text & Noise Prep (`generate_noise_from_prompt`)

**Logic.** Classifier-Free Guidance (CFG) needs **two** noise predictions per step: one conditioned on the prompt, one conditioned on an empty prompt. The trick is to get both from a single batched UNet call by concatenating the embeddings along the batch axis → shape `[2, 77, 768]`.

```python
# 1. Conditional (positive) prompt
text_inputs = self.tokenizer(
    prompt,
    padding="max_length",
    max_length=self.tokenizer.model_max_length,
    return_tensors="pt",
)
text_embeddings = self.text_encoder(text_inputs.input_ids.to(self.device)).last_hidden_state

# 2. Unconditional (empty) prompt — drives negative guidance
uncond_inputs = self.tokenizer(
    "",
    padding="max_length",
    max_length=self.tokenizer.model_max_length,
    return_tensors="pt",
)
uncond_embeddings = self.text_encoder(uncond_inputs.input_ids.to(self.device)).last_hidden_state

# 3. Stack them → one UNet call handles both branches in parallel
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])  # [2, 77, 768]

# 4. Base Gaussian noise — same shape as the latent
init_noise = torch.randn(latent_shape, device=self.device, dtype=self.dtype)
```

🔑 **Cheat sheet:** Order matters → `[uncond, cond]`. You'll `chunk(2)` in the same order during the loop.

## Step 4 — The Core Hack: Denoising Loop (`blend_latent_with_mask`)

**Logic.** At **every** timestep we re-noise the *clean original latent* up to the current `t` and use it as the **background**. The masked **foreground** is the latent actively being denoised. Spatial blending at every step is what keeps the unmasked region faithful to the original.

### 4a. Slice the schedule via `strength`

```python
self.scheduler.set_timesteps(num_inference_steps, device=self.device)
init_timestep_idx = int(num_inference_steps * (1 - strength))
timesteps = self.scheduler.timesteps[init_timestep_idx:]
if hasattr(self.scheduler, "set_begin_index"):
    self.scheduler.set_begin_index(init_timestep_idx)
```

### 4b. Initialize foreground noise

```python
# 🔑 FIX: wrap scalar timestep as a 1-D tensor or DPM++ throws IndexError
start_t = timesteps[0].item()
start_timestep_tensor = torch.tensor([start_t], device=self.device, dtype=torch.long)
latents_fg = self.scheduler.add_noise(latent_init, init_noise, start_timestep_tensor)

# Background noise vector — sampled ONCE, reused every step for trajectory consistency
fresh_bg_noise = torch.randn_like(latent_init)
```

### 4c. The loop — re-noise BG, blend, predict, CFG, step

```python
for idx, t in enumerate(timesteps):
    t_tensor = torch.tensor([t.item()], device=self.device, dtype=torch.long)

    # (A) Re-noise the ORIGINAL clean latent up to current t → background
    latents_bg = self.scheduler.add_noise(latent_init, fresh_bg_noise, t_tensor)

    # (B) 🔑 BLENDED LATENT DIFFUSION — the entire idea in one line:
    latents_fg = mask_tensor * latents_fg + (1.0 - mask_tensor) * latents_bg

    # (C) Duplicate input for CFG → batch becomes [2, 4, 64, 64]
    latent_model_input = torch.cat([latents_fg] * 2)
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

    # (D) Predict noise (uncond + cond in one forward pass)
    with torch.no_grad():
        noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # (E) 🔑 CFG math — extrapolate AWAY from uncond TOWARDS cond
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # (F) One step closer to t=0
    latents_fg = self.scheduler.step(noise_pred, t, latents_fg).prev_sample

# Final pin at t=0 → unmasked latents equal the encoded original exactly
# (the VAE encode/decode roundtrip itself is still lossy in pixel space)
latents_fg = mask_tensor * latents_fg + (1.0 - mask_tensor) * latent_init
```

🔑 **Cheat sheet — the three equations that define this whole script:**

| | Equation |
|---|---|
| **Blend** | `latent = mask · fg + (1 − mask) · bg` |
| **CFG** | `noise = noise_uncond + scale · (noise_cond − noise_uncond)` |
| **Step** | `fg ← scheduler.step(noise, t, fg).prev_sample` |

## Step 5 — Decoding (`decode_from_latent`)

**Logic.** Reverse Step 2. Undo the `0.18215` scale, run the VAE decoder, rescale `[-1, 1] → [0, 1]`, then re-arrange tensor axes to `(H, W, C)` for `PIL.Image.fromarray`.
```python
def decode_from_latent(self, blended_latent):
    latents = blended_latent / 0.18215  # 🔑 Undo VAE scaling
    with torch.no_grad():
        image_tensor = self.autoencoder.decode(latents).sample
    image_tensor = (image_tensor / 2 + 0.5).clamp(0, 1)  # [-1,1] → [0,1]
    image_tensor = image_tensor.cpu().permute(0, 2, 3, 1).float().numpy()
    image_numpy = (image_tensor * 255).astype("uint8")[0]
    return PIL.Image.fromarray(image_numpy)
```

🔑 **Cheat sheet:** `permute(0, 2, 3, 1)` = `(B, C, H, W) → (B, H, W, C)` before PIL.

## Step 6 — Execution (`main`)

```python
blended = BlendedLatentDiffusion()
output = blended.blended_latent_diffusion(
    init_image=PIL.Image.open("input.jpg"),
    mask_image=PIL.Image.open("mask.png"),  # white = edit region
    prompt="fluffy white clouds in a bright blue sky, highly detailed",
    num_inference_steps=25,
    strength=0.95,        # Near-full overwrite of masked area (1.0 = pure noise)
    guidance_scale=12.0,  # Strong prompt adhesion
)
output.save("output_image.jpg")
```

### Knob-tuning cheat sheet (blended_loop)

| Parameter | Range | Effect |
|---|---|---|
| `num_inference_steps` | 20–50 | More = higher quality, slower. DPM++ converges fast — 25 is a sweet spot. |
| `strength` | 0.0–1.0 | How far back in the noise schedule we start. `1.0` = pure noise inside mask. |
| `guidance_scale` | 1.0–15.0 | CFG weight. `7.5` standard. Higher = more prompt-faithful, more saturated. |
| `mask` (white) | binary | Region that will be regenerated. Black = preserved. |

---

# Tutorial 02 — Concept Erasure (`concept_erasure.py`)

> **The premise.** Standard CFG adds **one positive pull** toward the prompt. Concept Erasure adds **N negative pulls** — one per unwanted concept — so the model actively *avoids* hallucinating each of them. This is how you stop a "forest road at night" from sprouting streetlights and headlights it was never asked for.


**Pipeline overview** (parallel to Tutorial 01; the novel logic is in **Step 3** and **Step 5**):

```
__init__ → encode_image → generate_noise_from_prompts (N+2 embeddings)
                                      │
                                      ▼
                  Multi-Negative-Guidance loop (1 prompt − N erasures)
                                      │
                                      ▼
                              decode_from_latent
```

## Step 1 — Initialization (`__init__`)

Same component contract as Tutorial 01 — VAE, CLIP tokenizer + text encoder, UNet, DPM++ Karras scheduler, `float16` + `eval()` on CUDA. Nothing new at this layer; the technique is implemented entirely in **how we batch text and combine noise predictions**.

```python
self.autoencoder = diffusers.AutoencoderKL.from_pretrained(self.model_name, subfolder="vae")
self.tokenizer = CLIPTokenizer.from_pretrained(self.text_model_name)
self.text_model = CLIPTextModel.from_pretrained(self.text_model_name)
self.unet = diffusers.UNet2DConditionModel.from_pretrained(self.model_name, subfolder="unet")
self.scheduler = diffusers.DPMSolverMultistepScheduler.from_pretrained(
    self.model_name,
    subfolder="scheduler",
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)
```

## Step 2 — Latent Encoding (`encode_image`)

Identical to `encode_to_latent` from Tutorial 01: resize → `[-1, 1]` normalize → VAE encode → multiply by `0.18215`. No mask in this technique — the entire image is up for revision.

```python
def encode_image(self, image):
    preprocess = transforms.Compose([
        transforms.Resize((512, 512)),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ])
    image = preprocess(image).unsqueeze(0).to(self.device, dtype=self.dtype)
    latent = self.autoencoder.encode(image).latent_dist.sample()
    return latent * 0.18215
```

## Step 3 — Batched Text Embeddings (`generate_noise_from_prompts`)

🔑 **This is where the technique starts.** We stack **`[uncond, cond, erase_1, erase_2, …, erase_N]`** along the batch axis so a single UNet forward pass yields *all* noise predictions in parallel.
```python
text_input_embeddings = self.encode_text(text)  # [1, 77, 768] — positive prompt
uncond = self.encode_text("")                   # [1, 77, 768] — empty prompt

erasure_list = [self.encode_text(p) for p in erasure_prompt]
erasure_input = torch.cat(erasure_list, dim=0)  # [N, 77, 768] — banned concepts

# Final batch: [uncond, cond, erase_1, ..., erase_N] → shape [2 + N, 77, 768]
text_embeddings = torch.cat([uncond, text_input_embeddings, erasure_input], dim=0)

noise = torch.randn(latent_shape, device=self.device, dtype=self.dtype)
```

🔑 **Cheat sheet:** the order `[uncond, cond, *erase]` is a contract — you'll `chunk(2 + N)` in the loop and access `[0]`, `[1]`, `[2:]` in exactly that order.

## Step 4 — Setup Timesteps & Inject Noise

Same `strength`-based slicing idea as Tutorial 01, written here as "keep the last `int(steps * strength)` timesteps". `unsqueeze(0)` makes the start timestep a 1-D tensor so `add_noise` doesn't trip on a 0-D scalar with DPM++.

```python
self.scheduler.set_timesteps(num_inference_steps, device=self.device)
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
# Caveat: if init_timestep is 0 (very small strength), [-0:] selects the WHOLE schedule
timesteps = self.scheduler.timesteps[-init_timestep:]

start_timestep = timesteps[0].unsqueeze(0)  # 🔑 1-D tensor
latents = self.scheduler.add_noise(init_latent, noise, start_timestep)
```

## Step 5 — The Core Hack: Multi-Negative Guidance Loop

Every step the UNet runs on a batched input of `2 + N` copies of the same latent, each paired with a different text context. We split predictions into `uncond`, `cond`, and `erase_1…N`, then **add the prompt direction and subtract each erasure direction**.
```python
# Assumes the enclosing method runs under torch.no_grad(), as in Tutorial 01
total_chunks = 2 + len(erasure_prompt)

for t in timesteps:
    # (A) Repeat the latent (2+N) times so each copy pairs with one text context
    latent_model_input = torch.cat([latents] * total_chunks)
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

    # (B) ONE UNet call → all (2+N) noise predictions in parallel
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # (C) Split predictions in the SAME order they were stacked
    all_preds = noise_pred.chunk(total_chunks)
    noise_pred_uncond = all_preds[0]
    noise_pred_cond = all_preds[1]
    noise_pred_erase = all_preds[2:]  # tuple of N tensors

    # (D) 🔑 Standard CFG — pull TOWARD the prompt
    guided_noise = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)

    # (E) 🔑 Concept Erasure — push AWAY from each banned concept
    for noise_pred_e in noise_pred_erase:
        guided_noise -= erase_scale * (noise_pred_e - noise_pred_uncond)

    # (F) Step down to the next noise level
    latents = self.scheduler.step(guided_noise, t, latents).prev_sample
```

🔑 **Cheat sheet — the equations that define this technique:**

| | Equation |
|---|---|
| **Standard CFG** | `noise = u + s · (c − u)` |
| **Erasure term** *(per banned concept `eᵢ`)* | `noise −= wᵢ · (eᵢ − u)` |
| **Combined** | `noise = u + s·(c − u) − Σᵢ wᵢ·(eᵢ − u)` |

In the code above every `wᵢ` is the same value, `erase_scale`.

**Geometric intuition.** Each `(x − u)` is a *direction vector* in noise-prediction space pointing from "neutral" to that concept. CFG **adds** the prompt direction; erasure **subtracts** each banned direction. `erase_scale` is the magnitude of repulsion per banned concept.
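
To make the combined equation concrete, here is a small pure-Python sketch on toy two-element "noise predictions". The function name `multi_negative_guidance` and the toy vectors are made up for illustration; real predictions are `[1, 4, 64, 64]` tensors, but the per-element arithmetic is the same.

```python
def multi_negative_guidance(uncond, cond, erase_preds, guidance_scale, erase_scale):
    """Combine predictions as u + s*(c - u) - sum_i w*(e_i - u), element by element.

    uncond, cond : lists of floats (toy stand-ins for noise tensors)
    erase_preds  : list of lists, one per banned concept
    """
    # Standard CFG: pull toward the prompt direction (c - u).
    guided = [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]
    # Erasure: push away from each banned direction (e - u).
    for erase in erase_preds:
        guided = [g - erase_scale * (e - u) for g, e, u in zip(guided, erase, uncond)]
    return guided


# Prompt direction lies along axis 0, the banned concept along axis 1.
result = multi_negative_guidance(
    uncond=[0.0, 0.0],
    cond=[1.0, 0.0],
    erase_preds=[[0.0, 1.0]],
    guidance_scale=7.5,
    erase_scale=2.0,
)
print(result)  # [7.5, -2.0]
```

The first component is amplified by the prompt pull and the second is driven negative by the erasure term, which is the "add one direction, subtract the others" picture from the geometric intuition.
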
## Step 6 — Decoding (`decode_from_latent`)

Identical contract to Tutorial 01 — note this version `squeeze(0)`s the batch dim before `permute(1, 2, 0)`, instead of permuting and indexing `[0]`:

```python
def decode_from_latent(self, latent):
    image = self.autoencoder.decode(latent / 0.18215).sample  # 🔑 Undo VAE scaling
    image = (image / 2 + 0.5).clamp(0, 1)                     # [-1,1] → [0,1]
    image = image.cpu().squeeze(0).permute(1, 2, 0).float().numpy()  # (C,H,W) → (H,W,C)
    image = (image * 255).astype("uint8")
    return PIL.Image.fromarray(image)
```

## Step 7 — Execution (`main`)

```python
ce = ConceptErasure()
init_image = PIL.Image.open("scene_erasure.png").convert("RGB")

prompt = "A road at night in the forest"
erasure_prompt = ["Streetlights", "Headlights", "Tail lights", "Lamps", "Artificial lights"]

result = ce.concept_erasure(
    init_image=init_image,
    prompt=prompt,
    erasure_prompt=erasure_prompt,
    num_inference_steps=50,
    strength=0.3,        # Light denoise — preserve overall scene
    guidance_scale=7.5,  # Normal CFG strength
    erase_scale=10.0,    # 🔑 Aggressive repulsion from banned concepts
)
result.save("output_concept_erased.jpg")
```

### Knob-tuning cheat sheet (concept_erasure)

| Parameter | Range | Effect |
|---|---|---|
| `strength` | 0.0–1.0 | Re-denoise depth. `0.3` keeps structure intact while suppressing concepts. High `strength` may destroy scene composition. |
| `guidance_scale` | 1.0–15.0 | Strength of the **positive** prompt pull. |
| `erase_scale` | 1.0–15.0 | 🔑 The new knob. Strength of the **negative** pull per erasure prompt. Higher = stronger erasure but more artifacts. |
| `erasure_prompt` | `list[str]` | One concept per entry. Each adds **+1** to the UNet batch size at every step. |

---

# Tutorial 03 — DDIM Inversion (`inversion_implemention.py`)

> **The premise.** Standard generation goes *noise → image*.
> **DDIM Inversion** runs the same denoising network **in reverse**, walking timesteps `0 → T` instead of `T → 0`, to recover a noise latent that, when denoised again, reproduces a given real image as closely as the approximation allows. That noise becomes a deterministic *handle* you can later re-denoise with a different prompt → the foundation of many real-image editing techniques.

**Pipeline overview.** Two loops, one UNet, two schedulers:

```
             ┌──────────────────────────────────┐
real image ─►│ Inversion loop (0 → T)           │ ─► inverted noise x_T
             │ DDIMInverseScheduler             │
             └──────────────────────────────────┘
                               │
                               ▼
             ┌──────────────────────────────────┐
             │ Sampling loop (T → 0)            │ ─► reconstructed image
             │ DDIMScheduler                    │
             └──────────────────────────────────┘
```

## Step 1 — Initialization

**Logic.** Two schedulers, one UNet. `DDIMInverseScheduler` is built via `.from_config()` of the forward `DDIMScheduler` so they share the **exact same α-bar schedule** — this symmetry is what makes the inversion step the mirror image of the sampling step.

```python
from diffusers import AutoencoderKL, DDIMInverseScheduler, DDIMScheduler, UNet2DConditionModel

self.unet = UNet2DConditionModel.from_pretrained(self.model_name, subfolder="unet")
self.vae = AutoencoderKL.from_pretrained(self.model_name, subfolder="vae")

# 1. Standard scheduler for generation (T → 0)
self.noise_scheduler = DDIMScheduler.from_pretrained(self.model_name, subfolder="scheduler")

# 2. 🔑 Inverse scheduler for inversion math (0 → T) — same config = same α-schedule
self.inverse_scheduler = DDIMInverseScheduler.from_config(self.noise_scheduler.config)
```

🔑 **Cheat sheet:** DDIM (with `eta = 0`) is deterministic and has a ready-made mirror in `DDIMInverseScheduler`, which is why it is the standard choice for inversion. Simply running DPM++, Euler, or LMS backwards does not round-trip cleanly; other solvers need their own inverse counterpart (diffusers also ships a `DPMSolverMultistepInverseScheduler`, not used here).
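
Why does sharing the α-bar schedule matter? With `eta = 0`, the DDIM update between two noise levels is one closed-form expression, and the inverse step is the same expression with the two α-bar values swapped. The scalar sketch below is only an illustration under stated assumptions: the function name `ddim_move`, the α-bar values, and the fixed `eps` are all made up, and this is not the `DDIMInverseScheduler` source. It shows that the round trip is exact when the same noise prediction is used in both directions, which a real UNet only approximates.

```python
import math


def ddim_move(x, eps, alpha_bar_from, alpha_bar_to):
    """Deterministic DDIM update (eta = 0) from one noise level to another.

    x              : current latent value (a scalar here, a tensor in practice)
    eps            : noise prediction for x at the current level
    alpha_bar_from : cumulative alpha at the current timestep
    alpha_bar_to   : cumulative alpha at the destination timestep
    """
    # Estimate of the clean sample implied by x and eps.
    x0_pred = (x - math.sqrt(1.0 - alpha_bar_from) * eps) / math.sqrt(alpha_bar_from)
    # Re-noise that estimate to the destination level along the same eps.
    return math.sqrt(alpha_bar_to) * x0_pred + math.sqrt(1.0 - alpha_bar_to) * eps


# Toy values, chosen only for illustration.
x_start = 0.8
eps = -0.3
a_low_noise, a_high_noise = 0.95, 0.60

x_noisier = ddim_move(x_start, eps, a_low_noise, a_high_noise)  # inversion direction
x_back = ddim_move(x_noisier, eps, a_high_noise, a_low_noise)   # sampling direction
x_drift = ddim_move(x_noisier, eps + 0.1, a_high_noise, a_low_noise)  # mismatched eps

print(abs(x_back - x_start) < 1e-9)  # True
```

`x_drift` does not return to `x_start`: when the noise prediction on the way back differs from the one used on the way out, the mismatch shows up as reconstruction error. In the real pipeline the UNet is evaluated at slightly different latents in each direction, so some error of this kind is expected.
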
## Step 2 — Latent Encoding (`get_latent_image`)

**Logic.** Two important differences from Tutorial 01's encoder — both critical for a faithful round-trip:

```python
init_latents = self.vae.encode(image_tensor).latent_dist.mode()  # 🔑 .mode(), not .sample()
init_latents = init_latents * self.vae.config.scaling_factor     # 🔑 from config, not hardcoded
```

🔑 **Cheat sheet:**

- `.mode()` returns the **deterministic** mean of the VAE's posterior — no random draw, so the round-trip is reproducible. `.sample()` would inject noise on encode and break inversion.
- `vae.config.scaling_factor == 0.18215` for SD-v1.4, but reading from config is robust across model variants.

## Step 3 — Text Embeddings (`get_text_embeddings`)

**Logic.** Just **one** embedding — no uncond, no CFG. Pure DDIM inversion is deterministic and runs a single text context. (Variants like *Null-Text Inversion* re-introduce CFG and optimize the uncond embedding — separate technique, separate tutorial.)

```python
text_embeddings = self.text_encoder(**inputs).last_hidden_state  # [1, 77, 768] — no uncond
```

## Step 4 — The Forward Loop: Inversion (`ddim_invers`)

**Logic.** Walk timesteps `0 → T`. At each step, predict noise with the UNet, then ask the **inverse** scheduler to push the latent one step **further into noise**.

```python
self.inverse_scheduler.set_timesteps(num_inference_steps, device=self.device)
timesteps = self.inverse_scheduler.timesteps  # 🔑 ascending: 0 → ~999

latents = init_latents.clone()
for idx, t in enumerate(timesteps):
    noise_pred = self.unet(latents, t, encoder_hidden_states=text_embeddings).sample

    # 🔑 .step() on the INVERSE scheduler walks FORWARD in time.
    # The field is still called .prev_sample but it's now the NEXT (noisier) state.
    latents = self.inverse_scheduler.step(noise_pred, t, latents).prev_sample
```

🔑 **Cheat sheet — leaky abstraction watch:** `DDIMInverseScheduler.step().prev_sample` is misnamed — for the inverse scheduler it means "next-step output".
The diffusers API reuses the field name; only the direction of travel reverses.

## Step 5 — The Reverse Loop: Sampling (`ddim_sampling`)

**Logic.** Identical control flow, but with the *forward* scheduler. Start from the inverted noise (or any noise of the right shape) and walk `T → 0` to recover an image.

```python
self.noise_scheduler.set_timesteps(num_inference_steps, device=self.device)
timesteps = self.noise_scheduler.timesteps  # descending: ~999 → 0

latents = inverted_latents.clone()
for idx, t in enumerate(timesteps):
    noise_pred = self.unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = self.noise_scheduler.step(noise_pred, t, latents).prev_sample
```

🔑 **Cheat sheet:** Same UNet, same prompt, opposite scheduler. The pair `(invert → sample)` should reconstruct the input up to a small approximation error: inversion assumes the noise prediction changes little between adjacent timesteps, so the round trip is close but not exact. If the result is visibly off → debug your encode (`.mode()`?), your scheduler (DDIM-only?), or your prompt (must match the source).

## Step 6 — Decoding (`vae_decoder`)

Standard VAE round-trip — divide by `0.18215`, decode, rescale, permute, return PIL. The constant is hardcoded in this function; `self.vae.config.scaling_factor` would mirror Step 2.

```python
def vae_decoder(self, latents):
    latents = 1 / 0.18215 * latents
    image = self.vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
    return PIL.Image.fromarray((image[0] * 255).astype("uint8"))
```

## Step 7 — Execution (`main`)

Invert real image → optionally snapshot intermediate latents → re-sample → save.

```python
pipeline = InversionImplementationDDIM()

# 1. Inversion: real image → noise
inverted_noise, inversion_visuals = pipeline.ddim_invers(
    num_inference_steps=50,
    init_image="Road_in_Norway.jpg",
    prompt="a photo of a road in norway",
    visual_steps=[0, 1, 2],  # capture early-step latents for debugging
)

# 2. Sampling: noise → reconstructed image
reconstructed_image, sampling_visuals = pipeline.ddim_sampling(
    num_inference_steps=50,
    inverted_latents=inverted_noise,
    prompt="a photo of a road in norway",
    visual_steps=[0, 1, 2],
)
reconstructed_image.save("reconstructed_final.jpg")
```

🔑 **Cheat sheet:** The reconstruction quality is your inversion's report card. If the round-trip image differs visibly from the input → check (1) `.mode()` on encode, (2) DDIM schedulers on both sides, (3) same prompt + same step count both directions.

### Editing flow

Inversion alone reconstructs. To **edit**, change the prompt during the **sampling** call:

```python
inverted, _ = pipeline.ddim_invers(50, "road.jpg", prompt="a photo of a road in norway")
edited, _ = pipeline.ddim_sampling(50, inverted, prompt="a photo of a snowy road in norway")
```

### Knob-tuning cheat sheet (inversion)

| Parameter | Range | Effect |
|---|---|---|
| `num_inference_steps` | 50–200 | More steps usually give a more faithful round-trip: each step's approximation error shrinks, although errors still accumulate over the longer trajectory. |
| `prompt` | str | **Must describe the source image accurately**. A mismatched prompt biases the inverted noise and degrades reconstruction. |
| `visual_steps` | `list[int]` | Indices to capture for debugging the inversion trajectory. |

---

# Tutorial 04 — Prompt-to-Prompt Attention Injection (`prompt_to_prompt_attention.py`)

> **The premise.** Cross-attention maps inside the UNet encode *how strongly each image location attends to each word of the prompt*. If you **save** every cross-attention map from a source run, then **overwrite** them on a target run with a slightly different prompt — keeping the random seed identical — the spatial layout of the source carries over while the target prompt repaints the *content* within that layout.
>
> No retraining, and no extra UNet calls inside the target run; the only added cost is the source run that records the maps. One tensor patched at the right place in the UNet's forward pass.

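
The save-then-overwrite idea can be seen on a toy single-head attention written in plain Python. Everything here is illustrative: `toy_attention`, the tiny matrices, and the `override_probs` argument are hypothetical stand-ins, not the diffusers API. The point is only that once the probabilities are replaced, the source run decides *where* each image location looks while the target run's values decide *what* it reads.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def toy_attention(queries, keys, values, override_probs=None):
    """Single-head attention on nested lists; optionally swap in saved probabilities."""
    dim = len(queries[0])
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(dim) for k_row in keys]
        for q_row in queries
    ]
    probs = [softmax(row) for row in scores]
    if override_probs is not None:
        probs = override_probs  # the "inject" step
    out = [
        [sum(p * v_row[d] for p, v_row in zip(p_row, values)) for d in range(len(values[0]))]
        for p_row in probs
    ]
    return out, probs


# Two image locations (queries) attending over two text tokens (keys/values).
queries = [[1.0, 0.0], [0.0, 1.0]]
source_keys, source_values = [[2.0, 0.0], [0.0, 2.0]], [[1.0], [0.0]]
target_keys, target_values = [[0.0, 2.0], [2.0, 0.0]], [[5.0], [9.0]]

_, saved = toy_attention(queries, source_keys, source_values)              # source run: save
free_out, free_probs = toy_attention(queries, target_keys, target_values)  # target, no injection
inj_out, inj_probs = toy_attention(queries, target_keys, target_values, override_probs=saved)
```

Without injection, the first location attends mostly to the second token of the target prompt; with injection it keeps the source's preference for the first token, yet the output is built from the target's values.
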

**Pipeline overview.** Two runs, same noise seed, two different attention processors:

```
seed=42 → Source prompt ─► SaveCrossAttnProcessor ─► source image + saved_maps[]
                                                                        │
                                                                        ▼
seed=42 → Target prompt ─► InjectCrossAttnProcessor(saved_maps) ─► target image
                           (overrides cross-attn probs)
```

## Step 0 — The architectural prerequisite

**What is an attention processor?** Every transformer block in the diffusers UNet routes its attention through a swappable `AttnProcessor`. The default one does standard QKV math. By subclassing and registering your own via `unet.set_attn_processor(...)`, you get a hook inside **every** attention computation — and can either **observe** Q/K/V/probs or **mutate** them.

**Cross-attention vs Self-attention.** Inside `__call__`:

```python
is_cross_attention = encoder_hidden_states is not None
```

- **Self-attention** → `encoder_hidden_states is None` → image attending to itself, controls texture/coherence.
- **Cross-attention** → `encoder_hidden_states` is CLIP text embeddings → image attending to text, controls **what is where**. *This is the only one P2P touches.*

## Step 1 — The Save Processor

**Logic.** Run the full standard attention computation, but right after computing `attention_probs`, snapshot the cross-attention probability matrix into a list.
```python
class SaveCrossAttnProcessor:
    def __init__(self):
        self.attention_maps = []

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None, **kwargs):
        # Standard QKV math (Q from image, K/V from text for cross-attn)
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))

        is_cross_attention = encoder_hidden_states is not None
        if not is_cross_attention:
            encoder_hidden_states = hidden_states

        key = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
        value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))

        # Attention probabilities — shape [B·heads, seq_image, seq_text] for cross-attn
        attention_probs = attn.get_attention_scores(query, key, attention_mask)

        # 🔑 The SAVE: only cross-attention, detached copy
        if is_cross_attention:
            self.attention_maps.append(attention_probs.detach().clone())

        # Standard tail: weighted sum + output projection
        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)
        return hidden_states
```

🔑 **Cheat sheet:**

- Cross-attention probs shape: `[B·heads, seq_image, seq_text]` (e.g. `[16, 4096, 77]` for batch 2 × 8 heads × `64×64` tokens × 77 CLIP tokens). That is the highest-resolution case; the lower-resolution blocks see `32×32`, `16×16`, and `8×8` image tokens.
- `.detach().clone()` → detach from autograd, and clone so the saved map owns its own memory instead of aliasing a tensor the UNet keeps working with.
- The order of saved maps is: **outer = timestep, inner = layer-by-layer in UNet forward order**. The inject processor consumes them in exactly the same order.

## Step 2 — The Inject Processor

**Logic.** Identical QKV computation, but right before the weighted sum, **replace** the freshly computed `attention_probs` with the corresponding saved map.
```python
class InjectCrossAttnProcessor:
    def __init__(self, saved_maps, injection_ratio=0.8):
        self.saved_maps = saved_maps
        self.injection_ratio = injection_ratio
        self.step = 0  # counts cross-attention CALLS (timestep × layer), not timesteps

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None, **kwargs):
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))

        is_cross_attention = encoder_hidden_states is not None
        if not is_cross_attention:
            encoder_hidden_states = hidden_states

        key = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
        value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))

        attention_probs = attn.get_attention_scores(query, key, attention_mask)

        # 🔑 The OVERRIDE: swap target's probs for the source's saved probs
        if is_cross_attention:
            if self.step < len(self.saved_maps):
                attention_probs = self.saved_maps[self.step]
            self.step += 1

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)
        return hidden_states
```

🔑 **Note on `injection_ratio`.** In the current implementation, the field is **stored but unused** — every step where a saved map exists gets overridden. To get the canonical P2P behavior (inject only in the **first N%** of steps so the model can refine texture freely at the end), change the condition to:

```python
total = len(self.saved_maps)
if self.step < int(self.injection_ratio * total):
    attention_probs = self.saved_maps[self.step]
self.step += 1
```

Because the maps are stored timestep-major, the first `injection_ratio` fraction of calls corresponds to the earliest timesteps. Early-step attention controls **layout**; late-step attention refines **texture**. Limiting injection to early steps preserves the source's geometry while letting the target prompt repaint detail.

## Step 3 — Execution (`main`)

**Logic.** Two pipeline calls, same locked seed. Between them, swap the processor on the UNet.
```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"  # float16 weights assume a CUDA GPU

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)

prompt_source = "A driving dashcam view of a sunny road in Norway"
prompt_target = "A driving dashcam view of a snowy road in Norway"

# ─── Source run ─────────────────────────────────────────
generator = torch.manual_seed(42)             # 🔑 LOCK SEED
save_processor = SaveCrossAttnProcessor()
pipe.unet.set_attn_processor(save_processor)  # 🔑 inject the SAVE hook
source_image = pipe(prompt_source, generator=generator, num_inference_steps=50).images[0]

# ─── Target run ─────────────────────────────────────────
generator = torch.manual_seed(42)             # 🔑 SAME SEED → same initial noise
inject_processor = InjectCrossAttnProcessor(saved_maps=save_processor.attention_maps)
pipe.unet.set_attn_processor(inject_processor)  # 🔑 swap to INJECT hook
target_image = pipe(prompt_target, generator=generator, num_inference_steps=50).images[0]
```

🔑 **Why the seed lock matters.** P2P relies on the **initial latent noise being identical** between runs. Different noise → different geometry from step 1 → the saved cross-attention maps no longer correspond to anything in the target run's spatial layout.

🔑 **Why the target prompt should be a minimal edit.** P2P only carries over spatial *layout*. If `prompt_target` is structurally very different from `prompt_source` (changing nouns, verbs, and composition at once), the injected maps will fight the target prompt and you'll get artifacts. Word-level swaps and adjective changes → clean results.

### Knob-tuning cheat sheet (prompt-to-prompt)

| Parameter | Range | Effect |
|---|---|---|
| `seed` | int | **Must match** between source and target runs. Different seed = broken layout transfer. |
| `injection_ratio` | 0.0–1.0 | Fraction of steps to inject. Lower = looser layout, more target-texture freedom. **Currently dead code — patch as shown above.** |
| `prompt_target` | str | Should be a minimal edit of `prompt_source` (one or two word swaps). |
| `num_inference_steps` | 20–50 | Must match between runs so `saved_maps` indexing lines up. |

### Where to extend this

- **Word-level swap maps** — instead of dumping every map, weight specific source-word columns ("sunny") onto specific target-word columns ("snowy") of the probs.
- **Map reweighting** — scale specific text-token columns up or down to amplify/suppress concepts without swapping prompts.
- **Layer-selective injection** — only inject at certain UNet resolutions (low-res down-blocks for global layout, high-res up-blocks for detail).

---

## Comparison — at a glance

| | Blended Diffusion | Concept Erasure | DDIM Inversion | Prompt-to-Prompt |
|---|---|---|---|---|
| **Hack lives in** | Latents | Noise prediction | Scheduler direction | UNet attention |
| **Constraint domain** | Spatial (mask) | Semantic (text vectors) | Temporal (time-reversal) | Architectural (attention maps) |
| **Batch size during loop** | `2` (uncond + cond) | `2 + N` (+ N erase) | `1` (no CFG) | `2` (uncond + cond) — but **twice** |
| **Extra forward passes** | 0 | 0 (batched) | 0 | +1 full generation (source run) |
| **What it enables** | Localized regional edits | Suppressing hallucinated concepts | Real-image editing & reconstruction | Semantic edits preserving layout |
| **Key equation / mechanic** | `latent = mask·fg + (1−mask)·bg` | `noise = u + s·(c−u) − Σᵢ wᵢ·(eᵢ−u)` | `x_{t+1} = inverse_step(x_t, ε_θ)` | `attention_probs ← saved_maps[step]` |
| **Determinism** | Stochastic (noise sample) | Stochastic | **Deterministic** | Stochastic, but seed-locked |

---

## Requirements

- Python ≥ 3.10
- `torch`, `diffusers`, `transformers`, `torchvision`, `pillow`, `numpy`
- CUDA GPU recommended (CPU works in `float32` but is slow)

## Why this repo exists

Most tutorials wrap `StableDiffusionPipeline` and simply call the pipeline object (`pipe(prompt)`).
This repo does the opposite: the scripts **rebuild** each capability from its raw building blocks so that the loop, the scheduler, the CFG math, and the VAE contract are all visible and editable. (Tutorial 04 is the one exception: it hooks a stock `StableDiffusionPipeline` and changes only the attention processor.) If you can read these scripts end-to-end, you can modify any diffusion pipeline.

## License

Educational reference. Model weights (`CompVis/stable-diffusion-v1-4`, `runwayml/stable-diffusion-v1-5`, `openai/clip-vit-large-patch14`) follow their respective licenses on the Hugging Face Hub.