HF Diffusers Deconstruct Core99
Code Flow Reference: a step-by-step cheat sheet for understanding how a Stable Diffusion pipeline is built from raw diffusers components.
Every tutorial below maps 1-to-1 to a script in this repo and follows the exact execution order of the code. Read top-to-bottom and you read the pipeline.
Tutorials
| # | Script | Technique | The "hack" lives in |
|---|---|---|---|
| 01 | blended_loop.py | Blended Latent Diffusion: mask-guided regional editing | Latents (spatial blend per step) |
| 02 | concept_erasure.py | Concept Erasure: steer the model away from hallucinating banned concepts | Noise prediction (extra negative-CFG terms) |
| 03 | inversion_implemention.py | DDIM Inversion: recover (approximately) the noise that reconstructs a real image | The scheduler (running it backwards) |
| 04 | prompt_to_prompt_attention.py | Prompt-to-Prompt: edit an image by swapping cross-attention maps | The UNet (custom AttnProcessor hook) |
Shared Pipeline at a glance
__init__ → encode_to_latent → build_text_embeddings → denoising_loop → decode_from_latent
| Component | Role | Class |
|---|---|---|
| VAE | Image ↔ latent (4×64×64) | AutoencoderKL |
| Tokenizer | Text → token IDs | CLIPTokenizer |
| Text Encoder | Token IDs → embeddings [B, 77, 768] | CLIPTextModel |
| UNet | Predict noise at timestep t | UNet2DConditionModel |
| Scheduler | Manage noise schedule + sampling | DPMSolverMultistepScheduler (Tutorials 01 and 02), DDIMScheduler (Tutorial 03) |
Two invariants used by every script in this repo:
- VAE scaling: multiply latents by 0.18215 after encode, divide by 0.18215 before decode.
- Mask/Latent grid: 512 / 8 = 64, so masks and noise live on a 64×64 grid.
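The two invariants can be checked with plain arithmetic. The sketch below is illustrative only: latent_side, scale_after_encode and unscale_before_decode are hypothetical helper names, and a single float stands in for a latent tensor.

```python
VAE_SCALE = 0.18215      # scaling factor quoted above for SD v1.x
VAE_DOWNSAMPLE = 8       # the VAE shrinks each spatial side by a factor of 8


def latent_side(image_side: int) -> int:
    """Side length of the latent grid for a square image."""
    if image_side % VAE_DOWNSAMPLE != 0:
        raise ValueError("image side must be a multiple of 8")
    return image_side // VAE_DOWNSAMPLE


def scale_after_encode(raw_latent: float) -> float:
    """Applied right after the VAE encoder."""
    return raw_latent * VAE_SCALE


def unscale_before_decode(scaled_latent: float) -> float:
    """Applied right before the VAE decoder."""
    return scaled_latent / VAE_SCALE


side = latent_side(512)                                        # grid used for masks and noise
round_trip = unscale_before_decode(scale_after_encode(3.0))   # scaling cancels out
```

Forgetting either half of the scaling pair leaves the latents at the wrong magnitude for the UNet or the decoder, which is why the same constant shows up in every tutorial.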
Tutorial 01: Blended Latent Diffusion (blended_loop.py)
Step 1: Initialization (__init__)
Logic. We do not use StableDiffusionPipeline. Each component is loaded individually via from_pretrained (with subfolder=... for the diffusers weights), and the three networks (VAE, text encoder, UNet) are moved to the GPU and cast to float16, which roughly halves weight memory and typically speeds up inference on recent GPUs. eval() disables dropout. The scheduler chosen here is DPM-Solver++ with Karras sigmas, which usually reaches good quality in fewer steps than DDIM.
model_name = "CompVis/stable-diffusion-v1-4"
text_model_name = "openai/clip-vit-large-patch14"
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.dtype = torch.float16 if self.device == "cuda" else torch.float32
# Load components individually: no high-level pipeline wrapper
self.autoencoder = diffusers.AutoencoderKL.from_pretrained(model_name, subfolder="vae")
self.text_encoder = CLIPTextModel.from_pretrained(text_model_name)
self.tokenizer = CLIPTokenizer.from_pretrained(text_model_name)
self.unet = diffusers.UNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
self.scheduler = diffusers.DPMSolverMultistepScheduler.from_pretrained(
    model_name, subfolder="scheduler",
    algorithm_type="dpmsolver++", use_karras_sigmas=True,
)
# Cast + move + freeze
self.autoencoder.to(device=self.device, dtype=self.dtype).eval()
self.text_encoder.to(device=self.device, dtype=self.dtype).eval()
self.unet.to(device=self.device, dtype=self.dtype).eval()
Cheat sheet: Whatever pipeline you build, it is always these 5 components + their dtype/device contract.
Step 2: Latent Encoding (encode_to_latent & preprocess_mask)
Logic. Stable Diffusion does not denoise pixels; it denoises a 4×64×64 latent. The VAE encoder shrinks 512×512×3 → 4×64×64, and outputs are scaled by the magic constant 0.18215 so that latents have roughly unit variance (this is what the UNet was trained on). The mask must be downsampled to the same 64×64 latent grid and broadcast across the 4 channels.
def encode_to_latent(self, init_image):
    preprocess = transforms.Compose([
        transforms.Resize((512, 512)),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),  # [0, 1] → [-1, 1]
    ])
    input_tensor = preprocess(init_image).unsqueeze(0).to(self.device, dtype=self.dtype)
    with torch.no_grad():
        latents = self.autoencoder.encode(input_tensor).latent_dist.sample()
    return latents * 0.18215  # VAE scaling factor: DO NOT FORGET
def preprocess_mask(self, mask_image):
    # Latent grid is 512 / 8 = 64. Convert to single-channel first so an RGB
    # mask does not produce a [3, 64, 64] tensor.
    mask = mask_image.convert("L").resize((64, 64), resample=PIL.Image.NEAREST)
    mask = transforms.ToTensor()(mask).to(self.device, dtype=self.dtype)  # [1, 64, 64]
    return mask.unsqueeze(0)  # [1, 1, 64, 64]: broadcasts over 4 latent channels
Cheat sheet:
- NEAREST resampling, never BILINEAR, for masks (it would leak grey edges).
- unsqueeze(0) gives shape [1, 1, 64, 64], which broadcasts cleanly against [1, 4, 64, 64] latents.
- * 0.18215 on encode, / 0.18215 on decode.
Step 3: Text & Noise Prep (generate_noise_from_prompt)
Logic. Classifier-Free Guidance (CFG) needs two noise predictions per step: one with the prompt, one with an empty prompt. The trick is to get both from a single batched UNet call by concatenating the embeddings along the batch axis, giving shape [2, 77, 768].
# 1. Conditional (positive) prompt
text_inputs = self.tokenizer(prompt, padding="max_length",
                             max_length=self.tokenizer.model_max_length, return_tensors="pt")
text_embeddings = self.text_encoder(text_inputs.input_ids.to(self.device)).last_hidden_state
# 2. Unconditional (empty) prompt: the baseline that CFG extrapolates away from
uncond_inputs = self.tokenizer("", padding="max_length",
                               max_length=self.tokenizer.model_max_length, return_tensors="pt")
uncond_embeddings = self.text_encoder(uncond_inputs.input_ids.to(self.device)).last_hidden_state
# 3. Stack them: one UNet call handles both branches in parallel
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])  # [2, 77, 768]
# 4. Base Gaussian noise: same shape as the latent
init_noise = torch.randn(latent_shape, device=self.device, dtype=self.dtype)
Cheat sheet: Order matters: [uncond, cond]. You'll chunk(2) in the same order during the loop.
Step 4: The Core Hack: Denoising Loop (blend_latent_with_mask)
Logic. At every timestep we re-noise the clean original latent up to the current t and use it as the background. The masked foreground is the latent actively being denoised. Spatial blending at every step is what keeps the unmasked region faithful to the original, up to the loss of the VAE encode/decode round trip.
4a. Slice the schedule via strength
self.scheduler.set_timesteps(num_inference_steps, device=self.device)
init_timestep_idx = int(num_inference_steps * (1 - strength))
timesteps = self.scheduler.timesteps[init_timestep_idx:]
if hasattr(self.scheduler, "set_begin_index"):
    self.scheduler.set_begin_index(init_timestep_idx)
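The slicing in 4a can be sketched on a plain list. This is a toy under stated assumptions: slice_by_strength is a hypothetical helper, and the list of integers only stands in for scheduler.timesteps (noisiest timestep first).

```python
def slice_by_strength(timesteps, strength):
    """Keep the tail of a descending schedule, mirroring step 4a.

    Returns the start index and the timesteps that will actually be run.
    """
    num_steps = len(timesteps)
    start_idx = int(num_steps * (1 - strength))
    return start_idx, timesteps[start_idx:]


# Hypothetical 10-step schedule, noisiest timestep first.
schedule = [900, 800, 700, 600, 500, 400, 300, 200, 100, 0]

idx_full, kept_full = slice_by_strength(schedule, 1.0)   # start from the noisiest step
idx_half, kept_half = slice_by_strength(schedule, 0.5)   # skip the noisier half

# The execution example later uses 25 steps with strength=0.95.
idx_095, kept_095 = slice_by_strength(list(range(24, -1, -1)), 0.95)
```

Higher strength keeps more of the schedule, so the foreground starts from a noisier state and drifts further from the original image.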
4b. Initialize foreground noise
# FIX: wrap scalar timestep as a 1-D tensor or DPM++ throws IndexError
start_t = timesteps[0].item()
start_timestep_tensor = torch.tensor([start_t], device=self.device, dtype=torch.long)
latents_fg = self.scheduler.add_noise(latent_init, init_noise, start_timestep_tensor)
# Background noise vector: sampled ONCE, reused every step for trajectory consistency
fresh_bg_noise = torch.randn_like(latent_init)
4c. The loop: re-noise BG, blend, predict, CFG, step
for idx, t in enumerate(timesteps):
    t_tensor = torch.tensor([t.item()], device=self.device, dtype=torch.long)
    # (A) Re-noise the ORIGINAL clean latent up to current t → background
    latents_bg = self.scheduler.add_noise(latent_init, fresh_bg_noise, t_tensor)
    # (B) BLENDED LATENT DIFFUSION: the entire idea in one line
    latents_fg = mask_tensor * latents_fg + (1.0 - mask_tensor) * latents_bg
    # (C) Duplicate input for CFG: batch becomes [2, 4, 64, 64]
    latent_model_input = torch.cat([latents_fg] * 2)
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
    # (D) Predict noise (uncond + cond in one forward pass)
    with torch.no_grad():
        noise_pred = self.unet(latent_model_input, t,
                               encoder_hidden_states=text_embeddings).sample
    # (E) CFG math: extrapolate AWAY from uncond TOWARDS cond
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    # (F) One step closer to t=0
    latents_fg = self.scheduler.step(noise_pred, t, latents_fg).prev_sample
# Final pin at t=0: unmasked latents are exactly the original latents
# (the decoded pixels still go through the lossy VAE round trip)
latents_fg = mask_tensor * latents_fg + (1.0 - mask_tensor) * latent_init
Cheat sheet: the three equations that define this whole script:
| Name | Equation |
|---|---|
| Blend | latent = mask · fg + (1 − mask) · bg |
| CFG | noise = noise_uncond + scale · (noise_cond − noise_uncond) |
| Step | fg ← scheduler.step(noise, t, fg).prev_sample |
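The blend and CFG equations above can be tried on tiny flat lists. This is only a sketch: blend and cfg are hypothetical helpers, and Python lists stand in for the [1, 4, 64, 64] tensors used in the script.

```python
def blend(mask, fg, bg):
    """Per-element mask * fg + (1 - mask) * bg."""
    return [m * f + (1.0 - m) * b for m, f, b in zip(mask, fg, bg)]


def cfg(noise_uncond, noise_cond, scale):
    """Classifier-free guidance: u + scale * (c - u)."""
    return [u + scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


mask = [1.0, 0.0, 1.0, 0.0]        # 1 = edit region, 0 = preserved region
fg = [10.0, 10.0, 10.0, 10.0]      # latent currently being denoised
bg = [-1.0, -2.0, -3.0, -4.0]      # re-noised original latent

blended = blend(mask, fg, bg)      # foreground where mask is 1, background elsewhere

u = [0.0, 1.0]                     # unconditional prediction
c = [1.0, 3.0]                     # conditional prediction
same_as_cond = cfg(u, c, 1.0)      # scale 1 returns the conditional prediction
extrapolated = cfg(u, c, 2.0)      # scale > 1 overshoots past it
```

With a binary mask the blend is a pure selection, which is why the preserved region never sees the prompt-driven foreground.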
Step 5: Decoding (decode_from_latent)
Logic. Reverse Step 2. Undo the 0.18215 scale, run the VAE decoder, rescale [-1, 1] → [0, 1], then re-arrange tensor axes to (H, W, C) for PIL.Image.fromarray.
def decode_from_latent(self, blended_latent):
    latents = blended_latent / 0.18215  # Undo VAE scaling
    with torch.no_grad():
        image_tensor = self.autoencoder.decode(latents).sample
    image_tensor = (image_tensor / 2 + 0.5).clamp(0, 1)  # [-1, 1] → [0, 1]
    image_tensor = image_tensor.cpu().permute(0, 2, 3, 1).float().numpy()
    image_numpy = (image_tensor * 255).astype("uint8")[0]
    return PIL.Image.fromarray(image_numpy)
Cheat sheet: permute(0, 2, 3, 1) = (B, C, H, W) → (B, H, W, C) before PIL.
Step 6: Execution (main)
blended = BlendedLatentDiffusion()
output = blended.blended_latent_diffusion(
    init_image=PIL.Image.open("input.jpg"),
    mask_image=PIL.Image.open("mask.png"),  # white = edit region
    prompt="fluffy white clouds in a bright blue sky, highly detailed",
    num_inference_steps=25,
    strength=0.95,        # Near-full overwrite of masked area
    guidance_scale=12.0,  # Strong prompt adhesion
)
output.save("output_image.jpg")
Knob-tuning cheat sheet (blended_loop)
| Parameter | Range | Effect |
|---|---|---|
| num_inference_steps | 20–50 | More = higher quality, slower. DPM++ converges fast; 25 is a sweet spot. |
| strength | 0.0–1.0 | How far back in the noise schedule we start. 1.0 = pure noise inside mask. |
| guidance_scale | 1.0–15.0 | CFG weight. 7.5 standard. Higher = more prompt-faithful, more saturated. |
| mask (white) | binary | Region that will be regenerated. Black = preserved. |
Tutorial 02: Concept Erasure (concept_erasure.py)
The premise. Standard CFG adds one positive pull toward the prompt. Concept Erasure adds N negative pulls, one per unwanted concept, so the model actively avoids hallucinating each of them. This is how you stop a "forest road at night" from sprouting streetlights and headlights it was never asked for.
Pipeline overview (parallel to Tutorial 01; the novel logic is in Step 3 and Step 5):
__init__ → encode_image → generate_noise_from_prompts (N+2 embeddings)
        │
        ▼
Multi-Negative-Guidance loop (1 prompt − N erasures)
        │
        ▼
decode_from_latent
Step 1: Initialization (__init__)
Same component contract as Tutorial 01: VAE, CLIP tokenizer + text encoder, UNet, DPM++ Karras scheduler, float16 + eval() on CUDA. Nothing new at this layer; the technique is implemented entirely in how we batch text and combine noise predictions.
self.autoencoder = diffusers.AutoencoderKL.from_pretrained(self.model_name, subfolder="vae")
self.tokenizer = CLIPTokenizer.from_pretrained(self.text_model_name)
self.text_model = CLIPTextModel.from_pretrained(self.text_model_name)
self.unet = diffusers.UNet2DConditionModel.from_pretrained(self.model_name, subfolder="unet")
self.scheduler = diffusers.DPMSolverMultistepScheduler.from_pretrained(
    self.model_name, subfolder="scheduler",
    algorithm_type="dpmsolver++", use_karras_sigmas=True,
)
Step 2: Latent Encoding (encode_image)
Identical to encode_to_latent from Tutorial 01: resize → [-1, 1] normalize → VAE encode → multiply by 0.18215. No mask in this technique; the entire image is up for revision.
def encode_image(self, image):
    preprocess = transforms.Compose([
        transforms.Resize((512, 512)),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ])
    image = preprocess(image).unsqueeze(0).to(self.device, dtype=self.dtype)
    latent = self.autoencoder.encode(image).latent_dist.sample()
    return latent * 0.18215
Step 3: Batched Text Embeddings (generate_noise_from_prompts)
This is where the technique starts. We stack [uncond, cond, erase_1, erase_2, …, erase_N] along the batch axis so a single UNet forward pass yields all noise predictions in parallel.
text_input_embeddings = self.encode_text(text)      # [1, 77, 768]: positive prompt
uncond = self.encode_text("")                        # [1, 77, 768]: empty prompt
erasure_list = [self.encode_text(p) for p in erasure_prompt]
erasure_input = torch.cat(erasure_list, dim=0)       # [N, 77, 768]: banned concepts
# Final batch: [uncond, cond, erase_1, ..., erase_N], shape [2 + N, 77, 768]
text_embeddings = torch.cat([uncond, text_input_embeddings, erasure_input], dim=0)
noise = torch.randn(latent_shape, device=self.device, dtype=self.dtype)
Cheat sheet: the order [uncond, cond, *erase] is a contract: you'll chunk(2 + N) in the loop and access [0], [1], [2:] in exactly that order.
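Step 4: Setup Timesteps & Inject Noise
Strength-based slicing as in Tutorial 01, but written from the end of the schedule: keep the last int(num_inference_steps * strength) timesteps. Keep strength above 0 here, because a count of 0 turns the slice into timesteps[-0:], which selects the whole schedule instead of nothing. unsqueeze(0) makes the start timestep a 1-D tensor so add_noise doesn't trip on a 0-D scalar with DPM++.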
self.scheduler.set_timesteps(num_inference_steps, device=self.device)
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
timesteps = self.scheduler.timesteps[-init_timestep:]
start_timestep = timesteps[0].unsqueeze(0)  # 1-D tensor
latents = self.scheduler.add_noise(init_latent, noise, start_timestep)
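The negative-index slice above and the start-index slice from Tutorial 01 can be compared on a plain list. This is a sketch with hypothetical helper names; the list of integers stands in for scheduler.timesteps, and the second function is an alternative form, not what the script does.

```python
def tail_by_negative_index(timesteps, strength):
    """Slicing used in this step: keep the last int(n * strength) entries."""
    n = len(timesteps)
    init_timestep = min(int(n * strength), n)
    return timesteps[-init_timestep:]


def tail_by_start_index(timesteps, strength):
    """Same intent with an explicit start index (well-defined when nothing is kept)."""
    n = len(timesteps)
    init_timestep = min(int(n * strength), n)
    return timesteps[n - init_timestep:]


schedule = list(range(900, -1, -100))               # 10 entries: 900, 800, ..., 0

half_a = tail_by_negative_index(schedule, 0.5)
half_b = tail_by_start_index(schedule, 0.5)
zero_a = tail_by_negative_index(schedule, 0.0)      # -0 == 0, so this is the whole list
zero_b = tail_by_start_index(schedule, 0.0)         # empty list
```

For any strength that keeps at least one step the two forms agree; they differ only in the degenerate strength=0 case.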
Step 5: The Core Hack: Multi-Negative Guidance Loop
Every step the UNet runs on a batched input of 2 + N copies of the same latent, each paired with a different text context. We split predictions into uncond, cond, and erase_1…N, then add the prompt direction and subtract each erasure direction.
total_chunks = 2 + len(erasure_prompt)
for t in timesteps:
    # (A) Broadcast latents to the (2+N) batch so they pair with each text context
    latent_model_input = torch.cat([latents] * total_chunks)
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
    # (B) ONE UNet call: all (2+N) noise predictions in parallel
    noise_pred = self.unet(latent_model_input, t,
                           encoder_hidden_states=text_embeddings).sample
    # (C) Split predictions in the SAME order they were stacked
    all_preds = noise_pred.chunk(total_chunks)
    noise_pred_uncond = all_preds[0]
    noise_pred_cond = all_preds[1]
    noise_pred_erase = all_preds[2:]  # tuple of N tensors
    # (D) Standard CFG: pull TOWARD the prompt
    guided_noise = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
    # (E) Concept Erasure: push AWAY from each banned concept
    for noise_pred_e in noise_pred_erase:
        guided_noise -= erase_scale * (noise_pred_e - noise_pred_uncond)
    # (F) Step down to the next noise level
    latents = self.scheduler.step(guided_noise, t, latents).prev_sample
Cheat sheet: the equations that define this technique:
| Name | Equation |
|---|---|
| Standard CFG | noise = u + s · (c − u) |
| Erasure term (per banned concept eᵢ) | noise −= wᵢ · (eᵢ − u) |
| Combined | noise = u + s·(c − u) − Σᵢ wᵢ·(eᵢ − u) |
Geometric intuition. Each (x − u) is a direction vector in noise-prediction space pointing from "neutral" to that concept. CFG adds the prompt direction; erasure subtracts each banned direction. erase_scale is the magnitude of repulsion per banned concept; the loop above uses the same erase_scale for every wᵢ.
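The combined equation can be checked on two-element vectors. A minimal sketch: multi_negative_guidance is a hypothetical helper, flat lists stand in for noise-prediction tensors, and a single erase_scale is shared by all erasures as in the loop above.

```python
def multi_negative_guidance(u, c, erasures, guidance_scale, erase_scale):
    """u + s*(c - u) - sum_i w*(e_i - u), element-wise over flat lists."""
    guided = [ui + guidance_scale * (ci - ui) for ui, ci in zip(u, c)]
    for e in erasures:
        guided = [g - erase_scale * (ei - ui) for g, ei, ui in zip(guided, e, u)]
    return guided


u = [0.0, 0.0]     # unconditional prediction ("neutral")
c = [1.0, 0.0]     # prompt pulls along the first axis
e1 = [0.0, 1.0]    # banned concept pulls along the second axis

plain_cfg = multi_negative_guidance(u, c, [], 2.0, 1.0)
with_erasure = multi_negative_guidance(u, c, [e1], 2.0, 1.0)
```

The erasure leaves the prompt axis untouched and pushes the second axis negative, which is the "subtract each banned direction" picture in numbers.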
Step 6: Decoding (decode_from_latent)
Identical contract to Tutorial 01. Note that this version squeeze(0)s the batch dim before permute(1, 2, 0), instead of permuting and indexing [0]:
def decode_from_latent(self, latent):
    image = self.autoencoder.decode(latent / 0.18215).sample           # Undo VAE scaling
    image = (image / 2 + 0.5).clamp(0, 1)                              # [-1, 1] → [0, 1]
    image = image.cpu().squeeze(0).permute(1, 2, 0).float().numpy()    # (C, H, W) → (H, W, C)
    image = (image * 255).astype("uint8")
    return PIL.Image.fromarray(image)
Step 7: Execution (main)
ce = ConceptErasure()
init_image = PIL.Image.open("scene_erasure.png").convert("RGB")
prompt = "A road at night in the forest"
erasure_prompt = ["Streetlights", "Headlights", "Tail lights", "Lamps", "Artificial lights"]
result = ce.concept_erasure(
    init_image=init_image,
    prompt=prompt,
    erasure_prompt=erasure_prompt,
    num_inference_steps=50,
    strength=0.3,        # Light denoise: preserve overall scene
    guidance_scale=7.5,  # Normal CFG strength
    erase_scale=10.0,    # Aggressive repulsion from banned concepts
)
result.save("output_concept_erased.jpg")
Knob-tuning cheat sheet (concept_erasure)
| Parameter | Range | Effect |
|---|---|---|
| strength | 0.0–1.0 | Re-denoise depth. 0.3 keeps structure intact while suppressing concepts. High strength may destroy scene composition. |
| guidance_scale | 1.0–15.0 | Strength of the positive prompt pull. |
| erase_scale | 1.0–15.0 | The new knob. Strength of the negative pull per erasure prompt. Higher = stronger erasure but more artifacts. |
| erasure_prompt | list[str] | One concept per entry. Each adds +1 to the UNet batch size at every step. |
Tutorial 03: DDIM Inversion (inversion_implemention.py)
The premise. Standard generation goes noise → image. DDIM Inversion runs the same denoising network in reverse, walking timesteps 0 → T instead of T → 0, to recover (approximately) the noise that would, when denoised, produce a given real image. That noise becomes a deterministic handle you can later re-denoise with a different prompt, which is the foundation of many real-image editing techniques.
Pipeline overview. Two loops, one UNet, two schedulers:
               ┌──────────────────────────────┐
real image ──► │ Inversion loop (0 → T)       │ ──► inverted noise x_T
               │ DDIMInverseScheduler         │
               └──────────────────────────────┘
                              │
                              ▼
               ┌──────────────────────────────┐
               │ Sampling loop (T → 0)        │ ──► reconstructed image
               │ DDIMScheduler                │
               └──────────────────────────────┘
Step 1: Initialization
Logic. Two schedulers, one UNet. DDIMInverseScheduler is built via .from_config() of the forward DDIMScheduler so they share the exact same ᾱ (alpha-bar) schedule. That shared schedule is what lets the sampling loop retrace the inversion loop step for step.
self.unet = UNet2DConditionModel.from_pretrained(self.model_name, subfolder="unet")
self.vae = AutoencoderKL.from_pretrained(self.model_name, subfolder="vae")
# 1. Standard scheduler for generation (T → 0)
self.noise_scheduler = DDIMScheduler.from_pretrained(self.model_name, subfolder="scheduler")
# 2. Inverse scheduler for inversion math (0 → T): same config = same α-schedule
self.inverse_scheduler = DDIMInverseScheduler.from_config(self.noise_scheduler.config)
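Cheat sheet: DDIM with eta = 0 is the usual choice for inversion because every step is deterministic and diffusers ships a matching DDIMInverseScheduler. Samplers that inject fresh noise at each step (the ancestral variants, for example) cannot round-trip. diffusers also provides DPMSolverMultistepInverseScheduler, but the scripts here stay with the DDIM pair.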
Step 2: Latent Encoding (get_latent_image)
Logic. Two important differences from Tutorial 01's encoder, both critical for a faithful round-trip:
init_latents = self.vae.encode(image_tensor).latent_dist.mode()   # .mode(), not .sample()
init_latents = init_latents * self.vae.config.scaling_factor      # from config, not hardcoded
Cheat sheet:
- .mode() returns the deterministic mean of the VAE's posterior: no random draw, so the round-trip is reproducible. .sample() would inject noise on encode and break inversion.
- vae.config.scaling_factor == 0.18215 for SD-v1.4, but reading from config is robust across model variants.
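The difference between mode() and sample() can be seen on a one-dimensional Gaussian. The class below is a hypothetical scalar stand-in written for this sketch; it is not the diffusers posterior class, only the same idea in miniature.

```python
import math
import random


class ToyDiagonalGaussian:
    """Scalar stand-in for the VAE posterior (illustrative, not the diffusers class)."""

    def __init__(self, mean: float, logvar: float):
        self.mean = mean
        self.std = math.exp(0.5 * logvar)

    def mode(self) -> float:
        # The mode of a Gaussian is its mean: no randomness involved.
        return self.mean

    def sample(self, rng: random.Random) -> float:
        # Reparameterised draw: mean + std * standard-normal noise.
        return self.mean + self.std * rng.gauss(0.0, 1.0)


posterior = ToyDiagonalGaussian(mean=0.25, logvar=-2.0)

mode_a, mode_b = posterior.mode(), posterior.mode()   # identical on every call
draw_a = posterior.sample(random.Random(0))           # depends on the RNG state
draw_b = posterior.sample(random.Random(1))
```

Two encodes with mode() give the same starting latent, so any round-trip error can be attributed to the inversion itself rather than to the encoder.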
Step 3: Text Embeddings (get_text_embeddings)
Logic. Just one embedding: no uncond, no CFG. Pure DDIM inversion is deterministic and runs a single text context. (Variants like Null-Text Inversion re-introduce CFG and optimize the uncond embedding; separate technique, separate tutorial.)
text_embeddings = self.text_encoder(**inputs).last_hidden_state  # [1, 77, 768]: no uncond
Step 4: The Forward Loop: Inversion (ddim_invers)
Logic. Walk timesteps 0 → T. At each step, predict noise with the UNet, then ask the inverse scheduler to push the latent one step further into noise.
self.inverse_scheduler.set_timesteps(num_inference_steps, device=self.device)
timesteps = self.inverse_scheduler.timesteps  # ascending: 0 → ~999
latents = init_latents.clone()
for idx, t in enumerate(timesteps):
    noise_pred = self.unet(latents, t, encoder_hidden_states=text_embeddings).sample
    # .step() on the INVERSE scheduler walks FORWARD in time.
    # The field is still called .prev_sample but it's now the NEXT (noisier) state.
    latents = self.inverse_scheduler.step(noise_pred, t, latents).prev_sample
Cheat sheet (leaky abstraction watch): DDIMInverseScheduler.step().prev_sample is misnamed: for the inverse scheduler it means "next-step output". The diffusers API reuses the field name; only the direction of travel reverses.
Step 5: The Reverse Loop: Sampling (ddim_sampling)
Logic. Identical control flow, but with the forward scheduler. Start from the inverted noise (or any noise of the right shape) and walk T → 0 to recover an image.
self.noise_scheduler.set_timesteps(num_inference_steps, device=self.device)
timesteps = self.noise_scheduler.timesteps  # descending: ~999 → 0
latents = inverted_latents.clone()
for idx, t in enumerate(timesteps):
    noise_pred = self.unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = self.noise_scheduler.step(noise_pred, t, latents).prev_sample
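Cheat sheet: Same UNet, same prompt, opposite scheduler. The pair (invert → sample) should reconstruct the input up to a small error: inversion evaluates the noise prediction at the current latent rather than at the next one, so the round trip is approximate, not exact. If the reconstruction is visibly off, debug your encode (.mode()?), your scheduler (DDIM-only?), or your prompt (must match the source).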
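The invert-then-sample round trip can be sketched on a single scalar. Everything here is a toy under stated assumptions: alpha_bar is a hypothetical cosine-shaped schedule, constant_eps and state_eps are made-up noise predictors standing in for the UNet, and ddim_move is the deterministic (eta = 0) DDIM update written out by hand rather than the diffusers scheduler code.

```python
import math


def alpha_bar(t: int, horizon: int = 1000) -> float:
    """Hypothetical cumulative alpha: 1 at t=0, decreasing as t grows."""
    return math.cos(0.5 * math.pi * t / horizon) ** 2


def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update between two noise levels."""
    x0_pred = (x - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps


def invert(x0, eps_fn, timesteps):
    """Ascending walk: eps is evaluated at the current (cleaner) state."""
    x = x0
    for t_cur, t_next in zip(timesteps, timesteps[1:]):
        x = ddim_move(x, eps_fn(x, t_cur), alpha_bar(t_cur), alpha_bar(t_next))
    return x


def sample(x_noisy, eps_fn, timesteps):
    """Descending walk over the same timesteps."""
    x = x_noisy
    rev = timesteps[::-1]
    for t_cur, t_prev in zip(rev, rev[1:]):
        x = ddim_move(x, eps_fn(x, t_cur), alpha_bar(t_cur), alpha_bar(t_prev))
    return x


def constant_eps(x, t):
    return 0.3              # ignores x and t, so each step is exactly reversible


def state_eps(x, t):
    return math.tanh(x)     # depends on x, so inversion is only approximate


timesteps = list(range(0, 801, 100))     # 0, 100, ..., 800
x0 = 1.5

exact_error = abs(sample(invert(x0, constant_eps, timesteps), constant_eps, timesteps) - x0)
approx_error = abs(sample(invert(x0, state_eps, timesteps), state_eps, timesteps) - x0)
```

With the constant predictor the error is only floating-point rounding; once the predictor depends on the latent, as a real UNet does, the two walks evaluate it at different states and a genuine reconstruction gap appears.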
Step 6: Decoding (vae_decoder)
Standard VAE round-trip: divide by 0.18215, decode, rescale, permute, return PIL.
def vae_decoder(self, latents):
    latents = 1 / 0.18215 * latents
    image = self.vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
    return PIL.Image.fromarray((image[0] * 255).astype("uint8"))
Step 7: Execution (main)
Invert real image → optionally snapshot intermediate latents → re-sample → save.
pipeline = InversionImplementationDDIM()
# 1. Inversion: real image → noise
inverted_noise, inversion_visuals = pipeline.ddim_invers(
    num_inference_steps=50,
    init_image="Road_in_Norway.jpg",
    prompt="a photo of a road in norway",
    visual_steps=[0, 1, 2],  # capture early-step latents for debugging
)
# 2. Sampling: noise → reconstructed image
reconstructed_image, sampling_visuals = pipeline.ddim_sampling(
    num_inference_steps=50,
    inverted_latents=inverted_noise,
    prompt="a photo of a road in norway",
    visual_steps=[0, 1, 2],
)
reconstructed_image.save("reconstructed_final.jpg")
Cheat sheet: The reconstruction quality is your inversion's report card. If the round-trip image differs visibly from the input, check (1) .mode() on encode, (2) DDIM schedulers on both sides, (3) same prompt + same step count both directions.
Editing flow
Inversion alone reconstructs. To edit, change the prompt during the sampling call:
inverted, _ = pipeline.ddim_invers (50, "road.jpg", prompt="a photo of a road in norway")
edited, _ = pipeline.ddim_sampling(50, inverted, prompt="a photo of a snowy road in norway")
Knob-tuning cheat sheet (inversion)
| Parameter | Range | Effect |
|---|---|---|
| num_inference_steps | 50–200 | More steps = more faithful round-trip. Per-step error is smaller but compounds across more steps, so there is a tradeoff. |
| prompt | str | Must describe the source image accurately. A mismatched prompt biases the inverted noise and degrades reconstruction. |
| visual_steps | list[int] | Indices to capture for debugging the inversion trajectory. |
Tutorial 04: Prompt-to-Prompt Attention Injection (prompt_to_prompt_attention.py)
The premise. Cross-attention maps inside the UNet record how strongly each image location attends to each prompt token, so they encode where each word lands in the image. If you save every cross-attention map from a source run, then overwrite them on a target run with a slightly different prompt (keeping the random seed identical), the spatial layout of the source carries over while the target prompt repaints the content within that layout.
No retraining, and no extra UNet calls inside a run; the only added cost is the source generation itself. One tensor patched at the right place in the UNet's forward pass.
Pipeline overview. Two runs, same noise seed, two different attention processors:
seed=42 → Source prompt ──► SaveCrossAttnProcessor ──► source image + saved_maps[]
                                                              │
                                                              ▼
seed=42 → Target prompt ──► InjectCrossAttnProcessor(saved_maps) ──► target image
                            (overrides cross-attn probs)
Step 0: The architectural prerequisite
What is an attention processor? Every transformer block in the diffusers UNet routes its attention through a swappable AttnProcessor. The default one does standard QKV math. By writing your own and registering it via unet.set_attn_processor(...), you get a hook inside every attention computation and can either observe Q/K/V/probs or mutate them.
Cross-attention vs Self-attention. Inside __call__:
is_cross_attention = encoder_hidden_states is not None
- Self-attention: encoder_hidden_states is None. The image attends to itself; controls texture/coherence.
- Cross-attention: encoder_hidden_states is the CLIP text embeddings. The image attends to the text; controls what is where. This is the only one P2P touches.
Step 1: The Save Processor
Logic. Run the full standard attention computation, but right after computing attention_probs, snapshot the cross-attention probability matrix into a list.
class SaveCrossAttnProcessor:
    def __init__(self):
        self.attention_maps = []
    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None, **kwargs):
        # Standard QKV math (Q from image, K/V from text for cross-attn)
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        is_cross_attention = encoder_hidden_states is not None
        if not is_cross_attention:
            encoder_hidden_states = hidden_states
        key = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
        value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))
        # Attention probabilities: shape [B·heads, seq_image, seq_text] for cross-attn
        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        # The SAVE: only cross-attention, detached copy
        if is_cross_attention:
            self.attention_maps.append(attention_probs.detach().clone())
        # Standard tail: weighted sum + output projection
        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)
        return hidden_states
Cheat sheet:
- Cross-attention probs shape: [B·heads, seq_image, seq_text] (e.g. [16, 4096, 77] for batch 2 × 8 heads, 64×64 image tokens, 77 CLIP tokens).
- .detach().clone(): detach from autograd, and clone so the saved map owns its own memory instead of aliasing a tensor the UNet may reuse.
- The order of saved maps is: outer = timestep, inner = layer-by-layer in UNet forward order. The inject processor consumes them in exactly the same order.
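The shape quoted above also tells you what each saved map costs. The sketch below is plain arithmetic with hypothetical helper names; it assumes 2 bytes per element (float16) and covers only a single map at the 64×64 resolution from the example.

```python
def cross_attn_map_shape(batch, heads, latent_side, text_tokens=77):
    """Shape [batch * heads, image_tokens, text_tokens] of one cross-attention map."""
    return (batch * heads, latent_side * latent_side, text_tokens)


def map_bytes(shape, bytes_per_element=2):
    """Storage for one saved map; 2 bytes per element corresponds to float16."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


shape_64 = cross_attn_map_shape(batch=2, heads=8, latent_side=64)
bytes_64 = map_bytes(shape_64)          # roughly 10 MB for this single map
```

Because one map is appended per cross-attention call per denoising step, the saved list grows quickly; cross-attention layers at lower resolutions have fewer image tokens and cost proportionally less.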
Step 2: The Inject Processor
Logic. Identical QKV computation, but right before the weighted sum, replace the freshly computed attention_probs with the corresponding saved map.
class InjectCrossAttnProcessor:
    def __init__(self, saved_maps, injection_ratio=0.8):
        self.saved_maps = saved_maps
        self.injection_ratio = injection_ratio
        self.step = 0
    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None, **kwargs):
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        is_cross_attention = encoder_hidden_states is not None
        if not is_cross_attention:
            encoder_hidden_states = hidden_states
        key = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
        value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))
        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        # The OVERRIDE: swap target's probs for the source's saved probs
        if is_cross_attention:
            if self.step < len(self.saved_maps):
                attention_probs = self.saved_maps[self.step]
            self.step += 1
        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)
        return hidden_states
Note on injection_ratio. In the current implementation, the field is stored but unused: every cross-attention call for which a saved map exists gets overridden. Note that self.step counts cross-attention calls, not timesteps; because the maps are stored timestep-major, the first fraction of calls corresponds to the earliest timesteps. To get the canonical P2P behavior (inject only in the first N% of steps so the model can refine texture freely at the end), change the condition to:
total = len(self.saved_maps)
if self.step < int(self.injection_ratio * total):
    attention_probs = self.saved_maps[self.step]
self.step += 1
Early-step attention controls layout; late-step attention refines texture. Limiting injection to early steps preserves the source's geometry while letting the target prompt repaint detail.
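To see which calls such a cutoff would touch, the call counter can be mapped back to (timestep, layer) pairs. This is a sketch with a hypothetical helper; layers_per_step is an illustrative number, not the real count of cross-attention layers, and the timestep-major ordering is the one described for the save processor.

```python
def injected_calls(num_steps, layers_per_step, injection_ratio):
    """Return the (step, layer) pairs that would receive a saved map.

    Assumes saved maps are ordered timestep-major: all layers of step 0,
    then all layers of step 1, and so on.
    """
    total = num_steps * layers_per_step
    cutoff = int(injection_ratio * total)
    return [divmod(call_idx, layers_per_step) for call_idx in range(cutoff)]


pairs = injected_calls(num_steps=10, layers_per_step=3, injection_ratio=0.8)
last_injected_step = pairs[-1][0]       # later steps run with the target's own attention
```

With 10 steps and a ratio of 0.8, steps 0 through 7 are injected in full and the final two steps are left free, which matches the "layout early, texture late" intent.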
Step 3: Execution (main)
Logic. Two pipeline calls, same locked seed. Between them, swap the processor on the UNet.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
prompt_source = "A driving dashcam view of a sunny road in Norway"
prompt_target = "A driving dashcam view of a snowy road in Norway"
# --- Source run ---
generator = torch.manual_seed(42)                  # LOCK SEED
save_processor = SaveCrossAttnProcessor()
pipe.unet.set_attn_processor(save_processor)       # install the SAVE hook
source_image = pipe(prompt_source, generator=generator, num_inference_steps=50).images[0]
# --- Target run ---
generator = torch.manual_seed(42)                  # SAME SEED → same initial noise
inject_processor = InjectCrossAttnProcessor(saved_maps=save_processor.attention_maps)
pipe.unet.set_attn_processor(inject_processor)     # swap to INJECT hook
target_image = pipe(prompt_target, generator=generator, num_inference_steps=50).images[0]
Why the seed lock matters. P2P relies on the initial latent noise being identical between runs. Different noise → different geometry from step 1 → the saved cross-attention maps no longer correspond to anything in the target run's spatial layout.
Why the target prompt should be a minimal edit. P2P only carries over spatial layout. If prompt_target is structurally very different from prompt_source (changing nouns, verbs, and composition at once), the injected maps will fight the target prompt and you'll get artifacts. Word-level swaps and adjective changes → clean results.
Knob-tuning cheat sheet (prompt-to-prompt)
| Parameter | Range | Effect |
|---|---|---|
| seed | int | Must match between source and target runs. Different seed = broken layout transfer. |
| injection_ratio | 0.0–1.0 | Fraction of steps to inject. Lower = looser layout, more target-texture freedom. Currently dead code; patch as shown above. |
| prompt_target | str | Should be a minimal edit of prompt_source (one or two word swaps). |
| num_inference_steps | 20–50 | Must match between runs so saved_maps indexing lines up. |
Where to extend this
- Word-level swap maps: instead of dumping every map, weight specific source-word columns ("sunny") onto specific target-word columns ("snowy") of the probs.
- Map reweighting: scale specific text-token columns up or down to amplify/suppress concepts without swapping prompts.
- Layer-selective injection: only inject at certain UNet resolutions (low-res down-blocks for global layout, high-res up-blocks for detail).
Comparison: at a glance
| | Blended Diffusion | Concept Erasure | DDIM Inversion | Prompt-to-Prompt |
|---|---|---|---|---|
| Hack lives in | Latents | Noise prediction | Scheduler direction | UNet attention |
| Constraint domain | Spatial (mask) | Semantic (text vectors) | Temporal (time-reversal) | Architectural (attention maps) |
| Batch size during loop | 2 (uncond + cond) | 2 + N (+ N erase) | 1 (no CFG) | 2 (uncond + cond), but twice |
| Extra forward passes | 0 | 0 (batched) | 0 | +1 full generation (source run) |
| What it enables | Localized regional edits | Suppressing hallucinated concepts | Real-image editing & reconstruction | Semantic edits preserving layout |
| Key equation / mechanic | latent = mask·fg + (1−mask)·bg | noise = u + s·(c−u) − Σᵢ wᵢ·(eᵢ−u) | x_{t+1} = inverse_step(x_t, ε_θ) | attention_probs ← saved_maps[step] |
| Determinism | Stochastic (noise sample) | Stochastic | Deterministic | Stochastic, but seed-locked |
Requirements
- Python ≥ 3.10
- torch, diffusers, transformers, torchvision, pillow, numpy
- CUDA GPU recommended (CPU works in float32 but is slow)
Why this repo exists
Most tutorials wrap StableDiffusionPipeline and call it as a black box. This repo does the opposite: the scripts rebuild each capability from its raw building blocks so that the loop, the scheduler, the CFG math, and the VAE contract are all visible and editable (Tutorial 04 keeps the pipeline and hooks its UNet instead). If you can read these scripts end-to-end, you are well placed to modify other diffusion pipelines.
License
Educational reference. Model weights (CompVis/stable-diffusion-v1-4, openai/clip-vit-large-patch14) follow their respective licenses on the Hugging Face Hub.