Classifier-free guidance resolution weighting

#51
by snatchysquid - opened

In section 3.4, the ControlNet paper talks about CFG-RW, quoting:

In challenging cases, e.g., when no prompts are given, adding it to both ϵ_uc and ϵ_c will completely remove CFG guidance (Figure 5b);
using only ϵ_c will make the guidance very strong (Figure 5c).
Our solution is to first add the conditioning image to ϵ_c and then multiply a weight w_i to each connection between Stable Diffusion and ControlNet according to the resolution of each block w_i = 64/h_i, where h_i is the size of the i-th block, e.g., h_1 = 8, h_2 = 16, ..., h_13 = 64

I don't quite understand what this means or where it is implemented in the code in the GitHub repository. My questions, therefore, are as follows:

  1. Do I understand correctly that the model is trained without any weighting, and that at inference we use unconditional SD without ControlNet for ϵ_uc, while for ϵ_c we use ControlNet but multiply each ControlNet output by w_i before adding it to the skip connection (meaning SD_Layer_i_final_output = SD_Layer_i_output + (w_i * ControlNet_Layer_i_output))?
  2. If so, what is the logic and motivation for doing that? It isn't obvious to me that this is what we want to do.
  3. Finally, I'd like to know where this is implemented in the code.
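
To make question 1 concrete, here is a rough sketch of how I read the quoted passage. It is only my interpretation, not code from the repository: the function names (`resolution_weight`, `add_weighted_control`, `cfg_combine`) are made up for illustration, the feature maps are stood in for by plain lists of floats, and I am assuming the usual CFG combination ϵ = ϵ_uc + scale * (ϵ_c - ϵ_uc).

```python
# Hypothetical sketch of my reading of CFG-RW; none of these names come from
# the ControlNet repository.

def resolution_weight(h, base=64):
    """Weight w_i = 64 / h_i for a block whose size is h_i (per the quote)."""
    return base / h


def add_weighted_control(sd_out, control_out, h):
    """SD_Layer_i_final_output = SD_Layer_i_output + w_i * ControlNet_Layer_i_output.

    sd_out and control_out are flat lists of floats standing in for the
    feature maps of one block; h is that block's size h_i.
    """
    w = resolution_weight(h)
    return [s + w * c for s, c in zip(sd_out, control_out)]


def cfg_combine(eps_uc, eps_c, scale):
    """Standard classifier-free guidance (assumed, not taken from the repo).

    eps_uc would come from plain SD without ControlNet, and eps_c from SD with
    the weighted ControlNet connections added.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uc, eps_c)]


if __name__ == "__main__":
    # Only the sizes the paper states explicitly: h_1 = 8, h_2 = 16, h_13 = 64.
    for h in (8, 16, 64):
        print(f"h = {h:2d} -> w = {resolution_weight(h)}")
```

Under this reading, the smallest block (h = 8) has its ControlNet contribution scaled by 8 and the largest (h = 64) by 1, which is part of what I find surprising in question 2.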
