

Prompt Engineering

For the results in the paper and webpage, we get the source prompt using the BLIP model embedded in the Stable Diffusion WebUI.

Click "Interrogate CLIP" to get a source prompt automatically, then remove the last few words if they do not describe the video.

During stylization, you may use the very simple source prompt "A photo" as a baseline if your input video is too complicated to describe in one sentence.

Validate the prompt

  • Feed the source prompt to Stable Diffusion. If the generated image is close to the input video, it is likely a good source prompt.
  • A good prompt describes every frame and most objects in the video. In particular, it names the object or attribute that we want to edit or preserve.
  • Feed the target prompt to Stable Diffusion to check the upper bound of the editing effect. A composition that the model renders plausibly tends to give better results (e.g., a "sunflower" video with a "Van Gogh" prompt works better than "sunflower" with "Monet").

FateZero hyperparameters

We give a simple analysis of the hyperparameters involved:

# Whether to directly copy the cross attention from source 
# True: directly copy, better for object replacement
# False: keep source attention, better for style
is_replace_controller: False

# Semantic layout preserving. More steps replace more cross-attention and preserve more of the semantic layout
cross_replace_steps:
    default_: 0.8

# Source background structure preserving, in [0, 1].
# e.g., 0.6 replaces the self-attention in the first 60% of the steps
self_replace_steps: 0.8

# Amplify the cross-attention of the target words; a larger value pushes the result closer to the target
# eq_params: 
#     words: ["", ""]
#     values: [10,10] 

# Target structure-divergence hyperparameters
# If you change the shape of the object, it is better to use all three lines; otherwise, they are not needed.
# Without the following three lines, all self-attention will be replaced
blend_words: [['jeep',], ["car",]] 
masked_self_attention:  True
# masked_latents: False   # Directly copy the latents, performance not so good in our case            
bend_th: [2, 2]
# bend_th controls how much of the source structure of blend_words is preserved, in [0, 1]
# default is bend_th: [2, 2], which replaces the full-resolution edit self-attention with the source self-attention
# bend_th -> [0.0, 0.0], mask -> 1: use more edit self-attention, more generated shape, less source attention
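To make the step fractions and the threshold more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not FateZero's implementation: `replaced_steps` and `threshold_mask` are hypothetical helpers, the fraction is assumed to be rounded to a whole number of leading steps, and the mask is assumed to come from an attention map normalized by its maximum and compared against the threshold.

```python
def replaced_steps(fraction, num_steps):
    """Assumed mapping: the first round(fraction * num_steps) steps use replaced attention."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return int(round(fraction * num_steps))

def threshold_mask(attention, th):
    """Assumed masking: normalize the map by its peak, then mark entries above th with 1."""
    peak = max(max(row) for row in attention)
    if peak <= 0:
        return [[0 for _ in row] for row in attention]
    return [[1 if value / peak > th else 0 for value in row] for row in attention]

# self_replace_steps = 0.6 with 50 DDIM steps covers the first 30 steps.
print(replaced_steps(0.6, 50))  # 30

# Toy 2x2 attention map for a blend word (made-up values).
toy_attention = [[0.1, 0.4], [0.8, 0.2]]
print(threshold_mask(toy_attention, 0.3))  # [[0, 1], [1, 0]]
print(threshold_mask(toy_attention, 2))    # [[0, 0], [0, 0]] : nothing exceeds 2
print(threshold_mask(toy_attention, 0.0))  # [[1, 1], [1, 1]] : mask -> 1
```

Under these assumptions a threshold of 2 can never be exceeded by a normalized value, which matches the note that the default keeps the source self-attention everywhere, while a threshold of 0.0 lets the edited self-attention through wherever the blend word has any response.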

DDIM hyperparameters

We profile the cost of editing 8 frames on an NVIDIA 3090, with fp16 enabled in the accelerator and xformers.

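| Configs | Attention location | DDIM inversion steps | CPU memory | GPU memory | Inversion time | Editing time | Quality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| basic | RAM | 50 | 100 GB | 12 GB | 60 s | 40 s | Full support |
| low cost | RAM | 10 | 15 GB | 12 GB | 10 s | 10 s | OK for style, does not work for shape |
| lower cost | DISK | 10 | 6 GB | 12 GB | 33 s | 100 s | OK for style, does not work for shape |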
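The trade-off in the profile above can be written down as a small lookup. This is only a sketch: `PROFILES` copies the reported numbers, while `pick_profile` and its selection rule (take the first profile, ordered from highest quality to cheapest, that fits the available CPU memory and supports the requested edit type) are assumptions made for illustration.

```python
# Numbers copied from the profile above (8 frames, one 3090).
PROFILES = [
    {"name": "basic", "attention": "RAM", "ddim_steps": 50, "cpu_gb": 100, "shape_edit": True},
    {"name": "low cost", "attention": "RAM", "ddim_steps": 10, "cpu_gb": 15, "shape_edit": False},
    {"name": "lower cost", "attention": "DISK", "ddim_steps": 10, "cpu_gb": 6, "shape_edit": False},
]

def pick_profile(available_cpu_gb, need_shape_edit):
    """Return the name of the first profile that fits, or None if none does."""
    for profile in PROFILES:  # ordered from highest quality to cheapest
        fits_memory = profile["cpu_gb"] <= available_cpu_gb
        supports_edit = profile["shape_edit"] or not need_shape_edit
        if fits_memory and supports_edit:
            return profile["name"]
    return None

print(pick_profile(128, True))   # basic
print(pick_profile(32, False))   # low cost
print(pick_profile(8, False))    # lower cost
print(pick_profile(32, True))    # None : shape editing needs the basic profile
```

GPU memory is left out of the rule because all three configurations report the same 12 GB.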