A little intuition on how unclip_image_interpolation works?

by mikegarts - opened

I see that it relies on UnCLIPImageInterpolationPipeline (which is similar to UnCLIPImageVariationPipeline), but is there any explanation of the 'intuition' behind this that I can look into?
Thanks :)

@mikegarts sure! Happy to clarify.

  • We generate the CLIP embeddings of the two input images using the CLIPVisionModelWithProjection. Let them be z_start and z_end.
  • We interpolate between z_start and z_end using spherical linear interpolation, a.k.a. slerp (https://en.wikipedia.org/wiki/Slerp). The number of interpolated embeddings generated equals the number of steps of the pipeline. Let's call them z_1, ..., z_N.
  • We pass the embeddings z_1, ..., z_N to the decoder, which generates the images you see in the output.
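
To make the slerp step concrete, here is a minimal sketch in plain Python. It is not the pipeline's actual code: the function names `slerp` and `interpolate` are made up for illustration, the embeddings are plain lists of floats rather than tensors, and I'm assuming the two endpoints z_start and z_end are included among the N interpolated embeddings.

```python
import math


def slerp(z_start, z_end, t):
    """Spherically interpolate between two vectors at fraction t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z_start, z_end))
    norm_start = math.sqrt(sum(a * a for a in z_start))
    norm_end = math.sqrt(sum(b * b for b in z_end))

    # Angle between the two embeddings (clamped to guard against rounding).
    cos_omega = max(-1.0, min(1.0, dot / (norm_start * norm_end)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    # Nearly parallel (or anti-parallel) vectors: fall back to a straight line.
    if abs(sin_omega) < 1e-6:
        return [(1 - t) * a + t * b for a, b in zip(z_start, z_end)]

    w_start = math.sin((1 - t) * omega) / sin_omega
    w_end = math.sin(t * omega) / sin_omega
    return [w_start * a + w_end * b for a, b in zip(z_start, z_end)]


def interpolate(z_start, z_end, steps):
    """Return `steps` embeddings z_1..z_N walking from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [slerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


# Toy 2-D "embeddings"; real CLIP image embeddings have hundreds of dimensions.
embeddings = interpolate([1.0, 0.0], [0.0, 1.0], steps=3)
```

In the real pipeline each of these interpolated embeddings would then be handed to the decoder to produce one image. Slerp is used rather than a straight linear blend because it follows the arc between the two vectors, so the intermediate embeddings keep a norm comparable to the endpoints.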

DALL-E 2 paper (see Figure 3): https://cdn.openai.com/papers/dall-e-2.pdf
See our discussion on https://github.com/huggingface/diffusers/issues/1869
Community Pipeline code: https://github.com/huggingface/diffusers/blob/main/examples/community/unclip_image_interpolation.py

It uses the image variations checkpoint because it has the image_encoder (CLIPVisionModelWithProjection) weights that I can use.

Do let me know if you have any doubts on this.

Put briefly, we are encoding the images into the latent space and walking N steps between them in the latent space. At each step we sample the latent space and generate an image to see what lies there.

@NagaSaiAbhinay Thanks a lot, after reading the discussion thread it all makes much more sense :)

Btw, I recently pushed the controlnet_img2img pipeline and played with it a little, and it produces a somewhat more 'contextual' transition between two prompts/images (in a similar way: by interpolating in the embedding space and using img2img and ControlNet to constrain the next image).

For example:
source_prompt = "a beautiful tabby cat"
dest_prompt = "an astronaut on a horse"
can yield something like that:
