can you add image to image and image variation too please to gradio app?

#1
by MonsterMMORPG - opened

can you add image to image and image variation too please to gradio app?

I'm trying to do this in a private clone of the app. It's a little frustrating. There is a clear example in notebook form at https://github.com/Stability-AI/StableCascade under inference/text_to_image.ipynb, with image-variation and image-to-image examples that show how to do it using torch and tqdm along with other code from the repo.

The difference between text-to-image and image-to-image is only in the setup, as shown in the diff below. You pick a noise level and use extras.gdf.diffuse to produce a partially noised latent from the input image at that level. The noise level also feeds into the sampling configuration: it scales the number of timesteps, sets the starting timestep, and x_init is set to the noised latent instead of a pure-noise image.

What's frustrating is trying to figure out how to incorporate this into the Gradio + Diffusers framework. I have a feeling it's trivially easy for someone who knows their way around the framework. I don't. At least not yet.
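For what it's worth, the Gradio wrapper itself seems like the easy half. A minimal sketch might look like the following; run_image_to_image is just a hypothetical placeholder for whatever ends up driving the sampling code in the diff further down:

import gradio as gr

def run_image_to_image(image, caption, noise_level):
    # Hypothetical placeholder: encode `image` to latents, noise them to
    # `noise_level`, then run the StableCascade sampling as in the diff below.
    raise NotImplementedError

demo = gr.Interface(
    fn=run_image_to_image,
    inputs=[
        gr.Image(type="pil", label="Input image"),
        gr.Textbox(label="Caption"),
        gr.Slider(0.0, 1.0, value=0.8, label="Noise level"),
    ],
    outputs=gr.Image(label="Result"),
)

if __name__ == "__main__":
    demo.launch()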

Anyway, if I get it working I'll happily share the code - though I can't afford to provide free GPU time to all and sundry (unless HF were to grant me access to Zero, hint, hint).

diff t2i.py i2i.py
> noise_level = 0.8
> effnet_latents = core.encode_latents(batch, models, extras)
> t = torch.ones(effnet_latents.size(0), device=device) * noise_level
> noised = extras.gdf.diffuse(effnet_latents, t=t)[0]

10,11c13,15
< extras.sampling_configs['timesteps'] = 20
< extras.sampling_configs['t_start'] = 1.0
---
> extras.sampling_configs['timesteps'] = int(20 * noise_level)
> extras.sampling_configs['t_start'] = noise_level
> extras.sampling_configs['x_init'] = noised
20c24
< batch = {'captions': [caption] * batch_size}
---
> batch['captions'] = [caption] * batch_size
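
Putting the pieces from that diff together, the image-to-image setup from the notebook is roughly the sketch below. It assumes core, models, extras, device, caption, batch_size, and a batch dict containing the input image have already been set up as in inference/text_to_image.ipynb:

import torch

noise_level = 0.8

# Encode the input image to EfficientNet latents, then partially noise them
# to the chosen noise level using the GDF scheduler.
effnet_latents = core.encode_latents(batch, models, extras)
t = torch.ones(effnet_latents.size(0), device=device) * noise_level
noised = extras.gdf.diffuse(effnet_latents, t=t)[0]

# Sampling starts from the noised latent rather than pure noise, and only
# runs the remaining fraction of the schedule.
extras.sampling_configs['timesteps'] = int(20 * noise_level)
extras.sampling_configs['t_start'] = noise_level
extras.sampling_configs['x_init'] = noised

# The caption is attached to the existing batch (which already holds the
# input image) instead of building a fresh batch from scratch.
batch['captions'] = [caption] * batch_size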
