K_lms sampler integration

#4
by terekita - opened

I've been trying to compare gens from this codebase with gens on the discord server and have had the persistent feeling that the ones from discord have more "clarity." I noticed someone had the same question on GitHub, and also noted that the default k_lms sampler on discord behaves better under a wider range of cfg values.

The k_lms sampler is apparently available at https://github.com/crowsonkb/k-diffusion

Is there an easy way to take that sampler implementation from Katherine Crowson's GitHub and integrate it with txt2img?

I asked this same question on github and will update here if there is anything helpful.
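
From what I can tell so far, wiring it up would look roughly like the sketch below (untested - `model` stands in for the LatentDiffusion instance loaded by the stable-diffusion repo, and `c` / `uc` for the conditioning tensors from model.get_learned_conditioning()):

import torch
import k_diffusion as K

class CFGDenoiser(torch.nn.Module):
    """Classifier-free guidance wrapper around a k-diffusion denoiser."""
    def __init__(self, denoiser):
        super().__init__()
        self.denoiser = denoiser

    def forward(self, x, sigma, uncond, cond, cond_scale):
        # run the unconditional and conditional passes in one batch
        x_in = torch.cat([x] * 2)
        sigma_in = torch.cat([sigma] * 2)
        cond_in = torch.cat([uncond, cond])
        uncond_out, cond_out = self.denoiser(x_in, sigma_in, cond=cond_in).chunk(2)
        return uncond_out + (cond_out - uncond_out) * cond_scale

denoiser = K.external.CompVisDenoiser(model)   # wraps model.apply_model
sigmas = denoiser.get_sigmas(50)               # 50-step sigma schedule
x = torch.randn([4, 4, 64, 64], device="cuda") * sigmas[0]  # start from pure noise

samples = K.sampling.sample_lms(
    CFGDenoiser(denoiser), x, sigmas,
    extra_args={"uncond": uc, "cond": c, "cond_scale": 7.5},
)
images = model.decode_first_stage(samples)     # latents -> image space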

Hey @terekita ,

Thanks for the heads-up - we're looking into it :-)

  1. We'll first check how our vanilla scheduler (PNDM) compares with the one from https://github.com/CompVis/stable-diffusion
  2. We'll see with @Katherine how we can best implement her scheduler - a rough sketch of what plugging it into the pipeline could look like is below!

Also cc @valhalla and @anton-l
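
Once her sampler is exposed as a scheduler, swapping it into the pipeline could look roughly like this (a sketch only - the LMSDiscreteScheduler name and the beta defaults here are assumptions, not a released API):

from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

# hypothetical scheduler swap: class name and beta defaults are assumptions
lms = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear")
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3-diffusers",
    scheduler=lms,  # override the default PNDM scheduler
    use_auth_token=True,
)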

Doing some live debugging here:

Executing the following line of code with https://github.com/CompVis/stable-diffusion:

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms --n_samples 4 --n_iter 4

gives

"a photograph of an astronaut riding a horse":
grid-0000.png

"a photograph of the eiffel tower on the moon":
grid-0000 (2).png

"an oil painting of a futuristic forest gives":
grid-0001 (1).png

Now running the following lines of code with https://github.com/huggingface/diffusers:

from diffusers import StableDiffusionPipeline
from time import time
from PIL import Image
from einops import rearrange
import numpy as np
import os
import torch
from torch import autocast
from torchvision.utils import make_grid

torch.manual_seed(42)

#prompt = "a photograph of an astronaut riding a horse"
#prompt = "a photograph of the eiffel tower on the moon"
#prompt = "an oil painting of a futuristic forest gives"

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-3-diffusers", use_auth_token=True)  # make sure you're logged in with `huggingface-cli login`

all_images = []
num_rows = 4
num_columns = 4
for _ in range(num_rows):
    with autocast("cuda"):
        images = pipe(num_columns * [prompt], guidance_scale=7.5, output_type="np")["sample"]  # output_type="np" returns the images as NumPy arrays in [0, 1]
        all_images.append(torch.from_numpy(images))

# additionally, save as grid
grid = torch.stack(all_images, 0)
grid = rearrange(grid, 'n b h w c -> (n b) h w c')
grid = rearrange(grid, 'n h w c -> n c h w')
grid = make_grid(grid, nrow=num_rows)

# to image
grid = 255. * rearrange(grid, 'c h w -> h w c').cpu().numpy()
image = Image.fromarray(grid.astype(np.uint8))

os.makedirs("./images/diffusers", exist_ok=True)  # make sure the output directory exists
image.save(f"./images/diffusers/{'_'.join(prompt.split())}_{round(time())}.png")

gives

"a photograph of an astronaut riding a horse":
a_photograph_of_an_astronaut_riding_a_horse_1660558873 (1).png

"a photograph of the eiffel tower on the moon":
a_photograph_of_the_eiffel_tower_on_the_moon_1660559024.png

"an oil painting of a futuristic forest gives":
an_oil_painting_of_a_futuristic_forest_gives_1660559297.png


Thanks very much for looking into this! The oil painting grid from the diffusers code shows the kind of clarity I'm talking about. Overall those images have sharper contrast, whereas some of the stable-diffusion repo images from the same prompt have a somewhat washed-out appearance (compare the bottom-right image of both sets, for example). It may be that I notice this more because of the kinds of prompts I'm working with (which mostly explore painting).

It is difficult to determine whether there are real differences: the same seed gives different results between the two implementations on the same hardware, and by default Diffusers runs 59 steps (from what you see during processing) while txt2img.py runs 50.
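
For what it's worth, both the step count and the seed can be pinned explicitly on each side - a rough sketch (the pipeline arguments and txt2img.py flags below come from the respective codebases; even with matching settings the outputs still won't be identical, since the two implementations draw and consume noise differently):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3-diffusers", use_auth_token=True
).to("cuda")

# pin steps and seed on the diffusers side
generator = torch.Generator(device="cuda").manual_seed(42)
images = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
)["sample"]

# roughly equivalent flags for the CompVis script:
#   python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" \
#       --plms --ddim_steps 50 --seed 42 --scale 7.5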

Agree that it's hard to assess when the same seed returns different results. I was basing my comments on having generated thousands of images on the discord server and several hundred with the recently available weights. I could be wrong, but I'm pretty sure the difference in contrast between the two sets of images above is fairly representative of prompts involving painting. I try to compare runs with the same number of steps...

From the bot's help listing, it looks like there are a bunch of different samplers that have been implemented:

SAMPLER [k_lms] (ddim, plms, k_euler, k_euler_ancestral, k_heun, k_dpm_2, k_dpm_2_ancestral, k_lms)

It seems that the code that the bot is using has been developed further than the public code that we are using. Does anyone here know if we can get access to the bot's sampler code?
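
Guessing from the names, the k_* entries map onto sampler functions that already exist in k_diffusion.sampling (ddim and plms being the CompVis samplers), so the bot may simply be dispatching to them - something like the sketch below (the mapping is my assumption; the functions themselves are in crowsonkb/k-diffusion):

import k_diffusion as K

# assumed mapping from the bot's sampler names to k-diffusion functions
K_SAMPLERS = {
    "k_euler": K.sampling.sample_euler,
    "k_euler_ancestral": K.sampling.sample_euler_ancestral,
    "k_heun": K.sampling.sample_heun,
    "k_dpm_2": K.sampling.sample_dpm_2,
    "k_dpm_2_ancestral": K.sampling.sample_dpm_2_ancestral,
    "k_lms": K.sampling.sample_lms,
}

# each function takes (denoiser, x, sigmas, extra_args=...), so switching
# samplers would just mean calling a different entry from this table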
