Prior not working with float16 - output is all NaNs

#3
by Vargol - opened

Using your original example code I tried float16 instead of bfloat16 and was greeted with blank images.
Further checking showed that the prior output was all NaNs.
As I don't have access to a GPU that supports bfloat16, or one big enough to run the prior at float32, I'm pretty disappointed by this.
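
For anyone who wants to reproduce this, here is a minimal sketch based on the published example code (the pipeline classes and model IDs assume the diffusers Stable Cascade release); only the dtype changes:

    import torch
    from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

    dtype = torch.bfloat16  # switching this to torch.float16 is what triggers the NaNs

    prior = StableCascadePriorPipeline.from_pretrained(
        "stabilityai/stable-cascade-prior", torch_dtype=dtype
    ).to("cuda")
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        "stabilityai/stable-cascade", torch_dtype=dtype
    ).to("cuda")

    prompt = "an image of a cat"
    prior_output = prior(prompt=prompt, num_inference_steps=20)

    # With float16 this prints True, and the decoder then produces blank images.
    print(torch.isnan(prior_output.image_embeddings).any())

    image = decoder(
        image_embeddings=prior_output.image_embeddings,
        prompt=prompt,
        num_inference_steps=10,
    ).images[0]
    image.save("out.png")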

Yes, we're aware of this. We'll work to solve it but at this moment this is a known issue.

fp16 is not working; I have reported this issue.
bf16 works fine for me.

Yes, we're aware of this. We'll work to solve it but at this moment this is a known issue.

@babbleberns

Hello again. People are asking me how to run this on older GPUs such as the 1080 Ti. It has 11 GB of VRAM, so it would work great with CPU offloading, but that requires FP16.

When can you fix this? We also need FP16 for Kaggle.

Moreover, which models are loaded by default? For Stage C, the 1B or the 3.6B? For Stage B, the 700M or the 1.5B?

@MonsterMMORPG Using the Diffusers scripts, and going by the file sizes in the cache, the defaults are the bfloat16 versions of the full-sized models.
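
For what it's worth, here is a rough sketch of how the VRAM question above could be handled with CPU offloading (assuming the diffusers Stable Cascade pipelines; the dtype still has to be one the GPU supports, which is why fp16 matters for pre-Ampere cards like the 1080 Ti):

    import torch
    from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

    # bfloat16 matches the default checkpoints; swap in float16 once that path is fixed.
    prior = StableCascadePriorPipeline.from_pretrained(
        "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
    )
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
    )
    print(prior.dtype)  # confirms which precision actually got loaded

    # Keeps each sub-model on the CPU and moves it to the GPU only while it runs.
    prior.enable_model_cpu_offload()
    decoder.enable_model_cpu_offload()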
Thanks for the clarification.

I don't know if this helps, but I traced the issue to _up_decode in the unet (in pipelines/stable_cascade/modeling_stable_cascade_common.py), specifically while enumerating the block group. It doesn't happen immediately, but after several passes it outputs a tensor that is all NaNs. It seems to fail here every time:

    elif isinstance(block, AttnBlock): x = block(x, clip)

It could of course be something in an earlier step through this block that causes the issue. But it looks like an activation may be overflowing, since fp16 has a much smaller dynamic range than bf16?
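
To illustrate the range point with plain PyTorch (nothing Stable Cascade specific): float16 overflows to inf above roughly 65504, while bfloat16 keeps about the same range as float32, and a single inf inside an attention block quickly becomes NaN:

    import torch

    x = torch.tensor(70000.0)
    print(x.to(torch.float16))   # inf  (float16 tops out around 65504)
    print(x.to(torch.bfloat16))  # 70144., still finite

    inf = x.to(torch.float16)
    print(inf - inf)             # nan - this is how one overflow poisons everything downstream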

EDIT: I should mention I'm running the mps backend, so my issue may be specific to Mac devices.

Also, effnet seems to be a tensor of all NaNs earlier in the unet forward pass, even though it should default to None. I had to comment that out to get this far. That may have contributed to the blow-up as well, if it's an important step when manipulating the latents.
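
In case it helps anyone else chase this, here is a rough sketch (a hypothetical helper in plain PyTorch, not part of diffusers) of how forward hooks can be used to find which module first produces a non-finite output:

    import torch

    def report_non_finite(model):
        # Registers a forward hook on every sub-module; the first name printed
        # is where the NaN/inf originates.
        handles = []

        def make_hook(name):
            def hook(module, inputs, output):
                outputs = output if isinstance(output, (tuple, list)) else (output,)
                for t in outputs:
                    if torch.is_tensor(t) and not torch.isfinite(t).all():
                        print(f"non-finite output from {name} ({type(module).__name__})")
                        break
            return hook

        for name, module in model.named_modules():
            handles.append(module.register_forward_hook(make_hook(name)))
        return handles  # call handle.remove() on each one when finished

Attach it to the prior's denoiser before running a generation (e.g. report_non_finite(prior.prior), assuming the pipeline exposes the denoiser under that name); with fp16 the first report should point at the same AttnBlock mentioned above.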

FWIW, the model works fine at float32. The mps acceleration on Apple Silicon doesn't support bfloat16, but the model runs fine (if at half the speed and with twice the RAM usage) if you apply .to(torch.float32) before moving it onto the device.
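
For anyone else on Apple Silicon, the workaround looks roughly like this (a sketch assuming the diffusers pipelines; the bfloat16 weights are upcast to float32 before being moved to mps):

    import torch
    from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

    # Load as in the bfloat16 example, then upcast, since mps has no bfloat16 support.
    prior = StableCascadePriorPipeline.from_pretrained(
        "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
    ).to(torch.float32).to("mps")
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
    ).to(torch.float32).to("mps")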

I'm pretty sure the requirements for running the model at float32 would exceed my 8 GB GPU, so I guess I'll just wait for a patch.

We made it work with both CPU offloading and fp16.
I even published a free Kaggle notebook that runs fp16 with a Gradio interface.
