comfyui

Flash attention optimization for significant speedup. - old title: Optimization tips to maximize generation speed?

#6
by eepos - opened

EDIT: For a big boost in generation speed:

I am seeing about a 50% speedup on a 5090.


Hiya,

The model is quite demanding: even on a 5090, generating a ~2 Mpix image takes over a minute.

I've tried the KJ Sage attention patcher and torch.compile, and neither seems to work or do anything. Could be user error.

EasyCache works but decreases quality by smoothing out details, which is exactly where this model really shines.

Any good tips to make it run faster?

Thanks for supporting this model, it's awesome!

Oh, I just noticed the new Patch Flash Attention KJ node and gave it a try.

Down to ~1.7s/it from ~2.6s/it on a 2048x1088 image, using fp8. Holy moly!
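For anyone comparing their own numbers, here is a quick sketch of how the two s/it figures above translate into a speedup. The function names are just illustrative; only the 2.6 and 1.7 values come from this post.

```python
def throughput_gain(old_s_per_it, new_s_per_it):
    """How many times more iterations per second the new setup delivers."""
    return old_s_per_it / new_s_per_it


def time_saved(old_s_per_it, new_s_per_it):
    """Fraction of wall-clock time saved per iteration."""
    return 1 - new_s_per_it / old_s_per_it


old, new = 2.6, 1.7  # s/it before and after, as reported above
print(f"{throughput_gain(old, new):.2f}x throughput")        # 1.53x throughput
print(f"{time_saved(old, new):.0%} less time per iteration")  # 35% less time per iteration
```

So "about 50% faster" and "about a third less time per step" describe the same measurement.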

Yeah, I figured flash attention actually gives a considerable speed boost over SDPA:

torch 2.12.0+cu132 | NVIDIA GeForce RTX 4090 | flash_attn: True
----------------------------------------------------------------------------------------------------
B=1 H=18 L= 6385 D=256 | mem_eff=  9.007ms  math=  46.183ms  flash=  5.221ms  | mem_eff/flash= 1.73x
B=1 H=18 L= 6385 D=128 | mem_eff=  3.852ms  math=  34.916ms  flash=  2.626ms  | mem_eff/flash= 1.47x
B=1 H=18 L= 6385 D= 64 | mem_eff=  1.564ms  math=  33.437ms  flash=  1.356ms  | mem_eff/flash= 1.15x
----------------------------------------------------------------------------------------------------
B=1 H=24 L= 4096 D=128 | mem_eff=  2.015ms  math=  18.854ms  flash=  1.280ms  | mem_eff/flash= 1.57x
B=1 H=24 L= 4096 D= 64 | mem_eff=  0.829ms  math=  17.414ms  flash=  0.662ms  | mem_eff/flash= 1.25x
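
The script behind the table isn't shown, so the following is only a rough sketch of how a similar comparison could be run. It assumes the columns map to PyTorch's own SDPA backends selected through `torch.nn.attention.sdpa_kernel`, and it assumes bf16 inputs; the numbers above may instead come from the separate flash_attn package. `ratio` and `time_backend` are illustrative names.

```python
import time

try:
    import torch
    import torch.nn.functional as F
    from torch.nn.attention import SDPBackend, sdpa_kernel
except ImportError:
    torch = None  # PyTorch missing or too old for torch.nn.attention


def ratio(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def time_backend(backend, B, H, L, D, iters=20):
    """Mean forward time in ms for one SDPA backend, or None if it cannot run."""
    # bf16 on CUDA is an assumption; the original dtype is not stated.
    q, k, v = (
        torch.randn(B, H, L, D, device="cuda", dtype=torch.bfloat16)
        for _ in range(3)
    )
    try:
        with sdpa_kernel(backend):
            for _ in range(3):  # warm-up so one-time setup is not timed
                F.scaled_dot_product_attention(q, k, v)
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iters):
                F.scaled_dot_product_attention(q, k, v)
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    except RuntimeError:
        return None  # backend does not support this shape/dtype on this GPU
    return (time.perf_counter() - start) * 1000 / iters


if __name__ == "__main__" and torch is not None and torch.cuda.is_available():
    for D in (256, 128, 64):
        mem_eff = time_backend(SDPBackend.EFFICIENT_ATTENTION, 1, 18, 6385, D)
        flash = time_backend(SDPBackend.FLASH_ATTENTION, 1, 18, 6385, D)
        if mem_eff and flash:
            print(f"D={D:3d} mem_eff={mem_eff:.3f}ms flash={flash:.3f}ms "
                  f"mem_eff/flash={ratio(mem_eff, flash):.2f}x")
```

The last column of the table is simply `mem_eff / flash`, so a value of 1.73x means the flash kernel finished the same attention call in about 58% of the time.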

Lots of easy install wheels available here: https://mjunya.com/flash-attention-prebuild-wheels/

SageAttention can't work because the model uses head dim 256; Sage only supports up to 128.

Another small speed boost and peak VRAM reduction comes from chunking the FFN and running RoPE in bf16 (I'll have to see whether this can become the default). It's available to test via this node:

[image: screenshot of the node]
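
To illustrate why chunking the FFN lowers peak memory without changing the result, here is a toy sketch in plain Python. It is not the node's implementation: the ReLU activation, the tiny weight matrices and all function names are assumptions made for illustration, and a real model would use tensors. The point is that the FFN acts on each token independently, so processing tokens in slices gives identical output while only one slice's hidden activation is alive at a time.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(tokens, W1, W2):
    """Token-wise FFN: up-project, ReLU (stand-in activation), down-project."""
    # The whole hidden activation (len(tokens) x hidden_dim) is alive here at once.
    hidden = [[max(0.0, a) for a in matvec(W1, x)] for x in tokens]
    return [matvec(W2, h) for h in hidden]


def ffn_chunked(tokens, W1, W2, chunk_size):
    """Same FFN, applied to chunk_size tokens at a time."""
    out = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(ffn(tokens[start:start + chunk_size], W1, W2))
    return out


def peak_hidden_elements(n_tokens, hidden_dim, chunk_size=None):
    """Largest hidden activation (in elements) that exists at any one time."""
    rows = n_tokens if chunk_size is None else min(n_tokens, chunk_size)
    return rows * hidden_dim


W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0], [0.0, 1.0]]  # toy 2 -> 4 projection
W2 = [[1.0, 0.0, -1.0, 2.0], [0.5, 0.5, 0.5, 0.5]]       # toy 4 -> 2 projection
tokens = [[float(i), float(3 - i)] for i in range(7)]

assert ffn_chunked(tokens, W1, W2, chunk_size=3) == ffn(tokens, W1, W2)
print(peak_hidden_elements(7, 4), peak_hidden_elements(7, 4, chunk_size=3))  # 28 12
```

The trade-off is more, smaller matmuls per layer, which is presumably why the gain is described as small.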

You can also pretty safely run the uncond model as nvfp4.

EasyCache has a pretty bad quality hit, but if you update ComfyUI to the latest nightly, you can try using the CFG Override to drop cfg to 1.0 so that those steps run twice as fast. That didn't seem to hit the quality as much.
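
A small sketch of why cfg 1.0 steps cost half as much. With the standard classifier-free guidance formula, a scale of 1.0 reduces the output to the conditional prediction alone, so the unconditional pass can be skipped. The step counts below are hypothetical, and how the CFG Override node schedules the switch is not covered here.

```python
def cfg_combine(cond, uncond, cfg):
    """Standard classifier-free guidance mix of two model predictions."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


def model_evals(total_steps, guided_steps):
    """Model forward passes needed when only guided_steps use cfg > 1.

    Guided steps need a cond and an uncond pass (2 evals);
    steps at cfg == 1.0 only need the cond pass (1 eval).
    """
    return 2 * guided_steps + (total_steps - guided_steps)


# At cfg == 1.0 the uncond prediction cancels out entirely.
assert cfg_combine([1.0, 2.0], [0.5, 0.0], 1.0) == [1.0, 2.0]

# Hypothetical schedule: 30 steps, guidance kept only for the first 15.
print(model_evals(30, 30))  # 60
print(model_evals(30, 15))  # 45
```

In this made-up schedule the run needs 45 forward passes instead of 60, i.e. 25% fewer.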

