Flash attention optimization for significant speedup. - old title: Optimization tips to maximize generation speed?

by eepos - opened about 8 hours ago

Discussion

eepos

about 8 hours ago • edited about 6 hours ago

EDIT: For a big boost in generation speed:

1. Update KJ nodes.
2. Install flash-attn if you don't already have it (https://mjunya.com/flash-attention-prebuild-wheels/).
3. Use the "Patch Flash Attention KJ" node.

I am seeing about a 50% speedup on a 5090.

Hiya,

The model is quite demanding: even on a 5090, generating a ~2 Mpix image takes over a minute.

I've tried the KJ Sage attention patcher and torch.compile, but neither seems to do anything. Could be user error.

EasyCache works, but it reduces quality by smoothing out fine details, which is exactly where this model shines.

Any good tips to make it run faster?

Thanks for supporting this model, it's awesome!

eepos

about 8 hours ago

Oh, I just noticed the new Patch Flash Attention KJ node and gave it a try.

Down to ~1.7s/it from ~2.6s/it on a 2048x1088 image, using fp8. Holy moly!
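For reference, here is how those s/it figures convert into a speedup. This is only a small arithmetic sketch using the two numbers above; the function name is made up for illustration.

```python
def speedup_from_s_per_it(before: float, after: float) -> tuple[float, float]:
    """Return (throughput gain in %, time saved per iteration in %)."""
    throughput_gain = (before / after - 1.0) * 100.0
    time_saved = (1.0 - after / before) * 100.0
    return throughput_gain, time_saved


# Numbers from the post above: ~2.6 s/it before, ~1.7 s/it after the patch.
gain, saved = speedup_from_s_per_it(before=2.6, after=1.7)
print(f"{gain:.0f}% more iterations per second, {saved:.0f}% less time per iteration")
```

So "about 50% faster" and "about a third less time per step" describe the same measurement.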

Kijai

Comfy Org org about 7 hours ago • edited about 6 hours ago

Yeah, I figured out that flash attention actually gives a considerable speed boost over sdpa:

torch 2.12.0+cu132 | NVIDIA GeForce RTX 4090 | flash_attn: True
----------------------------------------------------------------------------------------------------
B=1 H=18 L= 6385 D=256 | mem_eff=  9.007ms  math=  46.183ms  flash=  5.221ms  | mem_eff/flash= 1.73x
B=1 H=18 L= 6385 D=128 | mem_eff=  3.852ms  math=  34.916ms  flash=  2.626ms  | mem_eff/flash= 1.47x
B=1 H=18 L= 6385 D= 64 | mem_eff=  1.564ms  math=  33.437ms  flash=  1.356ms  | mem_eff/flash= 1.15x
----------------------------------------------------------------------------------------------------
B=1 H=24 L= 4096 D=128 | mem_eff=  2.015ms  math=  18.854ms  flash=  1.280ms  | mem_eff/flash= 1.57x
B=1 H=24 L= 4096 D= 64 | mem_eff=  0.829ms  math=  17.414ms  flash=  0.662ms  | mem_eff/flash= 1.25x
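The last column is simply mem_eff divided by flash. A quick sketch that recomputes it from timings copied out of the table above, to make the trend across head dims easier to see:

```python
# (heads, seq_len, head_dim, mem_eff_ms, flash_ms), copied from the table above
rows = [
    (18, 6385, 256, 9.007, 5.221),
    (18, 6385, 128, 3.852, 2.626),
    (18, 6385, 64, 1.564, 1.356),
    (24, 4096, 128, 2.015, 1.280),
    (24, 4096, 64, 0.829, 0.662),
]

# Ratio > 1 means flash is faster than the memory-efficient sdpa backend.
ratios = {(h, l, d): mem_eff / flash for h, l, d, mem_eff, flash in rows}

for (h, l, d), ratio in ratios.items():
    print(f"H={h} L={l} D={d}: mem_eff/flash = {ratio:.2f}x")
```

In these measurements the advantage of flash grows with head dim, and it is largest at D=256.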

Lots of easy install wheels available here: https://mjunya.com/flash-attention-prebuild-wheels/

SageAttention can't work here because the model uses head dim 256; Sage only supports up to 128.
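The head dim constraint can be written as a simple check. This is a hypothetical helper, not code from KJ nodes or ComfyUI; the backend names and the preference order are assumptions, and only the 128 limit comes from the post above.

```python
SAGE_MAX_HEAD_DIM = 128  # limit stated in the post above


def pick_attention_backend(head_dim: int, flash_available: bool) -> str:
    """Hypothetical helper: pick a backend name based on the head dimension."""
    if head_dim <= SAGE_MAX_HEAD_DIM:
        return "sage"   # assumed preference when the head dim fits
    if flash_available:
        return "flash"  # e.g. flash-attn installed and patched in
    return "sdpa"       # fall back to PyTorch's built-in attention


# This model uses head dim 256, so Sage is ruled out.
print(pick_attention_backend(256, flash_available=True))
```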

Another small boost, plus a reduction in peak VRAM, comes from chunking the FFN and running RoPE in bf16 (I'll have to see whether this can become the default). It's available to test via this node:

You can also pretty safely run the uncond model as nvfp4.
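To illustrate the FFN chunking idea mentioned above: a feed-forward layer treats every token independently, so the sequence can be processed in slices without changing the result, and only one slice's hidden activations exist at a time. The sketch below is plain Python with made-up toy weights and a ReLU activation (both assumptions; the real model's FFN and the node's implementation will differ).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ffn(tokens, w_in, w_out):
    # Materializes hidden activations for every token passed in. This
    # (num_tokens x hidden_dim) intermediate is what drives peak memory.
    hidden = [[max(0.0, dot(x, row)) for row in w_in] for x in tokens]
    return [[dot(h, row) for row in w_out] for h in hidden]


def chunked_ffn(tokens, w_in, w_out, chunk_size):
    # Same computation, but only chunk_size tokens' hidden activations
    # are alive at any one time.
    out = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(ffn(tokens[start:start + chunk_size], w_in, w_out))
    return out


# Toy sizes and deterministic toy weights, purely for illustration.
dim, hidden_dim, seq_len = 4, 16, 10
w_in = [[math.sin(7 * i + j) for j in range(dim)] for i in range(hidden_dim)]
w_out = [[math.cos(3 * i + j) for j in range(hidden_dim)] for i in range(dim)]
tokens = [[math.cos(5 * t + j) for j in range(dim)] for t in range(seq_len)]

full = ffn(tokens, w_in, w_out)
chunked = chunked_ffn(tokens, w_in, w_out, chunk_size=3)
print(full == chunked)
```

The trade-off is a few more, smaller kernel launches in exchange for a smaller peak intermediate.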

EasyCache takes a pretty bad quality hit, but if you update ComfyUI to the latest nightly, you can try using the CFG Override to drop cfg to 1.0 for some of the steps, which makes those steps twice as fast. That didn't seem to hurt quality as much.
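Why cfg 1.0 steps are twice as fast: with the usual classifier-free guidance formula, a scale of 1.0 returns the conditional prediction unchanged, so the uncond pass can be skipped for those steps. A rough sketch of the arithmetic follows; the function names and the 30/15 step split are made up for illustration and are not the CFG Override node's actual parameters.

```python
import math


def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance mix of two predictions."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def model_calls(total_steps, cfg1_steps):
    """Forward passes when cfg1_steps of the steps run at cfg 1.0.

    Steps with cfg > 1 need cond + uncond (2 passes); cfg 1.0 steps need 1.
    """
    return 2 * (total_steps - cfg1_steps) + cfg1_steps


cond = [0.5, -1.25, 2.0]
uncond = [0.25, 0.75, -1.0]

# At scale 1.0 the result is just the conditional prediction.
at_one = cfg_combine(cond, uncond, scale=1.0)
print(all(math.isclose(a, c) for a, c in zip(at_one, cond)))

# Hypothetical schedule: 30 steps, 15 of them at cfg 1.0.
baseline = model_calls(30, 0)
mixed = model_calls(30, 15)
print(f"{baseline} -> {mixed} forward passes")
```

How many steps can run at cfg 1.0 before quality suffers is something to test per workflow.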

eepos changed discussion title from Optimization tips to maximize generation speed? to Flash attention optimization for significant speedup. - old title: Optimization tips to maximize generation speed? about 6 hours ago
