Flash attention optimization for significant speedup. - old title: Optimization tips to maximize generation speed?
EDIT: For a big boost in generation speed:
- update KJ nodes
- install flash-attn if you don't already have it ( https://mjunya.com/flash-attention-prebuild-wheels/ )
- use "Patch Flash Attention KJ"
I am seeing about a 50% speedup on a 5090.
Hiya,
The model is quite demanding: even on a 5090, generating a ~2 Mpix image takes over a minute.
I've tried the KJ Sage attention patcher and torch.compile, but neither seems to work or have any effect. Could be user error.
EasyCache works, but it decreases quality by smoothing out details, which are what this model really shines at.
Any good tips to make it run faster?
Thanks for supporting this model, it's awesome!
Oh, I just noticed the new Patch Flash Attention KJ node and gave it a try.
Down to ~1.7s/it from ~2.6s/it on a 2048x1088 image, using fp8. Holy moly!
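For anyone comparing their own numbers, here is a quick sketch of how the two s/it figures above translate into a speedup. The helper name `speedup_summary` is made up for illustration; only the 2.6 and 1.7 values come from the post.

```python
def speedup_summary(old_s_per_it: float, new_s_per_it: float) -> dict:
    """Compare two per-iteration timings given in seconds per iteration."""
    return {
        # How many times more iterations fit in the same wall-clock time.
        "throughput_gain": old_s_per_it / new_s_per_it,
        # Fraction of wall-clock time saved on each iteration.
        "time_saved": 1.0 - new_s_per_it / old_s_per_it,
    }


summary = speedup_summary(2.6, 1.7)
print(f"throughput: {summary['throughput_gain']:.2f}x")       # throughput: 1.53x
print(f"time saved per step: {summary['time_saved']:.0%}")    # time saved per step: 35%
```

So "about 50% faster" and "about a third less time per step" describe the same measurement, depending on which way you divide.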
Yeah, I figured flash attention actually gives a considerable speed boost over sdpa:
torch 2.12.0+cu132 | NVIDIA GeForce RTX 4090 | flash_attn: True
----------------------------------------------------------------------------------------------------
B=1 H=18 L= 6385 D=256 | mem_eff= 9.007ms math= 46.183ms flash= 5.221ms | mem_eff/flash= 1.73x
B=1 H=18 L= 6385 D=128 | mem_eff= 3.852ms math= 34.916ms flash= 2.626ms | mem_eff/flash= 1.47x
B=1 H=18 L= 6385 D= 64 | mem_eff= 1.564ms math= 33.437ms flash= 1.356ms | mem_eff/flash= 1.15x
----------------------------------------------------------------------------------------------------
B=1 H=24 L= 4096 D=128 | mem_eff= 2.015ms math= 18.854ms flash= 1.280ms | mem_eff/flash= 1.57x
B=1 H=24 L= 4096 D= 64 | mem_eff= 0.829ms math= 17.414ms flash= 0.662ms | mem_eff/flash= 1.25x
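A small sketch that recomputes the mem_eff/flash ratios from the table, to make the trend easier to see: at the same sequence length, the flash advantage grows with head dim. The timings are copied from the log above; the function name `flash_advantage` is just an illustrative choice.

```python
# (heads, seq_len, head_dim, mem_eff_ms, flash_ms), copied from the table above
rows = [
    (18, 6385, 256, 9.007, 5.221),
    (18, 6385, 128, 3.852, 2.626),
    (18, 6385, 64, 1.564, 1.356),
    (24, 4096, 128, 2.015, 1.280),
    (24, 4096, 64, 0.829, 0.662),
]


def flash_advantage(rows):
    """Return {(heads, seq_len, head_dim): mem_eff time / flash time}."""
    return {(h, l, d): mem / flash for h, l, d, mem, flash in rows}


ratios = flash_advantage(rows)
for (h, l, d), r in ratios.items():
    print(f"H={h} L={l} D={d}: {r:.2f}x")
```

These are single measurements on one 4090, so treat the exact ratios as indicative rather than general.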
Lots of easy install wheels available here: https://mjunya.com/flash-attention-prebuild-wheels/
SageAttention can't work because the model uses head dim 256; sage only supports up to 128.
Another small speed boost and peak VRAM reduction comes from chunking the FFN and running RoPE in bf16 (will have to see if this can be default), available to test via this node:
You can also pretty safely run the uncond model as nvfp4.
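To illustrate why chunking the FFN lowers peak memory without changing the result, here is a toy pure-Python sketch. It is not the node's implementation: the sizes, weights, and function names are all made up, and it does not cover the bf16 RoPE part. The idea it shows is that the FFN acts on each token independently, so slicing the sequence gives identical output while only one slice's hidden activation exists at a time.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ffn_batch(tokens, w_in, w_out, stats):
    """Toy FFN over a batch of tokens; records the largest hidden activation."""
    # The up-projected activation for the whole batch exists at once,
    # like a batched tensor op: len(tokens) x hidden_dim values.
    hidden = [[gelu(dot(row, t)) for row in w_in] for t in tokens]
    stats["peak_hidden_values"] = max(
        stats["peak_hidden_values"], len(tokens) * len(w_in)
    )
    return [[dot(row, h) for row in w_out] for h in hidden]


def ffn_chunked(tokens, w_in, w_out, chunk_size, stats):
    """Run the same FFN over slices of the sequence and concatenate."""
    out = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        out.extend(ffn_batch(chunk, w_in, w_out, stats))
    return out


rng = random.Random(0)
dim, hidden_dim, seq_len = 4, 16, 8  # toy sizes, not the real model's
w_in = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden_dim)]
w_out = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(dim)]
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]

full_stats = {"peak_hidden_values": 0}
chunk_stats = {"peak_hidden_values": 0}
full = ffn_batch(tokens, w_in, w_out, full_stats)
chunked = ffn_chunked(tokens, w_in, w_out, chunk_size=2, stats=chunk_stats)

print(full == chunked)  # True
print(full_stats["peak_hidden_values"], chunk_stats["peak_hidden_values"])  # 128 32
```

On a real GPU the trade-off is more launches of smaller kernels in exchange for the smaller intermediate tensor, so the best chunk size is something to measure rather than assume.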
EasyCache has a pretty bad quality hit, but if you update ComfyUI to the latest nightly, you can try using the CFG Override to drop cfg to 1.0 for some of the steps, so those steps are twice as fast. That didn't seem to hit the quality as much.
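A rough way to estimate what the cfg 1.0 trick buys you, assuming a step with cfg other than 1.0 costs two forward passes (cond + uncond) of equal cost and a step at cfg 1.0 costs one. The function name and the 30/10 step counts are hypothetical, picked only to show the arithmetic.

```python
def model_evals(total_steps: int, cfg1_steps: int) -> int:
    """Count forward passes when `cfg1_steps` of the steps run at cfg = 1.0.

    Assumes two passes (cond + uncond) per step otherwise, one pass at cfg = 1.0.
    """
    if not 0 <= cfg1_steps <= total_steps:
        raise ValueError("cfg1_steps must be between 0 and total_steps")
    return 2 * (total_steps - cfg1_steps) + cfg1_steps


baseline = model_evals(30, 0)    # every step uses CFG
override = model_evals(30, 10)   # 10 of the 30 steps run at cfg = 1.0
print(baseline, override)                               # 60 50
print(f"{1 - override / baseline:.0%} fewer passes")    # 17% fewer passes
```

If the uncond model runs in a cheaper format than the cond model, the two passes are not equal in cost and the real saving will be somewhat smaller than this estimate.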
