
NVFP4 / MXFP4 / FP8 quantizations for faster inference

#173
by Iwaku-Real - opened

The current Anima model is only available in BF16 format, and it is more than twice as slow as Illustrious SDXL, even on my RTX 5070 Ti.
With 8-bit or 4-bit quantization, the model would not only fit into less VRAM; on hardware with native support (such as the RTX 50xx series, which supports NVFP4) it would also be hardware-accelerated, which can make generation up to 2x faster at 8-bit and up to 4x faster at 4-bit. There will always be some minor quality loss, but quantization keeps negative prompts usable, unlike the Turbo LoRA, which nullifies them.
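To put rough numbers on the VRAM side, here is a back-of-the-envelope sketch of weight storage per format. The parameter count is a hypothetical placeholder, not Anima's real size, and the helper name `weight_gib` is made up for illustration. The block-scale overheads assume the usual layouts (NVFP4: one 8-bit scale per 16 weights, MXFP4: one 8-bit scale per 32 weights) and ignore per-tensor scales, activations, and any layers left unquantized.

```python
# Effective bits stored per weight, including per-block scale overhead.
# Assumed layouts: NVFP4 = 4-bit values + one 8-bit scale per 16 weights,
# MXFP4 = 4-bit values + one 8-bit scale per 32 weights.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,               # per-tensor scale overhead is negligible
    "NVFP4": 4.0 + 8.0 / 16,  # 4.5 bits per weight
    "MXFP4": 4.0 + 8.0 / 32,  # 4.25 bits per weight
}

# Hypothetical parameter count, for illustration only (not Anima's real size).
HYPOTHETICAL_PARAMS = 2_000_000_000


def weight_gib(num_params, fmt):
    """Approximate size of the weights alone, in GiB, for a given format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:6s} {weight_gib(HYPOTHETICAL_PARAMS, fmt):6.2f} GiB")
```

Under these assumptions the 4-bit formats store weights in a bit over a quarter of the BF16 footprint, and FP8 in half of it. Actual VRAM use during generation will be higher.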

FLUX.2 has such an approach officially available: https://huggingface.co/collections/black-forest-labs/flux2
It might also be worth working with Nunchaku on this; they have their own very effective FP4 quantization method for models like FLUX, Z-Image, and Qwen-Image: https://github.com/nunchaku-ai/nunchaku

As for the FP8/MXFP8 model, I couldn't use torch.compile with it, which means it ends up slower than BF16.
To use torch.compile with FP8/MXFP8 models, use the TorchCompileModelAdvanced node from KJNodes, with the mode set to max-autotune-no-cudagraphs and dynamic set to false.
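For anyone testing outside ComfyUI, those two settings correspond to the `mode` and `dynamic` arguments of plain `torch.compile`. The following is a minimal sketch, not the KJNodes implementation: `COMPILE_KWARGS` and `compile_model` are hypothetical names, and it assumes the node passes its options through to `torch.compile` more or less unchanged.

```python
# Settings from the comment above, expressed as torch.compile keyword arguments.
COMPILE_KWARGS = {
    "mode": "max-autotune-no-cudagraphs",  # autotune kernels, skip CUDA graphs
    "dynamic": False,                      # specialize on static shapes
}


def compile_model(model):
    """Wrap a torch.nn.Module with torch.compile using the settings above.

    torch is imported lazily so this file can be read or imported
    on a machine without PyTorch installed.
    """
    import torch

    return torch.compile(model, **COMPILE_KWARGS)
```

Compilation happens on the first forward pass, so the first generation after calling `compile_model` will be noticeably slower than the following ones.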

Try the INT8 model with the INT8-Fast custom node. It seems to be the best way to boost generation speed, and it also works on Turing and Ampere GPUs.

As for NVFP4, I wasn't satisfied with its quality.
Nunchaku may be worth trying because it uses calibration, but I'm not sure ordinary users can easily do that (it takes resources, meaning money and time), and post-training means the artist tags would work differently.
