Does 24GB of graphics memory suffice for inference on this model?
#1 opened by taiao
Is the following code correct?
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel, T5TokenizerFast

# bfl_repo / bfl_repo2 are repository ids defined earlier in the script
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo2, subfolder="transformer",
                                                     torch_dtype=torch.float8_e4m3fn, revision=None)
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2",
                                                torch_dtype=torch.float16, revision=None)
tokenizer_2 = T5TokenizerFast.from_pretrained(bfl_repo, subfolder="tokenizer_2",
                                              torch_dtype=torch.float16, revision=None)
# print(datetime.datetime.now(), "Quantizing text encoder 2")
# quantize(text_encoder_2, weights=qfloat8)
# freeze(text_encoder_2)
flux_pipe = FluxPipeline.from_pretrained(bfl_repo2,
                                         text_encoder_2=text_encoder_2, tokenizer_2=tokenizer_2,
                                         transformer=transformer, token=None)
flux_pipe.enable_model_cpu_offload()
error:
RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float8_e4m3fn
Following this error message, is it correct to fix it like this?
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo2, subfolder="transformer",
                                                     torch_dtype=torch.float8_e4m3fn, revision=None)
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2",
                                                torch_dtype=torch.float8_e4m3fn, revision=None)
tokenizer_2 = T5TokenizerFast.from_pretrained(bfl_repo, subfolder="tokenizer_2",
                                              torch_dtype=torch.float8_e4m3fn, revision=None)
# print(datetime.datetime.now(), "Quantizing text encoder 2")
# quantize(text_encoder_2, weights=qfloat8)
# freeze(text_encoder_2)
flux_pipe = FluxPipeline.from_pretrained(bfl_repo2,
                                         text_encoder_2=text_encoder_2, tokenizer_2=tokenizer_2,
                                         transformer=transformer, token=None)
flux_pipe.enable_model_cpu_offload()
or
import torch
from diffusers import DiffusionPipeline

bfl_repo = "John6666/hyper-flux1-dev-fp8-flux"
flux_pipe = DiffusionPipeline.from_pretrained(bfl_repo, torch_dtype=torch.float8_e4m3fn)
flux_pipe.enable_model_cpu_offload()
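Whichever loading route ends up working, the actual generation call would look roughly like this (the prompt and step count are just placeholders, not values from this repo):

prompt = "a cat holding a sign that says hello world"  # placeholder prompt
image = flux_pipe(prompt, num_inference_steps=8, guidance_scale=3.5).images[0]
image.save("flux_out.png")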
Does 24GB of graphics memory suffice for inference on this model?
My VRAM is only 8GB, so I'm not sure since I've never tried to run it locally.
However, from what I've seen on outside forums, it seems that to save VRAM you can use NF4 or GGUF 4-bit: both accuracy and speed hold up well at roughly half the memory.
Quanto's qfloat8 is not bad though.
https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/981
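For what it's worth, the qfloat8 route your commented-out lines point at would look roughly like this with optimum-quanto: load the weights in a normal dtype first, then quantize and freeze them. This is only a sketch of the usual recipe (bfl_repo is the repo id variable from your snippet), not something I've run on this exact model:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
from optimum.quanto import freeze, qfloat8, quantize

# Load in bfloat16 first; float8_e4m3fn is only a storage dtype, which is why
# passing it as torch_dtype ends in the Half/Float8_e4m3fn matmul error above.
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer",
                                                     torch_dtype=torch.bfloat16)
quantize(transformer, weights=qfloat8)
freeze(transformer)

text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2",
                                                torch_dtype=torch.bfloat16)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

flux_pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=transformer,
                                         text_encoder_2=text_encoder_2, torch_dtype=torch.bfloat16)
flux_pipe.enable_model_cpu_offload()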
How to use NF4-quantized FLUX.1 from Diffusers in a Zero GPU Space:
https://huggingface.co/spaces/nyanko7/flux1-dev-nf4/blob/main/app.py
https://huggingface.co/spaces/nyanko7/flux1-dev-nf4
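If you want NF4 straight from Diffusers rather than through a Space, one possible route is the bitsandbytes integration in recent diffusers releases. This is only a sketch under those assumptions (recent diffusers, bitsandbytes installed, and the official black-forest-labs/FLUX.1-dev repo used as an example), and not necessarily what the Space above does:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig

repo = "black-forest-labs/FLUX.1-dev"  # example repo; gated, needs an HF token
nf4_config = DiffusersBitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_quant_type="nf4",
                                         bnb_4bit_compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(repo, subfolder="transformer",
                                                     quantization_config=nf4_config,
                                                     torch_dtype=torch.bfloat16)
flux_pipe = FluxPipeline.from_pretrained(repo, transformer=transformer,
                                         torch_dtype=torch.bfloat16)
flux_pipe.enable_model_cpu_offload()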