GGUF
The GGUF file format is typically used to store models for inference with GGML and supports a variety of block-wise quantization options. Diffusers supports loading checkpoints prequantized and saved in the GGUF format via from_single_file loading with Model classes. Loading GGUF checkpoints via Pipelines is currently not supported.
The following example loads the FLUX.1 DEV transformer model using the GGUF Q2_K quantization variant.
Before starting, please install gguf in your environment:

```shell
pip install -U gguf
```
Since GGUF is a single-file format, use FromSingleFileMixin.from_single_file to load the model and pass in the GGUFQuantizationConfig.
When using GGUF checkpoints, the quantized weights remain in a low-memory dtype (typically torch.uint8) and are dynamically dequantized and cast to the configured compute_dtype during each module's forward pass through the model. The GGUFQuantizationConfig allows you to set the compute_dtype.
The functions used for dynamic dequantization are based on the great work done by city96, who created the PyTorch ports of the original [numpy](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by compilade.
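To build some intuition for what block-wise dequantization does, here is a minimal, simplified sketch in the style of the Q8_0 format (blocks of 32 int8 values that share one scale). The names QK8_0 and dequantize_q8_0 are illustrative only; this is not the actual GGUF memory layout or the kernels used by Diffusers.

```python
import torch

# Simplified illustration of Q8_0-style block-wise (de)quantization:
# each block of 32 int8 values shares a single float16 scale.
QK8_0 = 32

def dequantize_q8_0(scales: torch.Tensor, quants: torch.Tensor) -> torch.Tensor:
    """scales: (n_blocks,) float16, quants: (n_blocks, 32) int8 -> flat float32 values."""
    return (scales.to(torch.float32)[:, None] * quants.to(torch.float32)).reshape(-1)

# Quantize a random weight row into blocks, then dequantize it back.
weights = torch.randn(128)
blocks = weights.reshape(-1, QK8_0)
scales = blocks.abs().amax(dim=1) / 127.0                    # one scale per block
quants = torch.round(blocks / scales[:, None]).to(torch.int8)
recovered = dequantize_q8_0(scales.to(torch.float16), quants)
print(torch.max((weights - recovered).abs()))                # small quantization error
```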
```python
import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
```
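As an optional sanity check of the behavior described above (weights stored in a low-memory dtype, dequantized on the fly to compute_dtype), you can inspect the parameters of the loaded transformer. This snippet assumes the transformer object from the example above.

```python
# Optional: inspect how the GGUF-quantized weights are stored.
# Quantized tensors typically report a low-memory dtype such as torch.uint8,
# while unquantized tensors (e.g. norm weights) stay in the torch_dtype above.
from collections import Counter

dtype_counts = Counter(param.dtype for param in transformer.parameters())
print(dtype_counts)
# Dequantization to compute_dtype (torch.bfloat16 here) happens during each
# module's forward pass.
```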
Supported Quantization Types
- BF16
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q2_K
- Q3_K
- Q4_K
- Q5_K
- Q6_K
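To load one of the other supported variants, only the checkpoint path changes; the rest of the example above stays the same. The sketch below assumes the city96/FLUX.1-dev-gguf repository hosts a Q8_0 file named flux1-dev-Q8_0.gguf; check the repository for the exact filenames available.

```python
import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# Swap in a different quantization variant by pointing at a different GGUF file.
# The filename below is an assumption; verify it against the Hugging Face repository.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
```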