AutoRound

AutoRound is an advanced quantization toolkit. It uses sign-gradient descent to reach high accuracy at ultra-low bit widths (2-4 bits) with minimal tuning, and it offers broad hardware compatibility. See our papers SignRoundV1 and SignRoundV2 for more details.

Install auto-round (version 0.13.0 or later):

pip install "auto-round>=0.13.0"

To use the Marlin kernel for faster CUDA inference, install gptqmodel:

pip install "gptqmodel>=5.8.0"
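Before loading a model, it can help to confirm that the installed packages meet the minimum versions quoted above. The following is a minimal sketch using only the standard library; the helper names (`release_tuple`, `meets_minimum`, `check_environment`) are hypothetical and not part of AutoRound or Diffusers, and the version parsing is deliberately simplified (it ignores pre-release ordering).

```python
from importlib import metadata

# Minimum versions quoted in this guide.
MINIMUMS = {"auto-round": "0.13.0", "gptqmodel": "5.8.0"}


def release_tuple(version):
    """Turn '0.13.0' into (0, 13, 0); suffixes such as 'rc1' or '+cu121' are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare numerically, so '0.9.7' is correctly older than '0.13.0'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so '0.13' and '0.13.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Map each package name to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

A "missing" result for gptqmodel is only a problem if you plan to use a backend that requires it.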

Load a quantized model

Load a pre-quantized AutoRound model by passing AutoRoundConfig to from_pretrained(). The method works with any model that loads via Accelerate and has torch.nn.Linear layers.

You can use PipelineQuantizationConfig to quantize specific components of a pipeline:

import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, AutoRoundConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": AutoRoundConfig(backend="auto")}
)
pipe = DiffusionPipeline.from_pretrained(
    "INCModel/Z-Image-W4A16-AutoRound",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")

Or load a quantized model component directly:

import torch
from diffusers import ZImageTransformer2DModel, ZImagePipeline, AutoRoundConfig

model_id = "INCModel/Z-Image-W4A16-AutoRound"

quantization_config = AutoRoundConfig(backend="auto")
transformer = ZImageTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

pipe = ZImagePipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")

AutoRound in Diffusers only supports loading pre-quantized models. To quantize a model from scratch, use the AutoRound CLI or Python API directly, then load the result with Diffusers.

torch.compile

AutoRound is compatible with torch.compile for faster inference. You can compile the quantized transformer (DiT) for better performance:

import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, AutoRoundConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": AutoRoundConfig(backend="auto")}
)
pipe = DiffusionPipeline.from_pretrained(
    "INCModel/Z-Image-W4A16-AutoRound",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

pipe.transformer = torch.compile(pipe.transformer, mode="default", fullgraph=False)
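Because torch.compile compiles lazily, the first call to the compiled module includes compilation time and is much slower than later calls. To judge the speedup fairly, time the pipeline only after a warm-up call. The helper below is a minimal, library-agnostic sketch; `time_calls` is a hypothetical name, not a Diffusers or PyTorch API.

```python
import time


def time_calls(fn, warmup=2, iters=5):
    """Call fn() warmup + iters times and report timings in seconds.

    The first call is timed separately because it pays any one-time cost
    (for a torch.compile'd module, that is where compilation happens).
    """
    if warmup < 1 or iters < 1:
        raise ValueError("warmup and iters must both be at least 1")

    start = time.perf_counter()
    fn()
    first_call = time.perf_counter() - start

    # Remaining warm-up calls are executed but not measured.
    for _ in range(warmup - 1):
        fn()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {
        "first_call": first_call,
        "steady_mean": sum(timings) / len(timings),
        "calls": warmup + iters,
    }


if __name__ == "__main__":
    # Stand-in workload: slow on the first call, fast afterwards.
    state = {"compiled": False}

    def fake_pipeline():
        if not state["compiled"]:
            time.sleep(0.05)
            state["compiled"] = True

    print(time_calls(fake_pipeline))
```

With the pipeline from this section, the callable could be something like `lambda: pipe("a cat holding a sign that says hello")`. A full pipeline call returns images on the host, so wall-clock timing around it should include the GPU work.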

Backends

AutoRound supports multiple inference backends for weight-only quantized models. The backend controls which kernel handles dequantization during the forward pass. Set the backend parameter in AutoRoundConfig to choose one:

Backend	Value	Device	Requirements	Notes
Auto	`"auto"`	Any	—	Default. Automatically selects the best available backend.
PyTorch	`"torch"`	CPU / CUDA	—	Pure PyTorch implementation. Broadest compatibility.
Triton	`"tritonv2"`	CUDA	`triton`	Triton-based kernel for GPU inference.
ExllamaV2	`"exllamav2"`	CUDA	`gptqmodel>=5.8.0`	Good CUDA performance via the ExllamaV2 kernel.
Marlin	`"marlin"`	CUDA	`gptqmodel>=5.8.0`	Best CUDA performance via the Marlin kernel.

from diffusers import AutoRoundConfig

# Auto-select (default)
config = AutoRoundConfig()

# Explicit Triton backend for CUDA
config = AutoRoundConfig(backend="tritonv2")

# Marlin backend for best CUDA performance (requires gptqmodel>=5.8.0)
config = AutoRoundConfig(backend="marlin")

# ExllamaV2 backend for good CUDA performance (requires gptqmodel>=5.8.0)
config = AutoRoundConfig(backend="exllamav2")

# PyTorch backend for CPU/CUDA inference
config = AutoRoundConfig(backend="torch")
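If you prefer to pin the backend explicitly rather than rely on "auto", the choice can be derived from the device and the optional packages that are installed. The sketch below is one possible heuristic built from the table above; `pick_backend` is a hypothetical helper, the ordering of Triton ahead of plain PyTorch is an assumption, and the function checks neither package versions nor whether the GPU supports a given kernel.

```python
from importlib.util import find_spec

# (backend value, required package) in descending order of preference on CUDA.
# "exllamav2" is omitted because it has the same requirement as "marlin",
# so this simple heuristic would never select it.
CUDA_PREFERENCE = [
    ("marlin", "gptqmodel"),
    ("tritonv2", "triton"),
    ("torch", None),
]


def pick_backend(device, installed=None):
    """Return a value for AutoRoundConfig(backend=...).

    device: a device string such as "cuda", "cuda:0", or "cpu".
    installed: optional set of importable package names; detected if omitted.
    """
    if installed is None:
        installed = {
            name for name in ("gptqmodel", "triton") if find_spec(name) is not None
        }

    if device == "cpu":
        return "torch"
    if not device.startswith("cuda"):
        # Devices not covered by the table: defer to automatic selection.
        return "auto"

    for backend, requirement in CUDA_PREFERENCE:
        if requirement is None or requirement in installed:
            return backend
    return "torch"


if __name__ == "__main__":
    print(pick_backend("cuda"))
```

The returned string can then be passed as `AutoRoundConfig(backend=pick_backend("cuda"))`.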

Save and load

Supported Quantization Schemes

AutoRound supports several quantization schemes:

W4A16 (bits: 4, group_size: 128, sym: True, act_bits: 16)
W8A16 (bits: 8, group_size: 128, sym: True, act_bits: 16)
W3A16 (bits: 3, group_size: 128, sym: True, act_bits: 16)
W2A16 (bits: 2, group_size: 128, sym: True, act_bits: 16)
GGUF:Q4_K_M (all Q*_K, Q*_0, and Q*_1 types provided by llama.cpp are supported)
NVFP4 (experimental feature; exporting to the llm_compressor format is recommended; data_type nvfp4, act_data_type nvfp4, static_global_scale, group_size 16)
MXFP4 (research feature, no real kernel; standard MXFP4; data_type mxfp, act_data_type mxfp, bits 4, act_bits 4, group_size 32)
MXINT4 (research feature, no real kernel; standard MXINT4; data_type mxint, act_data_type mxint, bits 4, act_bits 4, group_size 32)
MXFP4_RCEIL (research feature, no real kernel; NVIDIA’s variant; data_type mxfp, act_data_type mxfp_rceil, bits 4, act_bits 4, group_size 32)
MXFP8 (research feature, no real kernel; data_type mxfp, act_data_type mxfp_rceil, group_size 32)
FPW8A16 (research feature, no real kernel; data_type fp8, group_size 0, i.e. per tensor)
FP8_STATIC (research feature, no real kernel; data_type fp8, act_data_type fp8, group_size -1, i.e. per channel; act_group_size 0, i.e. per tensor)

You can also change group_size, bits, sym, and many other settings, although a custom combination may have no real kernel to run it.
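To make the bits, group_size, and sym settings concrete, the sketch below quantizes a flat list of weights group by group with symmetric scaling and then dequantizes it. It uses plain round-to-nearest, not AutoRound's sign-gradient-tuned rounding, and the scale convention (max |w| mapped to 2^(bits-1) - 1) and the 16-bit scale storage are assumptions made for illustration. All function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group (no zero point)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(max(round(w / scale), qmin), qmax) for w in weights]
    return codes, scale


def quantize(weights, bits=4, group_size=128):
    """Split weights into groups of group_size; each group gets its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Recover approximate weights: integer code times the group's scale."""
    restored = []
    for codes, scale in groups:
        restored.extend(code * scale for code in codes)
    return restored


def bits_per_weight(bits, group_size, scale_bits=16):
    """Average storage per weight, assuming one scale_bits-wide scale per group."""
    return bits + scale_bits / group_size


if __name__ == "__main__":
    weights = [((i * 37) % 101 - 50) / 50 for i in range(256)]
    groups = quantize(weights, bits=4, group_size=128)
    restored = dequantize(groups)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"groups: {len(groups)}, worst abs error: {worst:.4f}")
    print(f"W4A16 storage: {bits_per_weight(4, 128)} bits per weight")
```

A smaller group_size gives each scale fewer weights to cover, which usually lowers the rounding error at the cost of storing more scales.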

Resources

Pre-quantized AutoRound models on the Hub
