Compiling and offloading quantized models

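Optimizing a model usually involves a trade-off between inference speed and memory usage. For example, caching can speed up inference, but it also increases memory consumption because it has to store the outputs of intermediate attention layers. A more balanced optimization strategy combines model quantization, torch.compile, and various offloading methods.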

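See the torch.compile guide to learn more about compilation and how it applies here. For example, regional compilation can significantly reduce compilation time without giving up any of the speedup.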

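For image generation, combining quantization with model offloading usually gives the best trade-off between quality, speed, and memory. Group offloading is less effective for image generation because the data transfers usually cannot be fully overlapped with computation when the compute kernels finish faster than the transfers. This leaves some communication overhead between the CPU and GPU.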

For video generation, combining quantization with group offloading tends to work better because video models are more compute-bound.

The table below compares combinations of optimization strategies and their effect on latency and memory usage for Flux.

Combination	Latency (s)	Memory usage (GB)
Quantization	32.602	14.9453
Quantization, torch.compile	25.847	14.9448
Quantization, torch.compile, model CPU offloading	32.312	12.2369

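These results were benchmarked on Flux with an RTX 4090. The transformer and text_encoder components are quantized. If you are interested in evaluating your own model, refer to the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d).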
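
To make the trade-offs in the table concrete, the short sketch below computes each combination's change relative to the quantization-only row. The numbers are copied from the table; the helper name relative_change is only illustrative.

```python
# Latency (s) and memory usage (GB) copied from the table above.
results = {
    "quantization": (32.602, 14.9453),
    "quantization, torch.compile": (25.847, 14.9448),
    "quantization, torch.compile, model CPU offloading": (32.312, 12.2369),
}


def relative_change(value, baseline):
    """Return the percentage change of value relative to baseline (negative means lower)."""
    return (value - baseline) / baseline * 100


# Use the quantization-only row as the baseline for every comparison.
base_latency, base_memory = results["quantization"]
for name, (latency, memory) in results.items():
    print(
        f"{name}: "
        f"latency {relative_change(latency, base_latency):+.1f}%, "
        f"memory {relative_change(memory, base_memory):+.1f}%"
    )
```

On these numbers, adding torch.compile lowers latency by about 21% at essentially unchanged memory, while adding model CPU offloading lowers memory by about 18% and gives back almost all of the latency gain.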

This guide shows you how to compile and offload a quantized model with bitsandbytes. Make sure you are using PyTorch nightly and the latest version of bitsandbytes.

pip install -U bitsandbytes

Quantization and torch.compile

Start by quantizing the model to reduce the memory needed to store it, then compile it to speed up inference.

Set the Dynamo option capture_dynamic_output_shape_ops = True to handle dynamic outputs when compiling bitsandbytes models.

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# Quantize the transformer and the second text encoder to 4-bit NF4
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the transformer
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
pipeline("""
    cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
    highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
).images[0]

Quantization, torch.compile, and offloading

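In addition to quantization and torch.compile, try offloading if you need to reduce memory usage further. Offloading moves layers or model components from the CPU to the GPU as they are needed for computation.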

During offloading, configure the Dynamo cache_size_limit to avoid excessive recompilation, and set capture_dynamic_output_shape_ops = True to handle dynamic outputs when compiling bitsandbytes models.

model CPU offloading
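
A minimal sketch of this combination, reusing the quantization settings from the earlier example. The function name load_with_model_cpu_offload and the cache_size_limit value of 1000 are assumptions chosen for illustration, not values taken from this guide. The imports sit inside the function only so that defining it does not require a GPU environment.

```python
def load_with_model_cpu_offload(model_id="black-forest-labs/FLUX.1-dev"):
    """Quantize the pipeline, enable model CPU offloading, then compile the transformer."""
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.quantizers import PipelineQuantizationConfig

    # Assumed value: raise the recompilation limit well above the default.
    torch._dynamo.config.cache_size_limit = 1000
    torch._dynamo.config.capture_dynamic_output_shape_ops = True

    # Same 4-bit NF4 settings as the quantization example in this guide.
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=["transformer", "text_encoder_2"],
    )
    pipeline = DiffusionPipeline.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )

    # Each component is moved to the GPU only while it runs,
    # so the pipeline is not moved to "cuda" up front.
    pipeline.enable_model_cpu_offload()

    # Compile after offloading is configured.
    pipeline.transformer.compile()
    return pipeline


# Usage (requires a CUDA GPU and access to the model weights):
# image = load_with_model_cpu_offload()("a cat sipping a margarita in a pool").images[0]
```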

group offloading
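
A comparable sketch for group offloading, assuming the apply_group_offloading helper from diffusers.hooks is applied to every model component of the pipeline. The function name load_with_group_offload, the choice of leaf_level offloading with streams, and the cache_size_limit value of 1000 are illustrative assumptions rather than settings prescribed by this guide.

```python
def load_with_group_offload(model_id="black-forest-labs/FLUX.1-dev"):
    """Quantize the pipeline, apply group offloading, then compile the transformer."""
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.hooks import apply_group_offloading
    from diffusers.quantizers import PipelineQuantizationConfig

    # Assumed value: raise the recompilation limit well above the default.
    torch._dynamo.config.cache_size_limit = 1000
    torch._dynamo.config.capture_dynamic_output_shape_ops = True

    # Same 4-bit NF4 settings as the quantization example in this guide.
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=["transformer", "text_encoder_2"],
    )
    pipeline = DiffusionPipeline.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )

    onload_device = torch.device("cuda")
    offload_device = torch.device("cpu")
    # Offload each component at leaf level; use_stream=True lets weight
    # transfers overlap with computation where the hardware allows it.
    for component in (
        pipeline.transformer,
        pipeline.vae,
        pipeline.text_encoder,
        pipeline.text_encoder_2,
    ):
        apply_group_offloading(
            component,
            onload_device=onload_device,
            offload_device=offload_device,
            offload_type="leaf_level",
            use_stream=True,
        )

    # Compile after offloading is configured.
    pipeline.transformer.compile()
    return pipeline


# Usage (requires a CUDA GPU and access to the model weights):
# image = load_with_group_offload()("a cat sipping a margarita in a pool").images[0]
```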
