Model Compression with NNCF

Usage

  1. Use the Diffusers backend: Execution & Models -> Execution backend
  2. Go into Compute Settings
  3. Enable the Compress Model weights with NNCF options (a sketch of the underlying call follows below)
  4. Restart the WebUI if it's your first time using NNCF; otherwise, just reload the model.
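Under the hood, enabling these options applies NNCF weight compression to the selected model parts. A minimal sketch of the equivalent standalone call, assuming a diffusers SDXL pipeline; the model ID and variable names here are illustrative, not SD.Next's actual code path:

```python
import torch
import nncf  # pip install nncf
from diffusers import StableDiffusionXLPipeline

# Load the pipeline in 16-bit, as the Diffusers backend does.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# nncf.compress_weights() defaults to INT8 weight-only compression:
# Linear/Conv weights are stored as INT8 and decompressed back to 16-bit
# on the fly at inference time (hence the autocast slowdown noted under
# Disadvantages below).
pipe.unet = nncf.compress_weights(pipe.unet)
```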

Features

  • Uses INT8 weights, roughly halving the model size in memory
    (saves about 3.4 GB of VRAM with SDXL)
  • Works in the Diffusers backend
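The halving follows from storage width: 16-bit weights take 2 bytes per parameter, INT8 takes 1. A back-of-the-envelope check, using approximate public parameter counts for SDXL (the counts are assumptions for illustration, not numbers from this wiki):

```python
# Approximate SDXL parameter counts (assumptions, for illustration only):
params = {
    "unet": 2.57e9,           # SDXL UNet
    "text_encoders": 0.82e9,  # CLIP ViT-L + OpenCLIP ViT-bigG
    "vae": 0.08e9,            # autoencoder
}

# FP16 = 2 bytes/param, INT8 = 1 byte/param -> 1 byte saved per parameter.
saved_gb = sum(params.values()) / 1024**3
print(f"~{saved_gb:.1f} GB saved")  # ~3.2 GB, in line with the ~3.4 GB above
```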

Disadvantages

  • It uses autocast: the GPU still runs the model in 16-bit, decompressing
    weights on the fly, so inference is slower
  • Uses INT8, which can break ControlNet
  • Applying a LoRA will trigger a model reload
  • Not implemented in the Original backend
  • Fused projections are not compatible with NNCF (see the sketch below)
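If fused QKV projections were enabled elsewhere, they need to be undone before compressing; a hedged illustration using the public diffusers API, assuming `pipe` is an already-loaded SDXL-class pipeline as in the sketch above:

```python
# Fused QKV projections merge the attention Q/K/V weights into one matrix,
# which NNCF's compressed modules cannot wrap; unfuse before compressing.
pipe.unfuse_qkv_projections()  # public diffusers API on SDXL-class pipelines
pipe.unet = nncf.compress_weights(pipe.unet)
```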

Options

These results compare NNCF 8-bit to 16-bit.

  • Model:
    Compresses the UNet or Transformer part of the model.
    This is where most of the memory savings happen for Stable Diffusion.

    SDXL: ~2500 MB memory savings.
    SD 1.5: ~750 MB memory savings.
    PixArt-XL-2: ~600 MB memory savings.

  • Text Encoder:
    Compresses the Text Encoder parts of the model.
    This is where most of the memory savings happen for PixArt.

    PixArt-XL-2: ~4750 MB memory savings.
    SDXL: ~750 MB memory savings.
    SD 1.5: ~120 MB memory savings.

  • VAE:
    Compresses the VAE part of the model.
    Memory savings from compressing the VAE are fairly small.

    SD 1.5 / SDXL / PixArt-XL-2: ~75 MB memory savings.

  • 4 Bit Compression and Quantization:
    4-bit compression modes and quantization can be used with the OpenVINO backend;
    a sketch follows below.
    For more info: https://github.com/vladmandic/automatic/wiki/OpenVINO#quantization
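For reference, a minimal sketch of 4-bit weight compression through NNCF on an OpenVINO model, per the linked page. The file paths are placeholders, and the mode/ratio/group_size values shown are illustrative uses of the public nncf.compress_weights() parameters, not SD.Next settings:

```python
import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model("unet.xml")  # placeholder path to a converted UNet

compressed = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_SYM,  # 4-bit symmetric weights
    ratio=0.8,      # fraction of layers compressed to 4-bit; the rest stay 8-bit
    group_size=64,  # per-group quantization granularity
)
ov.save_model(compressed, "unet_int4.xml")
```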