Boogu Image 0.1 Edit Turbo SDNQ UINT4 Static

SDNQ 4-bit unsigned static quantization of Boogu/Boogu-Image-0.1-Edit-Turbo.

Source checkpoint: main@2026-06-30 / 0049942e5cc8340ef5d5843b5574756d7c30be55.

[Image: Original vs SDNQ comparison]

What Is Quantized

Selected recipe: uint4-static-transformer-only.

Only the diffusion transformer is quantized with SDNQ UINT4 static weights. The MLLM instruction encoder, processor, scheduler, VAE, tokenizer assets, and Boogu pipeline code are copied from the upstream checkpoint.
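
To make "UINT4" concrete, the following is a minimal sketch of asymmetric unsigned 4-bit weight quantization over a single row of weights. It illustrates the general technique only and is not SDNQ's implementation: the function names, the per-row scale and offset, and the nibble packing order are all assumptions made for illustration.

```python
def quantize_uint4(row):
    """Map floats to integer codes in [0, 15] using a per-row scale and offset."""
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant row
    codes = [min(15, max(0, round((w - lo) / scale))) for w in row]
    return codes, scale, lo


def dequantize_uint4(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


def pack_uint4(codes):
    """Pack two 4-bit codes per byte, low nibble first (illustrative layout)."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))


weights = [-0.75, -0.1, 0.0, 0.3, 0.9, 0.45]
codes, scale, lo = quantize_uint4(weights)
restored = dequantize_uint4(codes, scale, lo)
packed = pack_uint4(codes)  # 6 weights fit in 3 bytes
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this scheme the rounding error per weight is at most half of `scale`, and the stored codes take a quarter of the space of 16-bit weights, before counting the scale and offset metadata.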

Benchmark Setup

  • Pipeline: BooguImageTurboPipeline
  • Task: ti2i
  • Resolution: 1024x1024
  • Steps: 4
  • Guidance: text_guidance_scale=1.0, image_guidance_scale=1.0, empty_instruction_guidance_scale=0.0
  • DMD conditioning sigma: 0.0
  • Torch dtype: bfloat16
  • Prompt set: 10 prompts covering simple scenes, abstract imagery, public-domain style, a historical public figure, complex typography, dense Latin text, dense Russian text, and diagrams
  • Hardware: NVIDIA RTX PRO 6000 Blackwell Server Edition on a disposable RunPod pod with local container disk
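
The benchmark reports a cold generation time and a hot mean. One common way to obtain such numbers is to time the first call separately and average the calls that follow, as sketched below. This is an assumption about the measurement method, not a copy of the benchmark script; `measure_cold_and_hot` and `fake_generate` are hypothetical names, and a real run would wrap the pipeline call and synchronize the GPU before reading the clock.

```python
import time
from statistics import mean


def measure_cold_and_hot(generate, prompts):
    """Return (cold_seconds, hot_mean_seconds) for a callable run over prompts.

    The first prompt is treated as the cold run; the rest are hot runs.
    """
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    cold, hot = timings[0], timings[1:]
    return cold, (mean(hot) if hot else float("nan"))


def fake_generate(prompt):
    """Stand-in for a pipeline call so the sketch runs without a GPU."""
    time.sleep(0.001)
    return prompt.upper()


cold_s, hot_mean_s = measure_cold_and_hot(fake_generate, ["a", "b", "c"])
```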

Benchmark Summary

| Model | Load | Cold gen | Hot mean | VRAM after load | VRAM during gen | VRAM after gen | Torch peak |
|---|---|---|---|---|---|---|---|
| original | 21.166 s | 11.769 s | 7.618 s | 36603 MB | 41017 MB | 41017 MB | 37851.20 MB |
| sdnq | 19.761 s | 11.216 s | 10.784 s | 21855 MB | 26293 MB | 26293 MB | 23711.03 MB |

Raw per-prompt metrics are in benchmark/*.metrics.csv and benchmark/*.metrics.jsonl. The combined summary is in benchmark/summary.json.
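
The summary above can be reduced to a few headline ratios: how much VRAM the SDNQ variant saves and how its hot runs compare in speed on this setup. The arithmetic below uses only the figures reported in the summary; the variable names are illustrative.

```python
# Values copied from the benchmark summary (seconds and MB).
original = {"hot_mean_s": 7.618, "vram_load_mb": 36603, "vram_gen_mb": 41017}
sdnq = {"hot_mean_s": 10.784, "vram_load_mb": 21855, "vram_gen_mb": 26293}

load_saving = 1 - sdnq["vram_load_mb"] / original["vram_load_mb"]
gen_saving = 1 - sdnq["vram_gen_mb"] / original["vram_gen_mb"]
hot_slowdown = sdnq["hot_mean_s"] / original["hot_mean_s"]

print(f"VRAM after load: {load_saving:.1%} lower")
print(f"VRAM during generation: {gen_saving:.1%} lower")
print(f"Hot mean: {hot_slowdown:.2f}x the original")
```

On these numbers the SDNQ variant uses roughly 40% less VRAM after load and about 36% less during generation, while its hot mean is about 1.4 times the original's.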

Consumer GPU Offload Smoke

These additional SDNQ rows were measured on an NVIDIA GeForce RTX 5090 (32607 MB VRAM) on a disposable RunPod pod. RTX 3090 and RTX 4090 pods could not be allocated at run time, so the RTX 5090 was used as the nearest available consumer-class fallback. Runtime: PyTorch 2.9.1+cu128, CUDA runtime 12.8, NVIDIA driver 580.126.09. For the Edit model, these rows used the same 10 reference images as the original comparison set.

| Model | Load | Cold gen | Hot mean | VRAM after load | VRAM during gen | VRAM after gen | Torch peak |
|---|---|---|---|---|---|---|---|
| sdnq + model offload | 7.126 s | 32.162 s | 20.328 s | 510 MB | 18368 MB | 660 MB | 17278.34 MB |
| sdnq + sequential offload | 7.170 s | 16.400 s | 11.961 s | 512 MB | 2954 MB | 1426 MB | 2448.23 MB |

Offload metrics are stored as benchmark/sdnq-model-offload.* and benchmark/sdnq-sequential-offload.*.
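
The two offload rows resemble the CPU offload modes that diffusers pipelines expose through `enable_model_cpu_offload()` and `enable_sequential_cpu_offload()`. The sketch below assumes that `BooguImageTurboPipeline` inherits these methods from diffusers' `DiffusionPipeline` and that the benchmark used them; this card confirms neither. `apply_offload` is a hypothetical helper.

```python
def apply_offload(pipe, mode, device="cuda:0"):
    """Place a pipeline on the GPU or enable one of two CPU offload modes.

    Assumes `pipe` exposes the diffusers-style methods named below.
    """
    if mode == "none":
        return pipe.to(device)
    if mode == "model":
        # Moves one whole component to the GPU at a time.
        pipe.enable_model_cpu_offload()
    elif mode == "sequential":
        # Streams submodules to the GPU as they are needed.
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown offload mode: {mode!r}")
    return pipe
```

When an offload mode is enabled, build the pipeline without the trailing `.to(device)` shown in the Usage section, since the offload hooks manage device placement themselves.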

Usage

Install the dependencies:

pip install -U git+https://github.com/boogu-project/Boogu-Image.git sdnq transformers accelerate safetensors huggingface_hub

Then build the pipeline in Python:

import sys
import torch
from diffusers.models import AutoencoderKL
from huggingface_hub import snapshot_download
from transformers import AutoModelForImageTextToText, AutoProcessor
from boogu.models.transformers.transformer_boogu import BooguImageTransformer2DModel
from boogu.pipelines.boogu.pipeline_boogu_turbo import BooguImageTurboPipeline
from boogu.schedulers.scheduling_flow_match_euler_discrete_time_shifting import FlowMatchEulerDiscreteScheduler
from sdnq.loader import load_sdnq_model

repo_id = "WaveCut/Boogu-Image-0.1-Edit-Turbo-SDNQ-uint4-static"
device = "cuda:0"
repo_dir = snapshot_download(repo_id)

transformer_code_dir = f"{repo_dir}/transformer"
if transformer_code_dir not in sys.path:
    sys.path.insert(0, transformer_code_dir)

transformer = load_sdnq_model(
    f"{repo_dir}/transformer",
    model_cls=BooguImageTransformer2DModel,
    dtype=torch.bfloat16,
    device="cpu",
    dequantize_fp32=False,
    use_quantized_matmul=True,
)

pipe = BooguImageTurboPipeline(
    transformer=transformer,
    vae=AutoencoderKL.from_pretrained(f"{repo_dir}/vae", torch_dtype=torch.bfloat16),
    scheduler=FlowMatchEulerDiscreteScheduler.from_pretrained(f"{repo_dir}/scheduler"),
    mllm=AutoModelForImageTextToText.from_pretrained(
        f"{repo_dir}/mllm",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    ),
    processor=AutoProcessor.from_pretrained(f"{repo_dir}/processor", trust_remote_code=True),
).to(device)

# Text-to-image models use this call:
image = pipe(
    instruction=["A precise studio photograph of a glass lamp on a dark table"],
    negative_instruction="",
    empty_instruction="",
    height=1024,
    width=1024,
    num_inference_steps=4,
    text_guidance_scale=1.0,
    image_guidance_scale=1.0,
    empty_instruction_guidance_scale=0.0,
    use_dmd_student_inference=True,
    dmd_conditioning_sigma=0.0,
    generator=torch.Generator(device).manual_seed(42),
).images[0]

For image-editing models such as this checkpoint, also pass input_image_paths, input_images, align_res, and the same DMD settings used by upstream Boogu Edit Turbo.

You can also download explicitly with hf download WaveCut/Boogu-Image-0.1-Edit-Turbo-SDNQ-uint4-static --local-dir ./boogu-sdnq and set repo_dir = "./boogu-sdnq".

Quantization Recipe

{
  "dynamic_loss_threshold": null,
  "group_size": 0,
  "modules": [
    "transformer"
  ],
  "name": "uint4-static-transformer-only",
  "quant_conv": false,
  "quant_embedding": false,
  "svd_rank": 32,
  "svd_steps": 32,
  "use_dynamic_quantization": false,
  "use_svd": false,
  "weights_dtype": "uint4"
}

Release Contents

  • transformer/: SDNQ UINT4 static transformer weights and quantization_config.json
  • mllm/, processor/, scheduler/, vae/: copied from the upstream checkpoint
  • benchmark/: original and SDNQ metrics, summaries, and prompt output metadata
  • assets/original_vs_sdnq_edit.webp: native-resolution original-vs-SDNQ WebP comparison grid, quality 95
  • prompts.json, quantization_manifest.json, SHA256SUMS
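
Because the release ships a SHA256SUMS file, a local download can be checked before loading. The sketch below assumes the file uses the common `sha256sum` layout (hex digest, whitespace, relative path on each line), which this card does not confirm; `verify_checksums` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def verify_checksums(repo_dir, sums_name="SHA256SUMS"):
    """Return the relative paths whose SHA-256 digest does not match."""
    root = Path(repo_dir)
    mismatched = []
    for line in (root / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*' on the path.
        rel_path = rel_path.strip().lstrip("*")
        digest = hashlib.sha256()
        with open(root / rel_path, "rb") as handle:
            # Read in 1 MiB chunks so large weight files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            mismatched.append(rel_path)
    return mismatched
```

For example, `verify_checksums("./boogu-sdnq")` would return an empty list when every listed file matches, and raise `FileNotFoundError` if a listed file is missing.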

Limitations

  • This is a quantized derivative and inherits upstream behavior and limitations.
  • The comparison set is a deployment smoke benchmark, not a preference study.
  • Text rendering, Cyrillic text, and small labels should still be inspected manually for production use.
  • Benchmark numbers depend on GPU, driver, CUDA, PyTorch, Transformers, Diffusers, Boogu code, and SDNQ versions.