Boogu Image 0.1 Edit Turbo SDNQ UINT4 Static

SDNQ 4-bit unsigned static quantization of Boogu/Boogu-Image-0.1-Edit-Turbo.

Source checkpoint: main@2026-06-30 / 0049942e5cc8340ef5d5843b5574756d7c30be55.

What Is Quantized

Selected recipe: uint4-static-transformer-only.

Only the diffusion transformer is quantized with SDNQ UINT4 static weights. The MLLM instruction encoder, processor, scheduler, VAE, tokenizer assets, and Boogu pipeline code are copied from the upstream checkpoint.

Benchmark Setup

Pipeline: BooguImageTurboPipeline
Task: ti2i
Resolution: 1024x1024
Steps: 4
Guidance: text_guidance_scale=1.0, image_guidance_scale=1.0, empty_instruction_guidance_scale=0.0
DMD conditioning sigma: 0.0
Torch dtype: bfloat16
Prompt set: 10 prompts covering simple scenes, abstract imagery, public-domain style, a historical public figure, complex typography, dense Latin text, dense Russian text, and diagrams
Hardware: NVIDIA RTX PRO 6000 Blackwell Server Edition on a disposable RunPod pod with local container disk

Benchmark Summary

Model	Load	Cold gen	Hot mean	VRAM after load	VRAM during gen	VRAM after gen	Torch peak
original	21.166 s	11.769 s	7.618 s	36603 MB	41017 MB	41017 MB	37851.20 MB
sdnq	19.761 s	11.216 s	10.784 s	21855 MB	26293 MB	26293 MB	23711.03 MB

Raw per-prompt metrics are in benchmark/*.metrics.csv and benchmark/*.metrics.jsonl. The combined summary is in benchmark/summary.json.
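
The summary rows can be reduced to a few relative changes. The snippet below is a small sketch that only reuses numbers copied from the table above; the dictionary keys and the relative_change helper are illustrative names, not part of the benchmark files.

```python
# Values copied from the benchmark summary table (original vs. sdnq rows).
original = {"hot_mean_s": 7.618, "vram_after_load_mb": 36603, "vram_during_gen_mb": 41017}
sdnq = {"hot_mean_s": 10.784, "vram_after_load_mb": 21855, "vram_during_gen_mb": 26293}


def relative_change(new, old):
    """Return (new - old) / old; a negative value means a reduction."""
    return (new - old) / old


# One relative change per metric, SDNQ measured against the original model.
changes = {key: relative_change(sdnq[key], original[key]) for key in original}

for key, value in changes.items():
    print(f"{key}: {value:+.1%}")
```

On this hardware the quantized transformer trades roughly 40% less VRAM after load and about 36% less during generation for a hot mean that is about 42% higher.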

Consumer GPU Offload Smoke Test

These additional SDNQ rows were measured on an NVIDIA GeForce RTX 5090 (32607 MB VRAM) on a disposable RunPod pod. RTX 3090 and RTX 4090 pods could not be allocated at run time, so the RTX 5090 was used as the nearest available consumer-class fallback. Runtime: PyTorch 2.9.1+cu128, CUDA runtime 12.8, NVIDIA driver 580.126.09. These rows used the same 10 reference images as the original comparison set.

Model	Load	Cold gen	Hot mean	VRAM after load	VRAM during gen	VRAM after gen	Torch peak
sdnq + model offload	7.126 s	32.162 s	20.328 s	510 MB	18368 MB	660 MB	17278.34 MB
sdnq + sequential offload	7.170 s	16.400 s	11.961 s	512 MB	2954 MB	1426 MB	2448.23 MB

Offload metrics are stored as benchmark/sdnq-model-offload.* and benchmark/sdnq-sequential-offload.*.
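
The card does not show how the two offload modes were enabled. The sketch below assumes that BooguImageTurboPipeline inherits from the Diffusers DiffusionPipeline base class, which provides enable_model_cpu_offload() and enable_sequential_cpu_offload(); that inheritance, and the apply_offload helper itself, are assumptions rather than something stated in this repository.

```python
def apply_offload(pipe, mode="none", device="cuda:0"):
    """Place a pipeline on the GPU, or enable one of two CPU offload modes.

    Assumes `pipe` exposes the Diffusers offload methods; this has not been
    confirmed for the Boogu pipeline class.
    """
    if mode == "none":
        # Everything stays resident on the GPU (the non-offload rows).
        return pipe.to(device)
    if mode == "model":
        # Moves whole sub-models to the GPU only while they are running.
        pipe.enable_model_cpu_offload()
    elif mode == "sequential":
        # Moves individual submodules on demand: lowest VRAM use.
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown offload mode: {mode!r}")
    return pipe
```

With Diffusers pipelines, offload is normally enabled on a pipeline that has not already been moved with .to(device), so under this assumption the pipeline would be built without the trailing .to(device) call and then passed to apply_offload(pipe, "sequential").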

Usage

pip install -U git+https://github.com/boogu-project/Boogu-Image.git sdnq transformers accelerate safetensors huggingface_hub

import sys
import torch
from diffusers.models import AutoencoderKL
from huggingface_hub import snapshot_download
from transformers import AutoModelForImageTextToText, AutoProcessor
from boogu.models.transformers.transformer_boogu import BooguImageTransformer2DModel
from boogu.pipelines.boogu.pipeline_boogu_turbo import BooguImageTurboPipeline
from boogu.schedulers.scheduling_flow_match_euler_discrete_time_shifting import FlowMatchEulerDiscreteScheduler
from sdnq.loader import load_sdnq_model

repo_id = "WaveCut/Boogu-Image-0.1-Edit-Turbo-SDNQ-uint4-static"
device = "cuda:0"
repo_dir = snapshot_download(repo_id)

transformer_code_dir = f"{repo_dir}/transformer"
if transformer_code_dir not in sys.path:
    sys.path.insert(0, transformer_code_dir)

transformer = load_sdnq_model(
    f"{repo_dir}/transformer",
    model_cls=BooguImageTransformer2DModel,
    dtype=torch.bfloat16,
    device="cpu",
    dequantize_fp32=False,
    use_quantized_matmul=True,
)

pipe = BooguImageTurboPipeline(
    transformer=transformer,
    vae=AutoencoderKL.from_pretrained(f"{repo_dir}/vae", torch_dtype=torch.bfloat16),
    scheduler=FlowMatchEulerDiscreteScheduler.from_pretrained(f"{repo_dir}/scheduler"),
    mllm=AutoModelForImageTextToText.from_pretrained(
        f"{repo_dir}/mllm",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    ),
    processor=AutoProcessor.from_pretrained(f"{repo_dir}/processor", trust_remote_code=True),
).to(device)

# Text-to-image models use:
image = pipe(
    instruction=["A precise studio photograph of a glass lamp on a dark table"],
    negative_instruction="",
    empty_instruction="",
    height=1024,
    width=1024,
    num_inference_steps=4,
    text_guidance_scale=1.0,
    image_guidance_scale=1.0,
    empty_instruction_guidance_scale=0.0,
    use_dmd_student_inference=True,
    dmd_conditioning_sigma=0.0,
    generator=torch.Generator(device).manual_seed(42),
).images[0]

This checkpoint is an image-editing model, so for editing also pass input_image_paths, input_images, and align_res, along with the same DMD settings used by upstream Boogu Edit Turbo.

You can also download explicitly with hf download WaveCut/Boogu-Image-0.1-Edit-Turbo-SDNQ-uint4-static --local-dir ./boogu-sdnq and set repo_dir = "./boogu-sdnq".

Quantization Recipe

{
  "dynamic_loss_threshold": null,
  "group_size": 0,
  "modules": [
    "transformer"
  ],
  "name": "uint4-static-transformer-only",
  "quant_conv": false,
  "quant_embedding": false,
  "svd_rank": 32,
  "svd_steps": 32,
  "use_dynamic_quantization": false,
  "use_svd": false,
  "weights_dtype": "uint4"
}
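
To give a feel for what "weights_dtype": "uint4" implies, here is a generic sketch of unsigned 4-bit affine quantization of a single row of weights. It is an illustration only: it is not SDNQ's implementation, and it does not model how SDNQ handles group_size, packing, or quantized matmul. The function names are hypothetical.

```python
def quantize_uint4(weights):
    """Map floats to integer codes in 0..15 using a per-row scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15
    if scale == 0:
        scale = 1.0  # constant row: any scale works, avoid dividing by zero
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_uint4(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


row = [-0.75, -0.1, 0.0, 0.3, 0.75]
codes, scale, offset = quantize_uint4(row)
restored = dequantize_uint4(codes, scale, offset)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(codes, f"max error {max_error:.4f}, step {scale:.4f}")
```

Each weight needs only 4 bits plus a shared scale and offset, which is where the VRAM reduction in the benchmark comes from; the price is a reconstruction error of up to half a step per weight.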

Release Contents

transformer/: SDNQ UINT4 static transformer weights and quantization_config.json
mllm/, processor/, scheduler/, vae/: copied from the upstream checkpoint
benchmark/: original and SDNQ metrics, summaries, and prompt output metadata
assets/original_vs_sdnq_edit.webp: native-resolution original-vs-SDNQ WebP comparison grid, quality 95
prompts.json, quantization_manifest.json, SHA256SUMS
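
The SHA256SUMS file can be used to check a downloaded copy. The sketch below assumes the file follows the usual sha256sum layout (a hex digest, whitespace, then a path relative to the repository root); that layout is an assumption, and verify_sha256sums is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def verify_sha256sums(root, sums_name="SHA256SUMS"):
    """Return the relative paths that are missing or whose digest differs."""
    root = Path(root)
    failures = []
    for line in (root / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        path = root / name
        if not path.is_file():
            failures.append(name)
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large weight files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            failures.append(name)
    return failures
```

For example, verify_sha256sums(repo_dir) returns an empty list when every listed file matches its recorded digest.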

Limitations

This is a quantized derivative and inherits upstream behavior and limitations.
The comparison set is a deployment smoke benchmark, not a preference study.
Text rendering, Cyrillic text, and small labels should still be inspected manually for production use.
Benchmark numbers depend on GPU, driver, CUDA, PyTorch, Transformers, Diffusers, Boogu code, and SDNQ versions.
