Z-Image-Turbo FP8 — full

All transformer linear layers are quantized to FP8: the FFN projections (feed_forward.w[1-3]) plus the full attention block (attention.to_qkv, attention.to_out).

The overall best practical choice in our benchmark is online_fp8: 5.937 s/image in Pass A (latency 24.2% below BF16) with a peak GPU footprint of 13.34 GB (38.4% less than BF16). It runs straight from the upstream Tongyi-MAI/Z-Image-Turbo checkpoint with --quantization-config '{"method":"fp8"}', so no separate download is needed. The pre-quantized variants in this family are useful when you want deterministic static scales (calibrated offline, frozen on disk) or when online quantization is not desired.

This variant (full): mean latency 6.163 s/image in Pass A (21.3% below BF16), peak GPU 16.02 GB (26.0% less than BF16's 21.66 GB), and image quality of PSNR 20.58 dB / SSIM 0.7645 against the BF16 reference at identical (prompt, seed).

This is one of 7 FP8 quantization policies of Tongyi-MAI/Z-Image-Turbo prepared for vLLM-Omni serving. Each variant differs only in which transformer submodules are quantized to FP8 (E4M3 with per-tensor static scales); everything else stays in BF16.
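To make "E4M3 with per-tensor static scales" concrete, here is a minimal pure-Python sketch of the idea. It is not ModelOpt's implementation: the helper names (round_to_e4m3, static_scale, fake_quantize) are hypothetical, and it assumes the common convention that the scale maps the calibrated maximum magnitude onto 448, the largest finite E4M3 value.

```python
import math

E4M3_MAX = 448.0          # largest finite value in the E4M3 (fn) format
E4M3_MIN_NORMAL_EXP = -6  # below 2**-6 the spacing stays fixed (subnormals)


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest E4M3-representable value, saturating at +/-448."""
    if value == 0.0:
        return 0.0
    magnitude = min(abs(value), E4M3_MAX)
    # frexp gives magnitude = m * 2**exp with m in [0.5, 1), so the
    # exponent of the leading bit is exp - 1.
    exponent = max(math.frexp(magnitude)[1] - 1, E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits -> 8 steps per power of two
    rounded = min(round(magnitude / step) * step, E4M3_MAX)
    return math.copysign(rounded, value)


def static_scale(calibration_amax: float) -> float:
    """Per-tensor scale: one number per tensor, fixed after calibration (assumed convention)."""
    return calibration_amax / E4M3_MAX


def fake_quantize(values, scale):
    """Quantize to E4M3 with a fixed scale, then dequantize back to float."""
    return [round_to_e4m3(v / scale) * scale for v in values]


# Toy tensor: the largest magnitude (2.0) lands exactly on 448 after scaling.
weights = [0.5, -2.0, 1.0, 0.013]
scale = static_scale(max(abs(w) for w in weights))
print(fake_quantize(weights, scale))
```

Because the scale is a single frozen number per tensor, the same input always quantizes the same way, which is the determinism the pre-quantized variants offer over online quantization.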

In the tables below, online_fp8 is the overall best choice and the row labelled full corresponds to this repo.

Quick start (vLLM-Omni)

docker run --rm --gpus all --ipc=host \
  -p 18002:8002 \
  -e HF_HOME=/hf_cache \
  -e DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA \
  -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
  -v /your/hf/cache:/hf_cache \
  your-vllm-omni:latest \
    vllm-omni serve bahadirakdemir/Z-Image-Turbo-FP8-full \
    --omni --host 0.0.0.0 --port 8002 \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 1 \
    --quantization-config '{"method":"modelopt","quant_method":"FP8","is_checkpoint_fp8_serialized":true,"kv_cache_quant_method":null,"exclude_modules":["model*","lm_head*"]}'

Then call the OpenAI-style image API:

curl -s http://127.0.0.1:18002/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bahadirakdemir/Z-Image-Turbo-FP8-full",
    "prompt": "a cinematic photograph of an old fisherman standing on a foggy pier at sunrise",
    "size": "1024x512",
    "num_inference_steps": 8,
    "guidance_scale": 0.0,
    "n": 1,
    "response_format": "b64_json"
  }'
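The same request can be sent from Python using only the standard library. This is a sketch under two assumptions that the card does not confirm: that the response follows the OpenAI-style shape {"data": [{"b64_json": ...}]}, and that the server is reachable on the port mapped in the quick start. The function names are illustrative.

```python
import base64
import json
import urllib.request

# Assumed to match the quick-start port mapping (-p 18002:8002).
ENDPOINT = "http://127.0.0.1:18002/v1/images/generations"


def build_payload(prompt: str) -> dict:
    """Mirror the JSON body used in the curl example."""
    return {
        "model": "bahadirakdemir/Z-Image-Turbo-FP8-full",
        "prompt": prompt,
        "size": "1024x512",
        "num_inference_steps": 8,
        "guidance_scale": 0.0,
        "n": 1,
        "response_format": "b64_json",
    }


def decode_images(response: dict) -> list[bytes]:
    """Extract raw image bytes, assuming an OpenAI-style response body."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]


def generate(prompt: str, endpoint: str = ENDPOINT) -> list[bytes]:
    """POST the request and return the decoded image bytes."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return decode_images(json.load(reply))
```

With the server running, generate("a foggy pier at sunrise") returns a list with one bytes object, which can be written to disk in binary mode. The image container format is whatever the server encodes, so check it before choosing a file extension.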

Benchmark — all 7 FP8 policies at 8 steps, 1024×512 on NVIDIA GB10

Same hardware, same (prompt, seed) tuples, same Docker image, fresh container per variant.

Pass A — DIFFUSION_ATTENTION_BACKEND=SAGE_ATTN

| Variant | Mean latency (s) | Steps/s | Peak GPU (GB) | Δ latency vs BF16 | Mem savings | PSNR vs BF16 (dB) | SSIM vs BF16 |
|---|---|---|---|---|---|---|---|
| bf16_base | 7.831 | 1.02 | 21.66 | +0.0% | +0.0% | ref | ref |
| online_fp8 | 5.937 | 1.35 | 13.34 | -24.2% | +38.4% | 21.05 | 0.7942 |
| full | 6.163 | 1.30 | 16.02 | -21.3% | +26.0% | 20.58 | 0.7645 |
| ffn-attn-qkv | 6.007 | 1.33 | 16.48 | -23.3% | +23.9% | 20.99 | 0.7766 |
| ffn-attn-out | 6.400 | 1.25 | 17.41 | -18.3% | +19.6% | 19.76 | 0.7634 |
| ffn-only | 6.821 | 1.17 | 17.88 | -12.9% | +17.5% | 20.66 | 0.7789 |
| attn-only | 7.055 | 1.13 | 19.74 | -9.9% | +8.8% | 22.92 | 0.8413 |
| attn-qkv-only | 7.344 | 1.09 | 20.21 | -6.2% | +6.7% | 24.41 | 0.8423 |
| attn-out-only | 7.623 | 1.05 | 21.18 | -2.7% | +2.2% | 26.93 | 0.9027 |

PSNR and SSIM are computed against the BF16 reference image at the same prompt and seed. Higher is better; the bf16_base row shows ref because it is the reference.
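The derived columns of the Pass A table can be reproduced from the raw latency and memory readings. The sketch below uses illustrative function names. The Steps/s formula (inference steps divided by the full round-trip latency) is inferred from the table values, not stated by the card.

```python
def latency_delta_pct(latency_s: float, baseline_s: float) -> float:
    """Relative latency change vs the baseline; negative means faster."""
    return (latency_s - baseline_s) / baseline_s * 100.0


def memory_savings_pct(peak_gb: float, baseline_gb: float) -> float:
    """Share of the baseline's peak GPU memory that the variant saves."""
    return (baseline_gb - peak_gb) / baseline_gb * 100.0


def steps_per_second(steps: int, latency_s: float) -> float:
    """Inferred definition: denoise steps divided by round-trip latency."""
    return steps / latency_s


# Pass A readings for bf16_base and this variant (full), from the table.
BF16 = {"latency_s": 7.831, "peak_gb": 21.66}
FULL = {"latency_s": 6.163, "peak_gb": 16.02}

print(f"{latency_delta_pct(FULL['latency_s'], BF16['latency_s']):+.1f}%")  # -21.3%
print(f"{memory_savings_pct(FULL['peak_gb'], BF16['peak_gb']):+.1f}%")     # +26.0%
print(f"{steps_per_second(8, FULL['latency_s']):.2f}")                     # 1.30
```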

Pass B — top variants at DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA

On NVIDIA GB10 (compute capability SM 12.1), FLASH_ATTN is platform-gated and silently falls back to TORCH_SDPA. SageAttention was consistently slower than TORCH_SDPA on this Blackwell hardware (for example, 7.831 s vs 7.145 s for bf16_base and 6.163 s vs 5.706 s for full), so the production recommendation is TORCH_SDPA.

| Variant | Mean latency (s) | Steps/s | Peak GPU (GB) | Δ latency vs BF16 |
|---|---|---|---|---|
| bf16_base | 7.145 | 1.12 | 21.66 | +0.0% |
| online_fp8 | 5.637 | 1.42 | 13.32 | -21.1% |
| full | 5.706 | 1.40 | 16.03 | -20.1% |
| ffn-attn-qkv | 5.208 | 1.54 | 16.48 | -27.1% |

Image comparisons

The same 3 prompts are rendered across every variant with identical seeds.

Prompt 0 — "a cinematic photograph of an old fisherman standing on a foggy pier at sunrise"

comparison_prompt0

Prompt 1 — "a detailed watercolor painting of a mountain village beside a clear blue lake"

comparison_prompt1

Prompt 2 — "a futuristic city street at night with neon reflections on wet pavement"

comparison_prompt2

Benchmark parameters

| Parameter | Value |
|---|---|
| Resolution | 1024 × 512 |
| Inference steps | 8 |
| Guidance scale | 0.0 |
| Seeds | 1235, 1236, 1237 (one per prompt) |
| Warm-up images per variant | 1 |
| Timed runs per variant | 3 |
| Prompts | 3 (foggy pier, mountain village, neon street) |
| Engine | your-vllm-omni:latest (vLLM-Omni) |
| Server flags | --omni --gpu-memory-utilization 0.90 --tensor-parallel-size 1 |
| GPU memory utilization | 0.90 |
| Tensor parallel size | 1 |
| Hardware | NVIDIA GB10 (Grace Blackwell, SM 12.1) |
| Latency metric | Wall-clock seconds for one HTTP /v1/images/generations round-trip, averaged over 3 timed runs after 1 warm-up |
| GPU memory metric | Peak per-process GPU memory sampled via nvidia-smi --query-compute-apps=pid,used_memory while the container is running |
| Image quality | PSNR (dB) and SSIM, scikit-image, win_size=11, channel_axis=-1, against the BF16 image at the same (prompt, seed) |

All 7 repos in this family

  • bahadirakdemir/Z-Image-Turbo-FP8-full ← you are here
  • bahadirakdemir/Z-Image-Turbo-FP8-ffn-only
  • bahadirakdemir/Z-Image-Turbo-FP8-ffn-attn-out
  • bahadirakdemir/Z-Image-Turbo-FP8-ffn-attn-qkv
  • bahadirakdemir/Z-Image-Turbo-FP8-attn-only
  • bahadirakdemir/Z-Image-Turbo-FP8-attn-out-only
  • bahadirakdemir/Z-Image-Turbo-FP8-attn-qkv-only

License & attribution

  • Apache 2.0, inherited from upstream Tongyi-MAI/Z-Image-Turbo.
  • Quantization: NVIDIA ModelOpt FP8 PTQ with 8 calibration prompts × 9 denoise steps each.
  • Serving: vLLM-Omni.
  • Benchmarked on NVIDIA GB10 (Grace Blackwell, SM 12.1), 2026-05-17.