Z-Image-Turbo FP8 — `full`

All transformer linears quantized to FP8: FFN (feed_forward.w[1-3]) plus full attention (attention.to_qkv, attention.to_out).

★ The overall best practical choice in our benchmark is online_fp8: 5.937 s/image (-24.2% vs BF16) and 13.34 GB peak GPU memory (38.4% less than BF16) in Pass A below, which is the lowest Pass A latency and the lowest peak memory of any variant. It runs straight from the upstream Tongyi-MAI/Z-Image-Turbo checkpoint with --quantization-config '{"method":"fp8"}', so no separate download is needed. The pre-quantized variants in this family are useful when you want deterministic static scales (calibrated offline, frozen on disk) or when online quantization is not an option.

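This variant (full): mean latency 6.163 s/image (-21.3% vs BF16), peak GPU memory 16.02 GB (26.0% less than BF16's 21.66 GB), and image quality of PSNR 20.58 dB / SSIM 0.7645 against the BF16 reference at identical (prompt, seed). These are the Pass A (SAGE_ATTN) figures; with TORCH_SDPA (Pass B) the latency drops to 5.706 s/image.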

This is one of 7 FP8 quantization policies of Tongyi-MAI/Z-Image-Turbo prepared for vLLM-Omni serving. Each variant differs only in which transformer submodules are quantized to FP8 (E4M3 with per-tensor static scales); everything else stays in BF16.
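To make "E4M3 with per-tensor static scales" concrete, here is a small numeric sketch in plain Python. It is an illustration only, not ModelOpt's or vLLM-Omni's implementation: the helper names are hypothetical, the scale is taken from the tensor's own maximum magnitude (activation scales would come from calibration data instead), and FP8 rounding is simulated in double precision.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (the "fn" variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)        # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)       # 3 mantissa bits give 8 steps per binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_per_tensor(weights: list[float]) -> tuple[list[float], float]:
    """Return (FP8-representable codes, scale) using one scale for the whole tensor."""
    amax = max(abs(w) for w in weights)
    # The single scale maps the largest magnitude onto 448; it is computed
    # once and then frozen, which is what "static" refers to.
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize(codes: list[float], scale: float) -> list[float]:
    """Map FP8 codes back to approximate real values."""
    return [c * scale for c in codes]
```

For example, `quantize_per_tensor([0.25, -1.75, 0.7])` stores the codes with a single scale of 1/256, and dequantizing returns 0.6875 for the original 0.7. That rounding error is the source of the PSNR/SSIM gap against BF16 reported below.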

In the tables below, ★ marks the overall best choice (online_fp8) and ← marks the row that corresponds to this repo.

Quick start (vLLM-Omni)

docker run --rm --gpus all --ipc=host \
  -p 18002:8002 \
  -e HF_HOME=/hf_cache \
  -e DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA \
  -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
  -v /your/hf/cache:/hf_cache \
  your-vllm-omni:latest \
    vllm-omni serve bahadirakdemir/Z-Image-Turbo-FP8-full \
    --omni --host 0.0.0.0 --port 8002 \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 1 \
    --quantization-config '{"method":"modelopt","quant_method":"FP8","is_checkpoint_fp8_serialized":true,"kv_cache_quant_method":null,"exclude_modules":["model*","lm_head*"]}'

Then call the OpenAI-style image API:

curl -s http://127.0.0.1:18002/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bahadirakdemir/Z-Image-Turbo-FP8-full",
    "prompt": "a cinematic photograph of an old fisherman standing on a foggy pier at sunrise",
    "size": "1024x512",
    "num_inference_steps": 8,
    "guidance_scale": 0.0,
    "n": 1,
    "response_format": "b64_json"
  }'
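The same request can be sent from Python. The sketch below uses only the standard library and mirrors the fields of the curl call. It assumes the response follows the OpenAI image format (`{"data": [{"b64_json": ...}]}`), which this card implies but does not spell out, and the function names are hypothetical.

```python
import base64
import json
import urllib.request

# Host port published by the docker command above (18002 -> 8002).
URL = "http://127.0.0.1:18002/v1/images/generations"


def build_payload(prompt: str) -> dict:
    """Mirror the fields used in the curl request."""
    return {
        "model": "bahadirakdemir/Z-Image-Turbo-FP8-full",
        "prompt": prompt,
        "size": "1024x512",
        "num_inference_steps": 8,
        "guidance_scale": 0.0,
        "n": 1,
        "response_format": "b64_json",
    }


def decode_images(response: dict) -> list[bytes]:
    """Extract raw image bytes; assumes the OpenAI-style response shape."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]


def generate(prompt: str, url: str = URL) -> list[bytes]:
    """POST the request and return decoded image bytes (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return decode_images(json.load(resp))
```

With the server running, `generate("a foggy pier at sunrise")[0]` returns the bytes of the first image, which can be written to a file opened in binary mode.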

Benchmark — all 7 FP8 policies at 8 steps, 1024×512 on NVIDIA GB10

Same hardware, same (prompt, seed) tuples, same Docker image, fresh container per variant.

Pass A — `DIFFUSION_ATTENTION_BACKEND=SAGE_ATTN`

	Variant	Mean latency (s)	Steps/s	Peak GPU (GB)	Δ latency vs BF16	Peak GPU savings vs BF16	PSNR vs BF16 (dB)	SSIM vs BF16
	`bf16_base`	7.831	1.02	21.66	+0.0%	+0.0%	ref	ref
★	`online_fp8`	5.937	1.35	13.34	-24.2%	+38.4%	21.05	0.7942
←	`full`	6.163	1.30	16.02	-21.3%	+26.0%	20.58	0.7645
	`ffn-attn-qkv`	6.007	1.33	16.48	-23.3%	+23.9%	20.99	0.7766
	`ffn-attn-out`	6.400	1.25	17.41	-18.3%	+19.6%	19.76	0.7634
	`ffn-only`	6.821	1.17	17.88	-12.9%	+17.5%	20.66	0.7789
	`attn-only`	7.055	1.13	19.74	-9.9%	+8.8%	22.92	0.8413
	`attn-qkv-only`	7.344	1.09	20.21	-6.2%	+6.7%	24.41	0.8423
	`attn-out-only`	7.623	1.05	21.18	-2.7%	+2.2%	26.93	0.9027

PSNR and SSIM are computed against the BF16 reference image at the same prompt and seed; higher means closer to BF16. The bf16_base row shows ref because it is the reference itself.
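As a reading aid for the PSNR column, here is a minimal pure-Python sketch of the metric. The reported numbers come from scikit-image, not from this code; the sketch assumes 8-bit images (peak value 255) flattened to a sequence of channel values, and both function names are hypothetical.

```python
import math


def psnr(reference: list[int], test: list[int], max_value: int = 255) -> float:
    """PSNR in dB between two equal-length sequences of channel values."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of samples")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_from_psnr(psnr_db: float, max_value: int = 255) -> float:
    """Invert the PSNR formula to get the root-mean-square error."""
    return max_value / 10.0 ** (psnr_db / 20.0)
```

Under the 8-bit assumption, `rmse_from_psnr(20.58)` is about 23.9, so the `full` variant differs from the BF16 image by roughly 24 levels out of 255 per channel on average. These values measure pixel-level divergence from one particular BF16 sample, not perceived quality.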

Pass B — top variants at `DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA`

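On NVIDIA GB10 (compute capability SM 12.1), FLASH_ATTN is platform-gated and silently falls back to TORCH_SDPA. SageAttention is consistently slower than TORCH_SDPA on this Blackwell part (for example 7.831 s vs 7.145 s for bf16_base, and 6.163 s vs 5.706 s for full), so the production recommendation is TORCH_SDPA. In this pass ffn-attn-qkv has the lowest latency, while online_fp8 keeps the lowest peak memory.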

	Variant	Mean latency (s)	Steps/s	Peak GPU (GB)	Δ latency vs BF16
	`bf16_base`	7.145	1.12	21.66	+0.0%
★	`online_fp8`	5.637	1.42	13.32	-21.1%
←	`full`	5.706	1.40	16.03	-20.1%
	`ffn-attn-qkv`	5.208	1.54	16.48	-27.1%
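The derived columns can be recomputed from the raw latencies. The formulas below are inferred from the table values (they reproduce the published figures), not taken from the benchmark script, and the function names are hypothetical.

```python
STEPS = 8  # inference steps per image, from the benchmark parameters


def steps_per_second(latency_s: float, steps: int = STEPS) -> float:
    """Denoising steps completed per second of end-to-end latency."""
    return steps / latency_s


def pct_change(value: float, baseline: float) -> float:
    """Signed percentage change relative to the baseline (negative = lower)."""
    return (value - baseline) / baseline * 100.0


# Pass B latencies (s/image) copied from the table above.
pass_b = {
    "bf16_base": 7.145,
    "online_fp8": 5.637,
    "full": 5.706,
    "ffn-attn-qkv": 5.208,
}

for name, latency in pass_b.items():
    delta = pct_change(latency, pass_b["bf16_base"])
    print(f"{name:14s} {steps_per_second(latency):.2f} steps/s  {delta:+.1f}%")
```

The memory-savings column in Pass A is the same change with the sign flipped: `-pct_change(16.02, 21.66)` gives about 26.0 for this variant.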

Image comparisons

Same 3 prompts rendered across every variant with identical seeds.

Prompt 0 — "a cinematic photograph of an old fisherman standing on a foggy pier at sunrise"

Prompt 1 — "a detailed watercolor painting of a mountain village beside a clear blue lake"

Prompt 2 — "a futuristic city street at night with neon reflections on wet pavement"

Benchmark parameters

Parameter	Value
Resolution	1024 × 512
Inference steps	8
Guidance scale	0.0
Seeds	1235, 1236, 1237 (one per prompt)
Warm-up images per variant	1
Timed runs per variant	3
Prompts	3 (foggy pier, mountain village, neon street)
Engine	`your-vllm-omni:latest` (vLLM-Omni)
Server flags	`--omni --gpu-memory-utilization 0.90 --tensor-parallel-size 1`
GPU memory utilization	0.90
Tensor parallel size	1
Hardware	NVIDIA GB10 (Grace Blackwell, SM 12.1)
Latency metric	Wall-clock seconds for one HTTP `/v1/images/generations` round-trip, averaged over 3 timed runs after 1 warm-up
GPU memory metric	Peak per-process GPU memory sampled via `nvidia-smi --query-compute-apps=pid,used_memory` while the container is running
Image quality	PSNR (dB) and SSIM, scikit-image, win_size=11, channel_axis=-1, against the BF16 image at the same (prompt, seed)
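The latency and memory rows above describe a simple protocol, sketched here under stated assumptions. It is not the actual benchmark harness: `request_fn` stands for any callable that performs one `/v1/images/generations` round-trip, and the memory parser assumes nvidia-smi was run with `--format=csv,noheader,nounits`, a format flag this card does not state.

```python
import statistics
import time
from typing import Callable


def mean_latency(request_fn: Callable[[], object], warmup: int = 1, runs: int = 3) -> float:
    """Mean wall-clock seconds per call, discarding the warm-up calls."""
    for _ in range(warmup):
        request_fn()  # warm-up calls are executed but not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        request_fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def max_used_memory_mib(csv_text: str) -> int:
    """Largest used_memory value among `pid, used_memory` CSV rows (no header, no units)."""
    values = [int(line.split(",")[1]) for line in csv_text.splitlines() if line.strip()]
    return max(values, default=0)
```

Calling `max_used_memory_mib` on each nvidia-smi sample and keeping the running maximum gives a peak in the units nvidia-smi reports (MiB). How those readings were converted to the GB figures in the tables is not stated here.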

All 7 repos in this family

bahadirakdemir/Z-Image-Turbo-FP8-full ← you are here
bahadirakdemir/Z-Image-Turbo-FP8-ffn-only
bahadirakdemir/Z-Image-Turbo-FP8-ffn-attn-out
bahadirakdemir/Z-Image-Turbo-FP8-ffn-attn-qkv
bahadirakdemir/Z-Image-Turbo-FP8-attn-only
bahadirakdemir/Z-Image-Turbo-FP8-attn-out-only
bahadirakdemir/Z-Image-Turbo-FP8-attn-qkv-only

License & attribution

Apache 2.0, inherited from upstream Tongyi-MAI/Z-Image-Turbo.
Quantization: NVIDIA ModelOpt FP8 PTQ with 8 calibration prompts × 9 denoise steps each.
Serving: vLLM-Omni.
Benchmarked on NVIDIA GB10 (Grace Blackwell, SM 12.1), 2026-05-17.
