Z-Image-Turbo FP8 — full
All transformer linear layers are quantized to FP8: the FFN (feed_forward.w[1-3]) plus full attention (attention.to_qkv, attention.to_out).
★ The overall best practical choice in our benchmark is online_fp8: 5.937 s/image (-24.2% vs BF16) and peak GPU memory of 13.34 GB (38.4% less than BF16) in Pass A. It runs straight from the upstream Tongyi-MAI/Z-Image-Turbo checkpoint with --quantization-config '{"method":"fp8"}', so no separate download is needed. The pre-quantized variants in this family are useful when you want deterministic static scales (calibrated offline, frozen on disk) or when online quantization is not desired.
This variant (full): mean latency 6.163 s/image (-21.3% vs BF16), peak GPU memory 16.02 GB (26.0% less than BF16's 21.66 GB), and image quality of PSNR 20.58 dB / SSIM 0.7645 against the BF16 reference at identical (prompt, seed), all measured in Pass A.
This is one of 7 FP8 quantization policies of Tongyi-MAI/Z-Image-Turbo prepared for vLLM-Omni serving. Each variant differs only in which transformer submodules are quantized to FP8 (E4M3 with per-tensor static scales); everything else stays in BF16.
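To make "E4M3 with per-tensor static scales" concrete, here is a minimal pure-Python simulation. It is a sketch for intuition only, not the ModelOpt implementation: the function names are hypothetical, and it assumes the common convention of mapping the calibrated absolute maximum onto 448, the largest finite E4M3 value.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def round_to_e4m3(x):
    """Round a float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    # Exponent of the binade containing a; values below 2**-6 are subnormal.
    exp = max(math.frexp(a)[1] - 1, -6)
    # Three mantissa bits give 8 steps per binade.
    step = 2.0 ** (exp - 3)
    return math.copysign(round(a / step) * step, x)


def static_scale(calibrated_amax):
    """Per-tensor scale, fixed after calibration (assumed amax / 448 convention)."""
    return calibrated_amax / E4M3_MAX


def fake_quantize(values, scale):
    """Quantize to E4M3 and dequantize again, using one scale for the whole tensor."""
    return [round_to_e4m3(v / scale) * scale for v in values]
```

Because the scale is frozen at calibration time, any value larger than the calibrated maximum saturates instead of triggering a rescale, which is what makes the static variants deterministic.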
In the tables below, ★ marks the overall best choice (online_fp8) and ← marks the row that corresponds to this repo.
Quick start (vLLM-Omni)
docker run --rm --gpus all --ipc=host \
-p 18002:8002 \
-e HF_HOME=/hf_cache \
-e DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA \
-e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
-v /your/hf/cache:/hf_cache \
your-vllm-omni:latest \
vllm-omni serve bahadirakdemir/Z-Image-Turbo-FP8-full \
--omni --host 0.0.0.0 --port 8002 \
--gpu-memory-utilization 0.90 \
--tensor-parallel-size 1 \
--quantization-config '{"method":"modelopt","quant_method":"FP8","is_checkpoint_fp8_serialized":true,"kv_cache_quant_method":null,"exclude_modules":["model*","lm_head*"]}'
Then call the OpenAI-style image API:
curl -s http://127.0.0.1:18002/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "bahadirakdemir/Z-Image-Turbo-FP8-full",
"prompt": "a cinematic photograph of an old fisherman standing on a foggy pier at sunrise",
"size": "1024x512",
"num_inference_steps": 8,
"guidance_scale": 0.0,
"n": 1,
"response_format": "b64_json"
}'
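The same request can be sent from Python using only the standard library. This is a hedged sketch: it assumes the server returns the OpenAI-style shape `{"data": [{"b64_json": "..."}]}`, and the helper names, the `out.png` file name, and the 300-second timeout are illustrative choices, not part of vLLM-Omni.

```python
import base64
import json
import urllib.request

# Host port 18002 matches the -p 18002:8002 mapping in the docker command.
API_URL = "http://127.0.0.1:18002/v1/images/generations"


def build_payload(prompt, size="1024x512", steps=8):
    """Request body mirroring the curl example."""
    return {
        "model": "bahadirakdemir/Z-Image-Turbo-FP8-full",
        "prompt": prompt,
        "size": size,
        "num_inference_steps": steps,
        "guidance_scale": 0.0,
        "n": 1,
        "response_format": "b64_json",
    }


def extract_image_bytes(response_body):
    """Decode the first image; assumes an OpenAI-style response shape."""
    return base64.b64decode(response_body["data"][0]["b64_json"])


def generate(prompt, out_path="out.png", url=API_URL, timeout=300):
    """POST the prompt and write the decoded image bytes to out_path."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    with open(out_path, "wb") as handle:
        handle.write(extract_image_bytes(body))
    return out_path
```

With the container running, `generate("a futuristic city street at night with neon reflections on wet pavement")` would write the decoded image to `out.png`.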
Benchmark — all 7 FP8 policies at 8 steps, 1024×512 on NVIDIA GB10
Same hardware, same (prompt, seed) tuples, same Docker image, fresh container per variant.
Pass A — DIFFUSION_ATTENTION_BACKEND=SAGE_ATTN
| | Variant | Mean latency (s) | Steps/s | Peak GPU (GB) | Δ latency vs BF16 | Mem savings | PSNR vs BF16 (dB) | SSIM vs BF16 |
|---|---|---|---|---|---|---|---|---|
| | bf16_base | 7.831 | 1.02 | 21.66 | +0.0% | +0.0% | ref | ref |
| ★ | online_fp8 | 5.937 | 1.35 | 13.34 | -24.2% | +38.4% | 21.05 | 0.7942 |
| ← | full | 6.163 | 1.30 | 16.02 | -21.3% | +26.0% | 20.58 | 0.7645 |
| | ffn-attn-qkv | 6.007 | 1.33 | 16.48 | -23.3% | +23.9% | 20.99 | 0.7766 |
| | ffn-attn-out | 6.400 | 1.25 | 17.41 | -18.3% | +19.6% | 19.76 | 0.7634 |
| | ffn-only | 6.821 | 1.17 | 17.88 | -12.9% | +17.5% | 20.66 | 0.7789 |
| | attn-only | 7.055 | 1.13 | 19.74 | -9.9% | +8.8% | 22.92 | 0.8413 |
| | attn-qkv-only | 7.344 | 1.09 | 20.21 | -6.2% | +6.7% | 24.41 | 0.8423 |
| | attn-out-only | 7.623 | 1.05 | 21.18 | -2.7% | +2.2% | 26.93 | 0.9027 |
PSNR / SSIM are computed against the BF16 reference image at the same prompt and seed. Higher is better. The bf16_base row is the reference itself, so its entries read "ref".
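For readers unfamiliar with PSNR, the definition is short enough to sketch in plain Python. The benchmark itself uses scikit-image (see Benchmark parameters); the function below is only an illustration of the formula on flat lists of 8-bit pixel values, and its name and signature are assumptions.

```python
import math


def psnr(reference, candidate, data_range=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)
```

Since PSNR is logarithmic in the mean squared error, each 10 dB corresponds to a tenfold change in that error.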
Pass B — top variants at DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA
On NVIDIA GB10 (compute capability SM 12.1), FLASH_ATTN is platform-gated and silently falls back to TORCH_SDPA. SageAttention has a small but consistent overhead vs TORCH_SDPA on Blackwell, so the production recommendation is TORCH_SDPA.
| | Variant | Mean latency (s) | Steps/s | Peak GPU (GB) | Δ latency vs BF16 |
|---|---|---|---|---|---|
| | bf16_base | 7.145 | 1.12 | 21.66 | +0.0% |
| ★ | online_fp8 | 5.637 | 1.42 | 13.32 | -21.1% |
| ← | full | 5.706 | 1.40 | 16.03 | -20.1% |
| | ffn-attn-qkv | 5.208 | 1.54 | 16.48 | -27.1% |
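The derived columns follow directly from the measured ones. The sketch below recomputes them for the full variant from the Pass B figures. The helper names are illustrative, and it assumes Steps/s is simply the 8 inference steps divided by mean latency, which matches the published values.

```python
def percent_change(value, baseline):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def steps_per_second(steps, latency_s):
    """Throughput in denoising steps per second of wall-clock latency."""
    return steps / latency_s


# Pass B measurements copied from the table.
BF16_LATENCY_S, BF16_PEAK_GB = 7.145, 21.66
FULL_LATENCY_S, FULL_PEAK_GB = 5.706, 16.03

full_delta_latency = round(percent_change(FULL_LATENCY_S, BF16_LATENCY_S), 1)
full_memory_saving = round(-percent_change(FULL_PEAK_GB, BF16_PEAK_GB), 1)
full_steps_per_s = round(steps_per_second(8, FULL_LATENCY_S), 2)
```

A negative latency delta means the variant is faster than BF16. Memory savings are reported with the opposite sign, so larger is better.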
Image comparisons
The same 3 prompts are rendered across every variant with identical seeds.
Prompt 0 — "a cinematic photograph of an old fisherman standing on a foggy pier at sunrise"
Prompt 1 — "a detailed watercolor painting of a mountain village beside a clear blue lake"
Prompt 2 — "a futuristic city street at night with neon reflections on wet pavement"
Benchmark parameters
| Parameter | Value |
|---|---|
| Resolution | 1024 × 512 |
| Inference steps | 8 |
| Guidance scale | 0.0 |
| Seeds | 1235, 1236, 1237 (one per prompt) |
| Warm-up images per variant | 1 |
| Timed runs per variant | 3 |
| Prompts | 3 (foggy pier, mountain village, neon street) |
| Engine | your-vllm-omni:latest (vLLM-Omni) |
| Server flags | --omni --gpu-memory-utilization 0.90 --tensor-parallel-size 1 |
| GPU memory utilization | 0.90 |
| Tensor parallel size | 1 |
| Hardware | NVIDIA GB10 (Grace Blackwell, SM 12.1) |
| Latency metric | Wall-clock seconds for one HTTP /v1/images/generations round-trip, averaged over 3 timed runs after 1 warm-up |
| GPU memory metric | Peak per-process GPU memory sampled via nvidia-smi --query-compute-apps=pid,used_memory while the container is running |
| Image quality | PSNR (dB) and SSIM, scikit-image, win_size=11, channel_axis=-1, against the BF16 image at the same (prompt, seed) |
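The latency metric (one warm-up, then the mean of three timed round-trips) can be sketched as a small harness. This is an assumption about how such a measurement might be structured, not the script used for the benchmark. `run_once` stands for any callable that performs one complete /v1/images/generations request.

```python
import statistics
import time


def mean_latency(run_once, warmup=1, timed=3, clock=time.perf_counter):
    """Mean wall-clock seconds per call, after discarding warm-up calls."""
    for _ in range(warmup):
        run_once()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(timed):
        start = clock()
        run_once()
        samples.append(clock() - start)
    return statistics.mean(samples)
```

The `clock` parameter exists only so the harness can be exercised with a fake clock. In normal use the default `time.perf_counter` is appropriate.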
All 7 repos in this family
- bahadirakdemir/Z-Image-Turbo-FP8-full ← you are here
- bahadirakdemir/Z-Image-Turbo-FP8-ffn-only
- bahadirakdemir/Z-Image-Turbo-FP8-ffn-attn-out
- bahadirakdemir/Z-Image-Turbo-FP8-ffn-attn-qkv
- bahadirakdemir/Z-Image-Turbo-FP8-attn-only
- bahadirakdemir/Z-Image-Turbo-FP8-attn-out-only
- bahadirakdemir/Z-Image-Turbo-FP8-attn-qkv-only
License & attribution
- Apache 2.0, inherited from upstream Tongyi-MAI/Z-Image-Turbo.
- Quantization: NVIDIA ModelOpt FP8 PTQ with 8 calibration prompts × 9 denoise steps each.
- Serving: vLLM-Omni.
- Benchmarked on NVIDIA GB10 (Grace Blackwell, SM 12.1), 2026-05-17.

