Serving DeepSeek-v4-Fable on RTX PRO 6000 (SM120): checkpoint is BF16 but config declares fp8; compressor fused_wkv_wgate scale KeyError

#9
by dradra0 - opened

Thanks for releasing DeepSeek-v4-Fable! I'm trying to serve it on 8× RTX PRO 6000 (SM120) and hit a checkpoint-format question I can't resolve.

Observations

  • The published safetensors are all BF16 (merge_info.json: output_dtype: torch.bfloat16), but config.json still carries a quantization_config (fp8, e4m3, block [128,128], scale_fmt: ue8m0) inherited from the base. Loaders read that and try to load the BF16 weights as FP8 → storage/shape errors (e.g. setStorage ... out of bounds).
  • Removing the stale quantization_config lets it load as BF16, but the SM120 DeepSeek-V4 kernels (vllm-ds4-sm120, b12x) are FP8-only, so the BF16 forward fails (ColumnParallelLinear has no attribute 'weight_scale_inv').
  • So I quantized it offline to FP8 block, matching sgl-project/DeepSeek-V4-Flash-FP8's layout: per-expert experts.N.wN.weight (F8_E4M3) + .scale (F32, [rows/128, cols/128]); attn wkv/wq_a/wq_b/wo_a/wo_b + indexer.wq_b quantized; compressor.* / indexer.compressor.* / weights_proj kept BF16.
  • With ununnilium/vllm-ds4-sm120:20260618 + Triton sparse MLA, experts and the main attention now load fine, but it fails at:
    KeyError: 'layers.N.attn.compressor.fused_wkv_wgate.weight_scale_inv'
    The model fuses compressor wkv+wgate into fused_wkv_wgate and expects a fused block-FP8 scale. Whether I quantize the compressor or leave it BF16, the param isn't in params_dict. Oddly, sgl-project/DeepSeek-V4-Flash-FP8 also stores compressor.wkv/wgate as separate BF16 (no scale, no fused tensor), yet is reported to serve, so I'm clearly missing how the CSA compressor is meant to be quantized/registered.
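
The dtype and naming observations above can be checked without loading any weights, by reading only the JSON header of each safetensors shard. The following is a minimal sketch using the Python standard library; the helper names `read_safetensors_header` and `summarize` are made up for illustration, and the only format assumption is the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header).

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header, needle="compressor"):
    """Count stored dtypes and collect tensors whose name contains `needle`."""
    dtype_counts = Counter(info["dtype"] for info in header.values())
    matches = {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if needle in name
    }
    return dtype_counts, matches
```

Running `summarize(read_safetensors_header(shard))` over each shard shows whether any tensor is actually stored as F8_E4M3, and which compressor tensors (and scales, if any) exist under which names, which can then be compared with the key the loader reports as missing.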

Questions

  1. Is there an official FP8 (or otherwise SM120-servable) checkpoint of Fable, or the exact conversion/quantization script you used to produce it?
  2. Specifically, how should the CSA compressor (wkv/wgate) be quantized and named so the vLLM DeepSeek-V4 loader's attn.compressor.fused_wkv_wgate.weight_scale_inv is satisfied?
  3. Recommended serving command + image for RTX PRO 6000 (SM120)?
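
For context on question 2, the block layout mentioned in the observations implies one scale per 128×128 tile, so a [rows, cols] weight gets a scale grid of shape [rows/128, cols/128]. The sketch below computes such a grid in pure Python for illustration only. It is not the conversion script used for any published checkpoint: the function name `block_scales` is hypothetical, the scale is assumed to be the dequantization factor amax/448 (448 being the largest finite E4M3 value), and rounding up to a power of two is an assumption about what scale_fmt: ue8m0 implies.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def block_scales(weight, block=128, pow2=False):
    """Per-block dequantization scales for a 2-D weight given as nested lists.

    Returns a grid of shape [ceil(rows / block), ceil(cols / block)].
    With pow2=True each scale is rounded up to a power of two
    (assumed here to be what a ue8m0 scale format requires).
    """
    rows, cols = len(weight), len(weight[0])
    grid = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            # Largest magnitude inside this tile (edge tiles may be smaller).
            amax = max(
                abs(v)
                for row in weight[r0:r0 + block]
                for v in row[c0:c0 + block]
            )
            # An all-zero tile gets a neutral scale to avoid log2(0).
            scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
            if pow2:
                scale = 2.0 ** math.ceil(math.log2(scale))
            row_scales.append(scale)
        grid.append(row_scales)
    return grid
```

Whether the loader wants the wkv and wgate grids stored separately or as a single fused tensor is exactly the open question; the sketch only shows the per-tile shape and value convention assumed here.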

Thanks a lot. Everything else (quant pipeline, image, format) is in place; this compressor param is the only blocker.
