Kimi-K2.5-P48-NVFP4-W4A4-Preview

A preview W4A4 compressed checkpoint of moonshotai/Kimi-K2.5, combining paired-4:8 structured sparsity with NVFP4 weight + activation quantization on MoE expert weights — targeting NVIDIA Blackwell sparse tensor cores.

Preview status. This is an early checkpoint release for community evaluation. Benchmark numbers below may move before the final release.

Why this release? Most production LLM compression today picks one axis: either pure quantization (NVFP4, FP8, INT4) or pure sparsity (2:4), and rarely both at once on frontier-scale MoE models. This release goes one step further and demonstrates that sparsity + W4A4 quantization is viable on a trillion-parameter MoE, with end-to-end NVFP4×NVFP4 grouped GEMMs via FlashInfer's MoE FP4 kernels. We hope it motivates more open work (i.e., kernels, recipes, and tooling) on combined sparse–quantized compression.

  • Base model: moonshotai/Kimi-K2.5 (MoE, 384 experts)
  • Compression scheme: NVFP4 W4A4 (W = NVFP4, A = NVFP4 dynamic-local) + paired-4:8 sparsity on non-shared experts
  • Effective precision: ~2 bits/weight on non-shared expert linears (paired-4:8 × NVFP4)
  • Inference path: FlashInfer NVFP4×NVFP4 grouped GEMMs (Blackwell B200/B300, SM100/SM120)
  • Checkpoint size: 595 GB (storage is still dense-NVFP4; sparse storage is future work — see Future Work)

Compression Details

Field Value
Weight dtype NVFP4 (E2M1)
Weight group size 16
Weight scale dtype FP8 E4M3, per-group
Weight global scale FP32, per-tensor
Activation dtype NVFP4 (E2M1), dynamic: "local"
Activation group size 16
Activation group scale dtype FP8 E4M3, per-group (computed per batch at runtime)
Activation global scale FP32, per-tensor (stored in checkpoint, per expert linear)
Sparsity Paired 4:8 (NVIDIA Blackwell)
Quantized + sparsified layers Non-shared MoE experts (gate_proj, up_proj, down_proj)
Uncompressed lm_head, self_attn.*, shared_experts.*, router, embeddings
Format compressed-tensors (NVFP4PackedCompressor)

Paired-4:8 sparsity. Every 8 contiguous elements form 4 pairs of 2; exactly 2 of the 4 pairs are nonzero. The zeroed positions are stored as FP4 zero codes inside weight_packed, so the sparsity structure is implicit — there is no separate bitmask tensor in the file.

Per-linear keys:

  • weight_packed — FP4 values, full K dimension
  • weight_scale — FP8 E4M3 per-16 group weight scales
  • weight_global_scale — FP32 per-tensor weight global scale
  • input_global_scale — FP32 per-tensor activation global scale

How to Use

The weight format is a standard NVFP4 checkpoint — any inference stack with compressed-tensors NVFP4 support loads it directly. The vLLM flags below cover Kimi-K2.5-specific runtime needs (custom model code, chat-template parsers).

vLLM (with FlashInfer NVFP4 MoE kernels)

The recipe below follows the upstream vLLM guide for Kimi-K2.5: https://recipes.vllm.ai/moonshotai/Kimi-K2.5. Refer to that page for advanced options (long context, prefix caching, structured output) and version-specific notes. Tested on 4xB200.

uv pip install -U vllm --torch-backend=auto

VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2

Flag notes:

  • VLLM_USE_FLASHINFER_MOE_FP4=1 — enables FlashInfer's NVFP4×NVFP4 grouped GEMM path for MoE experts.

Then query the OpenAI-compatible endpoint at http://localhost:8000/v1.

Hardware

  • Blackwell (SM100 / SM120, e.g. B200): native NVFP4×NVFP4 compute support (e.g., FLASHINFER, CUTLASS)
  • Tested on: 4× B200.

Evaluation — OpenLLM Leaderboard v1

All evaluations run with lm-evaluation-harness v0.4.11 against a vLLM 0.21.0 server on 4× B200 with VLLM_USE_FLASHINFER_MOE_FP4=1.

Benchmark Setup Base (BF16) SparseGPT + GPTQ one-shot Ours Δ vs base
ARC-Challenge acc_norm, 25-shot 74.23 62.54 68.43 −5.80
HellaSwag acc_norm, 10-shot 91.86 84.90 88.70 −3.16
MMLU acc, 5-shot 89.57 81.83 85.45 −4.12
TruthfulQA mc2, 0-shot 62.54 55.83 60.33 −2.21
Winogrande acc, 5-shot 82.48 79.95 83.35 +0.87
GSM8K exact_match, 5-shot 94.39 79.98 87.79 −6.60
Average 82.51 74.17 79.01 −3.50

Recovery: 79.01 / 82.51 = 95.76% of base-model average accuracy.

SparseGPT + GPTQ one-shot baseline. Reference point at the same compression target: SparseGPT picks the paired-4:8 mask, GPTQ quantizes the masked weights to NVFP4 (89.89% recovery, no activation quant).

Future Work

This preview ships the dense NVFP4 storage format with paired-4:8 zeros embedded as FP4 zero codes. That keeps the checkpoint compatible with current compressed-tensors and vLLM loaders out of the box, but leaves two opportunities on the table:

  1. Sparse NVFP4 storage — emit only the 4 nonzero pairs per 8-element block plus the ordered-metadata tensor (ElementE) that CUTLASS / cuSPARSELt sparse-NVFP4 kernels expect. This cuts the on-disk and HBM footprint of the expert weights roughly in half. The paired-4:8 mask is structurally preserved in the current dense FP4 codes, so the conversion can run as an offline post-processing step on top of the released checkpoint.
  2. CUTLASS sparse NVFP4 kernels — wire up sparse GEMM kernels (SM100/SM120) for sparse tensor-core throughput at inference. FlashInfer's MoE FP4 path is the current default dense kernel, and we expect further throughput by utilizing sparse GEMM.

Both are tracked for the next release, not this preview.

Contact

For questions or open a discussion on this preview, please fill free to reach out to kwanhee.lee@postech.ac.kr.

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview

Quantized
(40)
this model

Papers for ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview

Evaluation results