Kimi-K2.6-MXFP4

MXFP4-quantized version of moonshotai/Kimi-K2.6, produced with AMD Quark via the quanto toolkit.

Benchmark Results

MMLU 5-shot log-likelihood evaluation. All runs use the same prompt format (no chat template) for direct comparability.

| Precision | Model | MMLU 5-shot (acc) | Δ vs W4A16 |
|---|---|---|---|
| W4A16 (compressed-tensors) | moonshotai/Kimi-K2.6 (official) | 89.62% | - |
| MXFP4 (OCP MX, Quark RTN) | This model | 89.05% | -0.57% |

The official moonshotai/Kimi-K2.6 release uses W4A16 compressed-tensors quantization (WNA16 MoE method).
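As a rough illustration of what a log-likelihood multiple-choice evaluation measures, the sketch below scores each answer choice by its log-likelihood, picks the highest, and computes accuracy. The function names and the toy numbers are hypothetical; this is not the evaluation harness used for the table above. The last line only reproduces the reported accuracy difference from the two table values.

```python
def pick_answer(choice_logprobs):
    """Return the index of the answer choice with the highest log-likelihood."""
    return max(range(len(choice_logprobs)), key=choice_logprobs.__getitem__)


def accuracy(examples):
    """examples: list of (choice_logprobs, gold_index) pairs."""
    correct = sum(pick_answer(logprobs) == gold for logprobs, gold in examples)
    return correct / len(examples)


# Toy data (made up): four choices A-D per question, gold answer as an index.
toy_examples = [
    ([-1.2, -0.3, -2.0, -4.1], 1),  # model prefers B, gold is B
    ([-0.1, -3.0, -3.0, -3.0], 2),  # model prefers A, gold is C
]
toy_accuracy = accuracy(toy_examples)

# Difference between the two reported accuracies, in percentage points.
delta_pp = round(89.05 - 89.62, 2)
```

Because both rows are scored with the same prompt format, the difference reflects the quantization scheme rather than prompt handling.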

Quantization Details

| Property | Value |
|---|---|
| Method | MXFP4 (MX Floating Point 4-bit) |
| Algorithm | RTN (Round-to-Nearest) |
| Weight dtype | FP4 (E2M1), OCP MX format |
| Activation dtype | FP4 (E2M1), dynamic per-group |
| Scale format | E8M0 (per-group of 32) |
| Group size | 32 |
| Tool | AMD Quark 0.11.1 + quanto |
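To make the table concrete, here is a minimal pure-Python sketch of round-to-nearest MXFP4 quantization for one group: a shared power-of-two scale (E8M0) is derived from the group's largest magnitude, and each element is rounded to the nearest FP4 E2M1 value. It follows the general OCP MX scheme as I understand it and is not Quark's implementation; the function names are hypothetical, and ties are broken toward the smaller magnitude for simplicity rather than with round-to-nearest-even.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
GROUP_SIZE = 32    # elements sharing one E8M0 scale


def quantize_group(values):
    """Return (shared_exponent, signed E2M1 values) for one group."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    floor_log2 = math.frexp(amax)[1] - 1
    # E8M0 stores only an exponent; clamp to its representable range.
    shared_exp = max(-127, min(127, floor_log2 - E2M1_EMAX))
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        scaled = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - scaled))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_group(shared_exp, quantized):
    """Reconstruct approximate real values from the scale and E2M1 codes."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


# A short group for illustration (a real group holds GROUP_SIZE elements).
exp, codes = quantize_group([12.0, 1.0, -3.1, 0.2])
restored = dequantize_group(exp, codes)
```

With a group maximum of 12.0 the shared scale is 2, so 12.0 and 1.0 are reproduced exactly while -3.1 and 0.2 snap to the nearest grid points (-3.0 and 0.0).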

Model Architecture

Kimi-K2.6 is a 1-trillion-parameter Mixture-of-Experts language model with:

  • Total parameters: ~1T
  • Active parameters per token: ~32B
  • Architecture: MoE with Multi-head Latent Attention (MLA), 61 transformer layers
  • Experts: 384 routed + 1 shared expert per MoE layer, top-8 routing
  • Context length: 128K tokens
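The "384 routed experts, top-8 routing" item can be illustrated with a small sketch of top-k expert selection for a single token. Only the expert count and k come from the list above; the gate scoring and the softmax normalization over the selected experts are assumptions made for illustration, and the model's actual router may differ.

```python
import heapq
import math

NUM_ROUTED_EXPERTS = 384
TOP_K = 8


def route_token(gate_logits, top_k=TOP_K):
    """Pick the top_k routed experts for one token.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    The shared expert is always applied and is therefore not routed here.
    """
    top = heapq.nlargest(top_k, range(len(gate_logits)), key=gate_logits.__getitem__)
    # Assumption: normalize with a softmax over the selected logits only.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Toy gate output: eight experts get clearly higher scores than the rest.
logits = [0.0] * NUM_ROUTED_EXPERTS
for n in range(TOP_K):
    logits[10 * n] = 5.0 + n
selected = route_token(logits)
```

Each token therefore touches 8 of the 384 routed experts plus the shared expert, which is why the active parameter count (~32B) is far below the total (~1T).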

Usage

from vllm import LLM, SamplingParams

llm = LLM(
    model="haanjack/Kimi-K2.6-MXFP4",
    tensor_parallel_size=4,
    trust_remote_code=True,
    max_model_len=32768,
    enforce_eager=True,          # required: avoids HIP kernel crash during graph capture
    gpu_memory_utilization=0.85,
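)

# Illustrative sampling settings and prompt; adjust for your use case.
params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain MXFP4 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)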

Required environment variables (AMD ROCm):

export QUARK_MXFP4_IMPL=triton   # use Triton kernel (avoids HIP C++ kernel crash on gfx950)
export PYTORCH_ROCM_ARCH=gfx950   # set to your GPU architecture for fast kernel compilation

Note: This model requires AMD Quark and a recent vLLM build with Quark support (quantization=quark). Tested with rocm/vllm-dev:nightly (vLLM 0.20.1rc1, ROCm 7.2, AMD MI355).
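When launching from a Python script rather than a shell, the same two variables can be set in-process. The sketch below assumes they must be in place before vLLM is imported, which is the conservative choice; the helper name is hypothetical.

```python
import os

# Values taken from the export lines above; set PYTORCH_ROCM_ARCH to your GPU.
ROCM_ENV = {
    "QUARK_MXFP4_IMPL": "triton",
    "PYTORCH_ROCM_ARCH": "gfx950",
}


def apply_rocm_env(env=os.environ):
    """Set each variable only if it is not already defined by the caller."""
    for key, value in ROCM_ENV.items():
        env.setdefault(key, value)
    return env


apply_rocm_env()
# Import vllm and construct the LLM only after this point.
```

Using `setdefault` keeps any value already exported in the shell, so a different GPU architecture set by the user is not overwritten.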

Serving with vLLM

QUARK_MXFP4_IMPL=triton PYTORCH_ROCM_ARCH=gfx950 \
python -m vllm.entrypoints.openai.api_server \
    --model haanjack/Kimi-K2.6-MXFP4 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --enforce-eager
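Once the server is running, it can be queried through its OpenAI-compatible completions endpoint. The following is a minimal standard-library client sketch; it assumes the server listens on vLLM's default `localhost:8000`, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default host and port
MODEL = "haanjack/Kimi-K2.6-MXFP4"


def build_completion_request(prompt, max_tokens=64):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64):
    """Send the request and return the generated text (needs a live server)."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

For example, `complete("The capital of France is")` returns the model's continuation once the server has finished loading the weights.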

Quantization Recipe

Quantized using the quanto CLI:

python -m quanto \
    --model_path moonshotai/Kimi-K2.6 \
    --output_dir ./kimi-k2.6-mxfp4 \
    --precision mxfp4 \
    --exclude_layers lm_head "*self_attn*" "*.gate" "*shared_experts*" "*embed*" "*norm*"

The mxfp4 precision triggers Quark's quantize_model_per_safetensor (file-to-file) path, which processes each safetensors shard independently without loading the full model into GPU memory.
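To see which layers the `--exclude_layers` patterns in the command above would leave unquantized, the sketch below applies them as shell-style globs. Whether quanto matches patterns exactly this way is an assumption, and the layer names are hypothetical examples in a DeepSeek-style naming scheme, not read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the quanto command above.
EXCLUDE_PATTERNS = [
    "lm_head", "*self_attn*", "*.gate", "*shared_experts*", "*embed*", "*norm*",
]


def is_excluded(layer_name, patterns=EXCLUDE_PATTERNS):
    """True if the layer matches any exclude pattern (kept in high precision)."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


# Hypothetical layer names for illustration only.
example_layers = [
    "lm_head",
    "model.embed_tokens",
    "model.layers.3.self_attn.q_a_proj",
    "model.layers.3.input_layernorm",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.shared_experts.up_proj",
    "model.layers.3.mlp.experts.17.gate_proj",
]
quantized_layers = [name for name in example_layers if not is_excluded(name)]
```

Under these assumptions only the routed expert projections are quantized: `*.gate` matches the router (`mlp.gate`) but not `gate_proj`, so attention, router, shared experts, embeddings, norms, and the output head stay in higher precision.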

Known Limitations

  • Requires the --enforce-eager flag in vLLM (graph capture triggers a kernel crash with the Quark MXFP4 emulation backend on ROCm)
  • QUARK_MXFP4_IMPL=triton is required on gfx950 (MI355) hardware; the default HIP C++ kernel has a memory access bug on this architecture
  • Native MXFP4 compute kernels (AITER) are not yet available for the w_mxfp4_a_mxfp4 scheme, so weights are dequantized to BF16 on the fly during inference (emulation mode)