Kimi-K2.6-MXFP4

An MXFP4-quantized version of moonshotai/Kimi-K2.6, produced with AMD Quark through the quanto toolkit.

Benchmark Results

Accuracy on MMLU, evaluated 5-shot with log-likelihood scoring. Both runs use the same prompt format (no chat template), so the numbers are directly comparable.

Precision	Model	MMLU 5-shot (acc)	Δ vs W4A16
W4A16 (compressed-tensors)	moonshotai/Kimi-K2.6 (official)	89.62%	—
MXFP4 (OCP MX, Quark RTN)	This model	89.05%	-0.57 pp

The official moonshotai/Kimi-K2.6 release uses W4A16 compressed-tensors quantization (WNA16 MoE method), so the baseline in this comparison already has 4-bit weights rather than full-precision weights.
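For readers unfamiliar with log-likelihood scoring, the sketch below shows the general idea: each answer choice gets a log-likelihood from the model, the highest-scoring choice is taken as the prediction, and accuracy is the fraction of items predicted correctly. The per-choice scores are made-up numbers, and `pick_answer` and `accuracy` are hypothetical helpers rather than functions from the harness used for the table. The last line reproduces the delta from the table in percentage points.

```python
def pick_answer(choice_logprobs: dict[str, float]) -> str:
    """Return the answer label whose continuation has the highest log-likelihood."""
    return max(choice_logprobs, key=choice_logprobs.get)


def accuracy(items: list[tuple[dict[str, float], str]]) -> float:
    """Fraction of items where the highest-scoring choice equals the gold label."""
    correct = sum(pick_answer(logprobs) == gold for logprobs, gold in items)
    return correct / len(items)


# Made-up log-likelihoods for four items (illustration only, not real model output).
items = [
    ({"A": -2.1, "B": -0.3, "C": -3.0, "D": -4.2}, "B"),
    ({"A": -0.9, "B": -1.4, "C": -0.2, "D": -2.5}, "C"),
    ({"A": -0.5, "B": -0.7, "C": -1.9, "D": -2.2}, "B"),  # model prefers A: counted wrong
    ({"A": -1.8, "B": -2.0, "C": -2.6, "D": -0.4}, "D"),
]
acc = accuracy(items)

# Difference between the two accuracies reported in the table, in percentage points.
delta_pp = round(89.05 - 89.62, 2)
```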

Quantization Details

Property	Value
Method	MXFP4 (MX Floating Point 4-bit)
Algorithm	RTN (Round-to-Nearest)
Weight dtype	FP4 (E2M1), OCP MX format
Activation dtype	FP4 (E2M1), dynamic per-group
Scale format	E8M0 (per-group of 32)
Group size	32
Tool	AMD Quark 0.11.1 + quanto
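To make the table concrete, here is a minimal pure-Python sketch of MXFP4 round-to-nearest quantization for one group of 32 values. It assumes the shared-scale rule from the OCP MX specification (a power-of-two scale with exponent floor(log2(max|x|)) minus the largest E2M1 exponent); Quark's actual implementation may differ in details such as tie-breaking and scale rounding, and the function names are hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative FP4 (E2M1) values
E2M1_EMAX = 2      # exponent of the largest E2M1 value (6 = 1.5 * 2**2)
GROUP_SIZE = 32    # elements sharing one E8M0 scale

# 4 bits per element plus one 8-bit scale amortized over the group.
BITS_PER_WEIGHT = 4 + 8 / GROUP_SIZE


def quantize_group(values):
    """Return (shared_exponent, E2M1 values) for one group using round-to-nearest."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Power-of-two scale (what E8M0 stores), chosen so amax lands near the top of the grid.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # clamp to the largest FP4 magnitude
        # Nearest grid point; ties go to the smaller magnitude in this sketch,
        # whereas real implementations typically round ties to even.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_group(shared_exp, quantized):
    """Reconstruct approximate values by multiplying with the shared scale."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


group = [48.0, 24.0, 8.0, -4.0] + [0.0] * (GROUP_SIZE - 4)
exp, q = quantize_group(group)
restored = dequantize_group(exp, q)
```

With a group size of 32, the storage cost works out to 4 + 8/32 = 4.25 bits per quantized element, before counting the layers that are excluded from quantization.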

Model Architecture

Kimi-K2.6 is a 1-trillion-parameter Mixture-of-Experts language model with:

Total parameters: ~1T
Active parameters per token: ~32B
Architecture: MoE with Multi-head Latent Attention (MLA), 61 transformer layers
Experts: 384 routed + 1 shared expert per MoE layer, top-8 routing
Context length: 128K tokens
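As an illustration of what "top-8 routing" over 384 routed experts means, the sketch below selects the eight highest-scoring experts for one token and normalizes their weights with a softmax. This is a generic top-k router written for illustration; the model card does not specify Kimi-K2.6's actual gating function or normalization, so treat the scoring details and the `route` helper as assumptions.

```python
import math

NUM_ROUTED_EXPERTS = 384
TOP_K = 8


def route(logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Deterministic toy router logits for a single token (not real model output).
logits = [math.sin(i) for i in range(NUM_ROUTED_EXPERTS)]
selected = route(logits)

# Share of routed experts that run for each token; the shared expert always runs.
active_fraction = TOP_K / NUM_ROUTED_EXPERTS
```

Only 8 of 384 routed experts run per token, which is why the active parameter count is a small fraction of the total.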

Usage

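from vllm import LLM, SamplingParams

llm = LLM(
    model="haanjack/Kimi-K2.6-MXFP4",
    tensor_parallel_size=4,
    trust_remote_code=True,
    max_model_len=32768,
    enforce_eager=True,          # required: avoids HIP kernel crash during graph capture
    gpu_memory_utilization=0.85,
)

# Sampling values below are illustrative, not tuned recommendations.
sampling_params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Explain MXFP4 quantization in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)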

Required environment variables (AMD ROCm):

export QUARK_MXFP4_IMPL=triton   # use Triton kernel (avoids HIP C++ kernel crash on gfx950)
export PYTORCH_ROCM_ARCH=gfx950   # set to your GPU architecture for fast kernel compilation

Note: This model requires AMD Quark and a recent vLLM build with Quark support (quantization=quark). Tested with rocm/vllm-dev:nightly (vLLM 0.20.1rc1, ROCm 7.2, AMD MI355).
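When launching from Python rather than a shell, the same variables can be set programmatically. The sketch below is one way to do that; `apply_rocm_env` is a hypothetical helper. The card does not say at which point these variables are read, so calling the helper before importing vLLM is a conservative assumption.

```python
import os

# Values taken from the export lines above; adjust PYTORCH_ROCM_ARCH for your GPU.
REQUIRED_ENV = {
    "QUARK_MXFP4_IMPL": "triton",
    "PYTORCH_ROCM_ARCH": "gfx950",
}


def apply_rocm_env(env=os.environ, overrides=None):
    """Set the required variables unless already present; return the effective values."""
    settings = {**REQUIRED_ENV, **(overrides or {})}
    for key, value in settings.items():
        env.setdefault(key, value)  # values already exported by the user win
    return {key: env[key] for key in settings}


# Demonstration on a plain dict so the example has no side effects.
demo_env = {"PYTORCH_ROCM_ARCH": "gfx942"}
effective = apply_rocm_env(demo_env)

# In a real script, call apply_rocm_env() with no arguments before `from vllm import LLM`.
```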

Serving with vLLM

QUARK_MXFP4_IMPL=triton PYTORCH_ROCM_ARCH=gfx950 \
python -m vllm.entrypoints.openai.api_server \
    --model haanjack/Kimi-K2.6-MXFP4 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --enforce-eager
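Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal standard-library client sketch. It assumes the server listens on vLLM's default address (localhost, port 8000) because the command above sets no host or port; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default host and port
MODEL = "haanjack/Kimi-K2.6-MXFP4"


def build_completion_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Example call, left commented out because it requires the server to be up:
# print(complete("The capital of France is"))
```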

Quantization Recipe

Quantized using the quanto CLI:

python -m quanto \
    --model_path moonshotai/Kimi-K2.6 \
    --output_dir ./kimi-k2.6-mxfp4 \
    --precision mxfp4 \
    --exclude_layers lm_head "*self_attn*" "*.gate" "*shared_experts*" "*embed*" "*norm*"

The mxfp4 precision triggers Quark's quantize_model_per_safetensor (file-to-file) path, which processes each safetensors shard independently without loading the full model into GPU memory.
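The effect of the `--exclude_layers` patterns can be previewed with a small sketch. It assumes the patterns are matched glob-style against full module names, which the card does not state, and the layer names listed are hypothetical examples in the usual Hugging Face naming style rather than names read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the --exclude_layers argument above.
EXCLUDE_PATTERNS = ["lm_head", "*self_attn*", "*.gate", "*shared_experts*", "*embed*", "*norm*"]


def is_excluded(name, patterns=EXCLUDE_PATTERNS):
    """True if the module name matches any exclude pattern (glob semantics assumed)."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Hypothetical module names for illustration.
names = [
    "lm_head",
    "model.embed_tokens",
    "model.layers.3.self_attn.o_proj",
    "model.layers.3.input_layernorm",
    "model.layers.3.mlp.gate",                    # router: matches "*.gate"
    "model.layers.3.mlp.shared_experts.up_proj",
    "model.layers.3.mlp.experts.17.gate_proj",    # does not end in ".gate"
    "model.layers.3.mlp.experts.17.down_proj",
]
quantized = [n for n in names if not is_excluded(n)]
```

Under these assumptions only the routed expert projections are quantized; attention, the router, shared experts, embeddings, norms, and the output head keep their original precision.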

Known Limitations

Requires the --enforce-eager flag in vLLM: CUDA graph capture triggers a kernel crash with the Quark MXFP4 emulation backend on ROCm.
QUARK_MXFP4_IMPL=triton is required on gfx950 (MI355) hardware; the default HIP C++ kernel has a memory access bug on this architecture.
Native MXFP4 compute kernels (AITER) are not yet available for the w_mxfp4_a_mxfp4 scheme, so weights are dequantized to BF16 on the fly during inference (emulation mode).


