Kimi-K2.6 Eagle3.1 MLA

EAGLE3 draft model for speculative decoding with Kimi-K2.6-NVFP4.

Improved over kimi-k2.6-eagle3-mla with fc_norm and norm_output.

Features

  • fc_norm: Per-chunk RMSNorm on auxiliary hidden states before FC projection
  • norm_output: Uses post-norm hidden states as auxiliary output

Benchmark Results

Target model: nvidia/Kimi-K2.6-NVFP4

3-token draft (num_speculative_tokens=3)

Benchmark Baseline (k2.6-eagle3-mla) Eagle3.1 (this) Delta
GSM8K 3.191 3.195 +0.004
CEval 2.730 2.836 +0.106
HumanEval 3.192 3.134 -0.058
MATH500 3.183 3.130 -0.053
AIME24 3.013 2.966 -0.047
MTBench 2.602 2.611 +0.009
SPEED-Bench (coding) 3.030 3.013 -0.017
SPEED-Bench (math) 3.298 3.403 +0.105
SPEED-Bench (multilingual) 2.603 2.800 +0.197
SPEED-Bench (qa) 2.557 2.580 +0.023
SPEED-Bench (rag) 3.008 3.045 +0.037

Usage with vLLM

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only

Note: Requires vLLM with PR #42764 and PR #43482 for fc_norm support.

Downloads last month
12
Safetensors
Model size
3B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support