Kimi-K2.6 Eagle3.1 MLA

EAGLE3 draft model for speculative decoding with Kimi-K2.6-NVFP4.

Improved over kimi-k2.6-eagle3-mla with fc_norm and norm_output.

Features

fc_norm: Per-chunk RMSNorm on auxiliary hidden states before FC projection
norm_output: Uses post-norm hidden states as auxiliary output

Benchmark Results

3-token draft (num_speculative_tokens=3)

Benchmark	Baseline (k2.6-eagle3-mla)	Eagle3.1 (this)	Delta
GSM8K	3.191	3.195	+0.004
CEval	2.730	2.836	+0.106
HumanEval	3.192	3.134	-0.058
MATH500	3.183	3.130	-0.053
AIME24	3.013	2.966	-0.047
MTBench	2.602	2.611	+0.009
SPEED-Bench (coding)	3.030	3.013	-0.017
SPEED-Bench (math)	3.298	3.403	+0.105
SPEED-Bench (multilingual)	2.603	2.800	+0.197
SPEED-Bench (qa)	2.557	2.580	+0.023
SPEED-Bench (rag)	3.008	3.045	+0.037

Usage with vLLM

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only

Note: Requires vLLM with PR #42764 and PR #43482 for fc_norm support.

Downloads last month: 12

Safetensors

Model size

3B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support