Kimi-K2.6 Eagle3.1 MLA
EAGLE3 draft model for speculative decoding with Kimi-K2.6-NVFP4.
An update to kimi-k2.6-eagle3-mla that adds fc_norm and norm_output.
Features
- fc_norm: Per-chunk RMSNorm on auxiliary hidden states before FC projection
- norm_output: Uses post-norm hidden states as auxiliary output
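
To make the fc_norm description concrete, here is a minimal dependency-free sketch of a per-chunk RMSNorm followed by an FC projection. It is an illustration, not the model's actual code: the function names (`rms_norm`, `fc_norm_then_project`), the use of three chunks, the per-chunk weight layout, the bias-free projection, and the `eps` value are all assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by sqrt(mean(x^2) + eps), then scale element-wise by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def fc_norm_then_project(aux_states, norm_weights, fc_weight):
    """Normalize each auxiliary chunk on its own, concatenate, then project.

    aux_states:   list of chunks, one hidden-state vector per tapped layer (assumed)
    norm_weights: one RMSNorm weight vector per chunk (hypothetical layout)
    fc_weight:    matrix [out_dim][sum of chunk sizes], applied without bias (assumed)
    """
    normed = []
    for chunk, weight in zip(aux_states, norm_weights):
        normed.extend(rms_norm(chunk, weight))
    return [sum(w * v for w, v in zip(row, normed)) for row in fc_weight]


# Toy input: three chunks with very different magnitudes.
hidden = 4
aux = [
    [1.0, 2.0, 3.0, 4.0],
    [100.0, 200.0, 300.0, 400.0],
    [0.01, 0.02, 0.03, 0.04],
]
weights = [[1.0] * hidden for _ in aux]
# Toy projection: output i sums position i of every chunk.
fc = [
    [1.0 if j % hidden == i else 0.0 for j in range(hidden * len(aux))]
    for i in range(hidden)
]
out = fc_norm_then_project(aux, weights, fc)
print(out)
```

Because each chunk is normalized separately, the large-magnitude chunk does not dominate the projection input, which is the usual motivation for normalizing before a shared linear layer.
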
Benchmark Results
Target model: nvidia/Kimi-K2.6-NVFP4
3-token draft (num_speculative_tokens=3)
| Benchmark | Baseline (k2.6-eagle3-mla) | Eagle3.1 (this) | Delta |
|---|---|---|---|
| GSM8K | 3.191 | 3.195 | +0.004 |
| CEval | 2.730 | 2.836 | +0.106 |
| HumanEval | 3.192 | 3.134 | -0.058 |
| MATH500 | 3.183 | 3.130 | -0.053 |
| AIME24 | 3.013 | 2.966 | -0.047 |
| MTBench | 2.602 | 2.611 | +0.009 |
| SPEED-Bench (coding) | 3.030 | 3.013 | -0.017 |
| SPEED-Bench (math) | 3.298 | 3.403 | +0.105 |
| SPEED-Bench (multilingual) | 2.603 | 2.800 | +0.197 |
| SPEED-Bench (qa) | 2.557 | 2.580 | +0.023 |
| SPEED-Bench (rag) | 3.008 | 3.045 | +0.037 |
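
The Delta column can be recomputed from the two score columns. The short sketch below does that and summarizes which benchmarks moved in which direction; the variable names are illustrative, and the numbers are copied from the table above.

```python
# (benchmark, baseline score, Eagle3.1 score), copied from the table above.
rows = [
    ("GSM8K", 3.191, 3.195),
    ("CEval", 2.730, 2.836),
    ("HumanEval", 3.192, 3.134),
    ("MATH500", 3.183, 3.130),
    ("AIME24", 3.013, 2.966),
    ("MTBench", 2.602, 2.611),
    ("SPEED-Bench (coding)", 3.030, 3.013),
    ("SPEED-Bench (math)", 3.298, 3.403),
    ("SPEED-Bench (multilingual)", 2.603, 2.800),
    ("SPEED-Bench (qa)", 2.557, 2.580),
    ("SPEED-Bench (rag)", 3.008, 3.045),
]

# Round to three decimals to match the table's precision.
deltas = {name: round(new - base, 3) for name, base, new in rows}
improved = [name for name, d in deltas.items() if d > 0]
regressed = [name for name, d in deltas.items() if d < 0]
mean_delta = sum(deltas.values()) / len(deltas)

for name, d in deltas.items():
    print(f"{name:28s} {d:+.3f}")
print(f"improved: {len(improved)}, regressed: {len(regressed)}, mean delta: {mean_delta:+.3f}")
```

The regressions are all in the code and math benchmarks (HumanEval, MATH500, AIME24, SPEED-Bench coding), while the largest gains are on CEval, SPEED-Bench (math), and SPEED-Bench (multilingual).
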
Usage with vLLM
    vllm serve nvidia/Kimi-K2.6-NVFP4 \
        --trust-remote-code \
        --tensor-parallel-size 4 \
        --tool-call-parser kimi_k2 \
        --enable-auto-tool-choice \
        --reasoning-parser kimi_k2 \
        --attention-backend tokenspeed_mla \
        --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
        --language-model-only
Note: fc_norm support requires a vLLM build that includes PR #42764 and PR #43482.
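
When launching the server from a script, the JSON passed to `--speculative-config` is easy to misquote. The sketch below assembles the same command from Python, using only the flags and values shown above; the helper name `build_vllm_command` is hypothetical and not part of vLLM.

```python
import json
import shlex

# Same speculative decoding settings as in the command above.
speculative_config = {
    "model": "lightseekorg/kimi-k2.6-eagle3.1-mla",
    "method": "eagle3",
    "num_speculative_tokens": 3,
}


def build_vllm_command(target_model, spec_config, tensor_parallel_size=4):
    """Return the `vllm serve` invocation as an argument list (hypothetical helper)."""
    return [
        "vllm", "serve", target_model,
        "--trust-remote-code",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--tool-call-parser", "kimi_k2",
        "--enable-auto-tool-choice",
        "--reasoning-parser", "kimi_k2",
        "--attention-backend", "tokenspeed_mla",
        # Compact JSON; passed as a single argument, so no manual shell quoting.
        "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
        "--language-model-only",
    ]


command = build_vllm_command("nvidia/Kimi-K2.6-NVFP4", speculative_config)
# shlex.join adds the quoting needed to paste the command into a shell.
print(shlex.join(command))
```

The argument list can be handed to `subprocess.run(command)` directly, which avoids shell quoting altogether.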