MiniMax-M2.7-EAGLE3 (draft model) โ v0.1-preview
An EAGLE3 speculative-decoding draft model for MiniMaxAI/MiniMax-M2.7. It proposes candidate tokens that the full MiniMax-M2.7 model verifies in a single forward pass, accelerating single-stream generation with no change to outputs (lossless speculative decoding).
โ ๏ธ v0.1-preview โ this is an intermediate training checkpoint (not yet fully converged). A converged release will follow. See Status below.
What it is
- Method: EAGLE3 (
LlamaForCausalLMEagle3), single draft layer. - Target: MiniMax-M2.7 (230B MoE, hidden 3072, vocab 200064).
- Aux hidden-state taps: target layers
[2, 31, 59](low/mid/high). - Vocab compression: draft predicts the top 32,000 tokens (
d2t/t2dmapping embedded in the weights), keeping the draft small and fast. - Size: ~0.25B params (single layer + fusion + compressed head); ~0.5 GB.
Measured quality
On an in-distribution mix (chat / code / math), served with SGLang against the target:
| Metric | Value |
|---|---|
| Mean accept length (tau) | ~2.6 (range 2.3-3.05) |
accept length is hardware-independent (it is a model property): the target accepts
~2.6 tokens per verification step on average.
Honest note on realized speedup (read before deploying)
Speculative decoding turns accept-length into wall-clock speedup only if the
GPU interconnect is fast. On systems with NVLink or working PCIe P2P, tau2.6
translates to roughly **2x single-stream**. On a setup where tensor-parallel
all-reduce is host-staged (no P2P), the draft's per-step communication overhead
can offset the gain (approximately break-even). Validate on your hardware. The draft
is most valuable on NVLink / PCIe-Gen5-P2P serving.
Usage (SGLang)
python3 -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M2.7 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path Lorbus/MiniMax-M2.7-EAGLE3 \
--speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 \
--tensor-parallel-size <N>
Tune --speculative-num-steps / --eagle-topk / --num-draft-tokens for your hardware
(smaller trees lower draft overhead on slower interconnects).
Training
- Framework: SpecForge (EAGLE3, online).
- Data: ~50K on-policy samples โ prompts from UltraChat + OpenCodeInstruct + CodeAlpaca + GSM8K + Hendrycks-MATH, with completions regenerated by MiniMax-M2.7 so the draft learns the target's own distribution.
- Recipe: online hidden-state capture, draft-vocab 32000, rope_theta 5e6 (matched to target), ~3+ epochs.
Status / roadmap
- v0.1-preview (this): intermediate checkpoint, tau~2.6.
- Planned: converged release; a larger on-policy corpus; a DFlash variant.
License & attribution
This is a derivative work of MiniMax-M2.7 and is released under the MiniMax-M2.7 Non-Commercial License (it inherits the base model's terms):
- Free for personal, self-hosted, research, experimentation, academic & non-profit use.
- Commercial use requires prior written authorization from MiniMax
(contact
api@minimax.io, subject "M2.7 licensing"). Commercial deployments must prominently display "Built with MiniMax M2.7".
Not affiliated with or endorsed by MiniMax or NVIDIA. Community project. Provided "as is", without warranty.
- Downloads last month
- 18
Model tree for Lorbus/MiniMax-M2.7-EAGLE3
Base model
MiniMaxAI/MiniMax-M2.7