MiniMax-M2.7-EAGLE3 (draft model) โ€” v0.1-preview

An EAGLE3 speculative-decoding draft model for MiniMaxAI/MiniMax-M2.7. It proposes candidate tokens that the full MiniMax-M2.7 model verifies in a single forward pass, accelerating single-stream generation with no change to outputs (lossless speculative decoding).

โš ๏ธ v0.1-preview โ€” this is an intermediate training checkpoint (not yet fully converged). A converged release will follow. See Status below.

What it is

  • Method: EAGLE3 (LlamaForCausalLMEagle3), single draft layer.
  • Target: MiniMax-M2.7 (230B MoE, hidden 3072, vocab 200064).
  • Aux hidden-state taps: target layers [2, 31, 59] (low/mid/high).
  • Vocab compression: draft predicts the top 32,000 tokens (d2t/t2d mapping embedded in the weights), keeping the draft small and fast.
  • Size: ~0.25B params (single layer + fusion + compressed head); ~0.5 GB.

Measured quality

On an in-distribution mix (chat / code / math), served with SGLang against the target:

Metric Value
Mean accept length (tau) ~2.6 (range 2.3-3.05)

accept length is hardware-independent (it is a model property): the target accepts ~2.6 tokens per verification step on average.

Honest note on realized speedup (read before deploying)

Speculative decoding turns accept-length into wall-clock speedup only if the GPU interconnect is fast. On systems with NVLink or working PCIe P2P, tau2.6 translates to roughly **2x single-stream**. On a setup where tensor-parallel all-reduce is host-staged (no P2P), the draft's per-step communication overhead can offset the gain (approximately break-even). Validate on your hardware. The draft is most valuable on NVLink / PCIe-Gen5-P2P serving.

Usage (SGLang)

python3 -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.7 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path Lorbus/MiniMax-M2.7-EAGLE3 \
  --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 \
  --tensor-parallel-size <N>

Tune --speculative-num-steps / --eagle-topk / --num-draft-tokens for your hardware (smaller trees lower draft overhead on slower interconnects).

Training

  • Framework: SpecForge (EAGLE3, online).
  • Data: ~50K on-policy samples โ€” prompts from UltraChat + OpenCodeInstruct + CodeAlpaca + GSM8K + Hendrycks-MATH, with completions regenerated by MiniMax-M2.7 so the draft learns the target's own distribution.
  • Recipe: online hidden-state capture, draft-vocab 32000, rope_theta 5e6 (matched to target), ~3+ epochs.

Status / roadmap

  • v0.1-preview (this): intermediate checkpoint, tau~2.6.
  • Planned: converged release; a larger on-policy corpus; a DFlash variant.

License & attribution

This is a derivative work of MiniMax-M2.7 and is released under the MiniMax-M2.7 Non-Commercial License (it inherits the base model's terms):

  • Free for personal, self-hosted, research, experimentation, academic & non-profit use.
  • Commercial use requires prior written authorization from MiniMax (contact api@minimax.io, subject "M2.7 licensing"). Commercial deployments must prominently display "Built with MiniMax M2.7".

Not affiliated with or endorsed by MiniMax or NVIDIA. Community project. Provided "as is", without warranty.

Downloads last month
18
Safetensors
Model size
0.3B params
Tensor type
I64
ยท
BF16
ยท
BOOL
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for Lorbus/MiniMax-M2.7-EAGLE3

Finetuned
(25)
this model