MiniMax-M2.7 β€” Greg-Knee Custom GGUF

A custom per-tensor, per-layer GGUF quantization of MiniMaxAI/MiniMax-M2.7 built for a specific three-tier memory topology (GPU + RAM + NVMe). The per-layer quant choice and memory-tier placement were generated by a mixed-integer program (MIP) optimizing a quality objective under tier byte budgets and a tokens/second floor.

TL;DR: 147 GiB total, ~5.53 BPW, 4 shards. Built for ~30 GiB combined VRAM + ~55 GiB usable RAM + fast NVMe. Predicted ~11 tok/s on a 3090+4070+DDR5+NVMe rig.

This is not a general-purpose quant. It's a specific point on a quality/size/speed surface chosen for my hardware. If your setup is different, you may want a different knee β€” the recipe/ folder has everything needed to generate your own.

Files

File Size Notes
MiniMax-M2.7-Greg-Knee-0000[1-4]-of-00004.gguf 147.2 GiB total the quantized model, 4 shards
recipe/tensor_types_final.txt 2 KB the rules file fed to llama-quantize
recipe/v2_tight_aggressive.json 118 KB MIP solver output (per-layer plan)
recipe/make_rules.py 14 KB generator: MIP JSON to rules file
logs/quantize.log ~120 KB full quantize stdout, all 809 tensors

Recipe overview

Quantization per tensor group

Tensor group Quant Why
token_embd.weight, output.weight Q8_0 embedding and LM head kept high
*_norm, ffn_gate_inp, exp_probs_b.bias F32 norms and router never quantized
attn_{q,k,v,output}.weight (all layers) Q8_0 attention kept high
ffn_gate_exps / ffn_up_exps - 43 layers Q5_K priority + most middle layers
ffn_gate_exps / ffn_up_exps - 19 layers Q4_K remaining non-priority middle
ffn_down_exps - 16 priority (layers 0-7, 54-61) Q6_K down is more sensitive, bookends matter
ffn_down_exps - 46 middle (layers 8-53) Q5_K

Decisions driven by:

  • ffn_down vs ffn_gate/up asymmetry β€” community evidence (anikifoss, Unsloth Qwen3.5 benchmarks) and outlier-weight research consistently show down-projections degrade faster than gate/up at the same bits. The MIP scores each separately.
  • Priority layers (first 8 + last 8) β€” KLD-sensitivity analyses and Unsloth's M2.7 NaN fix both point to the bookends of a MoE stack needing more precision.
  • NaN safety pin on blk.61.ffn_down_exps β€” Unsloth discovered this layer NaNs under aggressive quant; their fix upcasts it to Q6_K. We honor the same floor.

Tier placement (for my rig, reference only)

The MIP also picked which 62 MoE layers live on which memory tier:

  • GPU (10 layers): MoE experts loaded into VRAM alongside backbone
  • RAM (22 layers): mmap'd, kept hot in page cache
  • NVMe (30 layers): mmap'd, streamed from disk with OS page cache warm-up

Expressed at inference time via -ot regex to llama-server. For other setups this split needs recomputing.

Running

Standard llama-server works. For my rig:

./llama-server \
  -m MiniMax-M2.7-Greg-Knee-00001-of-00004.gguf \
  -ngl 99 \
  --tensor-split 67,33 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|20|22|24|25|26|27|29|30|31|32|34|35|36|37|38|40|41|43|44|46|48|49|51|52|53|54|55|56|57|58|59|60|61)\.ffn_.*_exps\.weight=CPU" \
  -c 65536 \
  -ctk q8_0 -ctv q8_0 \
  -fa on \
  --threads 12 \
  --temp 1.0 --top-p 0.95 --top-k 40 \
  --jinja

Key bits:

  • -ngl 99 β€” offload all layers to GPU by default
  • --tensor-split 67,33 β€” for my 3090 (22 GiB usable) + 4070 (11 GiB usable) split
  • -ot ...=CPU β€” force 52 specific MoE layers off GPU; the other 10 stay on GPU
  • -ctk q8_0 -ctv q8_0 β€” quantized KV cache, doubles effective context for the same VRAM
  • Sampling params are MiniMax's recommended defaults

Adjust -ot for your hardware. The regex lists layer indices that should live on CPU. For more VRAM, drop layers from the list. For less, add them.

Source chain

  • Original weights: MiniMaxAI/MiniMax-M2.7 (FP8)
  • BF16 GGUF intermediate: unsloth/MiniMax-M2.7-GGUF (thanks Unsloth β€” saved the FP8 to BF16 upcast step)
  • Quantization: MIP + llama-quantize --tensor-type-file on llama.cpp commit fae3a2807 (April 14, 2026)
  • No imatrix β€” calibration set was unavailable during this session. Quality cost is ~2-5% PPL at these quant levels, below the MIP's own quality-score noise floor.

Notes / gotchas learned during this build

  • llama-quantize --tensor-type and --tensor-type-file accept only base ggml_type names (q4_k, q5_k, q6_k, q8_0, f32). The _M and _S variants shown in --help are rejected with parse_ggml_type: invalid ggml_type. Positional fallback quant accepts both name and numeric ID (17 = Q5_K_M).
  • --custom-q is a third-party script convention (anikifoss), NOT a built-in llama.cpp flag. The built-in is --tensor-type-file.
  • Current master llama.cpp automatically applies the blk.61.ffn_down_exps to Q6_K override for MiniMax-M2 arch. Explicit rules override harmlessly.
  • BF16 source from Unsloth is ~427 GB split across 10 shards. llama-quantize auto-follows the shard chain from shard 1 of N.

Credits

  • MiniMax AI for the base model
  • Unsloth team for the pre-upcasted BF16 GGUF and the NaN-pin discovery
  • anikifoss and ubergarm for documenting MiniMax-M2 tensor naming and custom quant patterns
  • llama.cpp maintainers for --tensor-type-file and the architecture support

License

Same as the base model (modified-mit).

Downloads last month
7
GGUF
Model size
229B params
Architecture
minimax-m2
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for W0lfland/MiniMax-M2.7-Greg-Knee-GGUF

Quantized
(115)
this model