---
license: apache-2.0
base_model:
- MiniMaxAI/MiniMax-M2.5
- MiniMaxAI/MiniMax-M2.7
tags:
- merge
- slerp
- moe
- fp8
- minimax
- minimax_m2
- code
- reasoning
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
---

# MiniMax-SLURPY

**A per-tensor empirical SLERP merge of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) in native FP8.**

MiniMax-SLURPY combines M2.5's logical precision with M2.7's improved code generation and instruction following — without any additional training, fine-tuning, or RL. The merge is driven entirely by a full-model forensic analysis of the 96,103 tensor pairs between the two parent models.

## Results

| Model | HumanEval pass@1 |
|---|---|
| MiniMax-M2.7 | 89.0% (146/164) |
| **MiniMax-SLURPY** | **86.6% (142/164)** |
| MiniMax-M2.5 | 85.4% (140/164) |

SLURPY beats M2.5 by 2 problems while preserving coherent thinking-mode output, tool-calling support, and the full MiniMax-M2 architecture.

## Architecture

Identical to MiniMax-M2.5 / M2.7 — this is a weight merge, not an architecture change:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias (sketched below)
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: FP8 (`float8_e4m3fn`), block size [128, 128]
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**
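
The sigmoid router is worth a quick sketch for readers new to this MoE family: each expert gets an independent sigmoid score, and a learned bias shifts which experts are *selected* without changing the gate values themselves. The snippet below is a schematic of that general scheme, not this model's actual routing code, and all tensor names are illustrative:

```python
import torch

def sigmoid_topk_route(hidden, router_weight, score_bias, top_k=8):
    """Schematic sigmoid top-k routing with a learned selection bias.

    hidden:        [tokens, 3072]  token activations
    router_weight: [256, 3072]     one row per expert (illustrative name)
    score_bias:    [256]           learned bias, used for selection only
    """
    scores = torch.sigmoid(hidden @ router_weight.T)      # [tokens, 256]
    # The bias shifts which experts win top-k, but the gate weights
    # come from the unbiased scores.
    _, expert_idx = torch.topk(scores + score_bias, top_k, dim=-1)
    gates = torch.gather(scores, -1, expert_idx)
    gates = gates / gates.sum(dim=-1, keepdim=True)       # normalize over top-8
    return expert_idx, gates
```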

## Merge Method

**Per-tensor empirical SLERP** — each of the 96,103 checkpoint tensors gets its own interpolation ratio `t(k)`, derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:

```
delta(k)      = 1 - cos(M2.5_k, M2.7_k)
delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
t(k)          = 0.50 + 0.35 * delta_norm(k)
```

- **Tensors that barely changed** (cos ≈ 1.0) get `t ≈ 0.50` — a neutral midpoint blend
- **Tensors that changed the most** (cos < 0.993, concentrated in layer 61 MoE experts) get `t = 0.85` — a strong M2.7 bias
- **FP8 weights** are dequantized to BF16 before SLERP, then re-quantized to FP8 with fresh block-wise scales
- **Norms, gates, and biases** use LERP in an fp32 accumulator
- **model.norm.weight** passes through from M2.7 unchanged
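
As a minimal sketch of the schedule and blend (not the production `merge_core` code), assuming `a` and `b` are the dequantized BF16 copies of one tensor from M2.5 and M2.7 and `delta_p99` is the 99th-percentile delta from the forensic scan:

```python
import torch

def schedule_t(a: torch.Tensor, b: torch.Tensor, delta_p99: float) -> float:
    """t(k) = 0.50 + 0.35 * clip((1 - cos) / delta_p99, 0, 1)."""
    a32, b32 = a.float().flatten(), b.float().flatten()
    cos = torch.dot(a32, b32) / (a32.norm() * b32.norm() + 1e-8)
    delta_norm = min(max((1.0 - cos.item()) / delta_p99, 0.0), 1.0)
    return 0.50 + 0.35 * delta_norm

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical interpolation between two weight tensors, flattened to vectors."""
    a32, b32 = a.float().flatten(), b.float().flatten()
    cos = torch.dot(a32, b32) / (a32.norm() * b32.norm() + eps)
    omega = torch.arccos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    so = torch.sin(omega)
    if so.abs() < eps:  # near-parallel tensors: SLERP degenerates to LERP
        out = (1.0 - t) * a32 + t * b32
    else:
        out = (torch.sin((1.0 - t) * omega) / so) * a32 + (torch.sin(t * omega) / so) * b32
    return out.reshape(a.shape)

# merged_k = slerp(a, b, schedule_t(a, b, delta_p99)), then re-quantized to FP8
# with freshly computed block-wise [128, 128] scales, as described above.
```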

### Forensic findings that drove the schedule

A full-model forensic scan of all 96,103 tensor pairs revealed:
- **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x the baseline — this is where M2.7's training signal concentrates
- **scale_inv is 0% bit-identical** between M2.5 and M2.7 — the original merge plan's pass-through assumption would have silently corrupted every FP8 tensor, so all scale_inv tensors are recomputed after merging
- **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary habits, including improved import discipline
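
The numbers above come from a shard-by-shard comparison of the two checkpoints. A minimal sketch of that kind of scan, assuming both checkpoints are local directories with identically named safetensors shards (`cosine_report` is an illustrative name, and the real scan also applies the FP8 block scales before comparing):

```python
import torch
from pathlib import Path
from safetensors.torch import load_file

def cosine_report(dir_a: str, dir_b: str) -> dict:
    """Cosine similarity for every tensor pair across two checkpoints."""
    report = {}
    for shard in sorted(Path(dir_a).glob("*.safetensors")):
        tensors_a = load_file(str(shard))
        tensors_b = load_file(str(Path(dir_b) / shard.name))
        for name, a in tensors_a.items():
            # The real scan dequantizes FP8 tensors with their block scales first.
            a32 = a.float().flatten()
            b32 = tensors_b[name].float().flatten()
            cos = torch.dot(a32, b32) / (a32.norm() * b32.norm() + 1e-8)
            report[name] = cos.item()
    return report
```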

## Serving with vLLM

Recommended command (8x H100 80GB):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    Ex0bit/MiniMax-SLURPY --trust-remote-code \
    --enable-expert-parallel --tensor-parallel-size 8 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enforce-eager
```

For 4x GPUs (no expert parallelism):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    Ex0bit/MiniMax-SLURPY --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```

If you encounter CUDA memory errors, add:

```bash
--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```

### Recommended sampling parameters

| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
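
These map directly onto vLLM's OpenAI-compatible API. For example (the base URL assumes the `vllm serve` command above; `top_k` goes through `extra_body` because it is a vLLM extension to the OpenAI schema):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Ex0bit/MiniMax-SLURPY",
    messages=[{"role": "user", "content": "Reverse a linked list in Python."}],
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # vLLM-specific sampling parameter
)
print(response.choices[0].message.content)
```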

### Important: preserve thinking in conversation history

MiniMax-M2 uses interleaved thinking: the model outputs `<think>...</think>` blocks during generation. **You must pass these blocks back verbatim in the conversation history.** Stripping them degrades performance.
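
A minimal illustration of the round trip (the content is made up; the point is that the assistant turn is appended unmodified, `<think>` block included):

```python
messages = [{"role": "user", "content": "What is 17 * 23?"}]

# First turn: the raw completion includes the thinking block.
assistant_reply = "<think>17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391</think>391."

# Append it verbatim -- do NOT strip the <think>...</think> span.
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "Now divide that by 17."})
```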

## Tool Calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```

Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
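
With the parser enabled, the XML wrapper is handled server-side and tools work through the standard OpenAI-style interface. A minimal sketch, reusing the hypothetical `get_weather` tool from the example above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Ex0bit/MiniMax-SLURPY",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # parsed from the XML wrapper
```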

## Using with Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=1.0, top_p=0.95, top_k=40)

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

## Config notes

- `use_mtp` is set to `False` in config.json (the MTP tensors don't exist in the checkpoint, despite the original config declaring them)
- `quantization_config` is preserved — this model is native FP8, not dequantized
- Chat template and tokenizer are copied from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors (see the sketch after this list)
- `chat_template.jinja` — M2.7's chat template with tool-calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code (requires `trust_remote_code=True`)
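
As a rough sketch of what block-wise scaling means here, assuming the common convention that `scale_inv` multiplies the FP8 codes to recover real values (function and argument names are illustrative):

```python
import torch

def dequant_block_fp8(w_fp8: torch.Tensor, scale_inv: torch.Tensor,
                      block: int = 128) -> torch.Tensor:
    """Dequantize an FP8 matrix whose scales are stored per [128, 128] block."""
    rows, cols = w_fp8.shape
    # Expand the [ceil(rows/128), ceil(cols/128)] scale grid to full resolution.
    scales = scale_inv.repeat_interleave(block, dim=0)[:rows]
    scales = scales.repeat_interleave(block, dim=1)[:, :cols]
    return (w_fp8.float() * scales).to(torch.bfloat16)
```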

## Merge code

The full merge pipeline (forensics scan, per-tensor SLERP, FP8 dequant/requant, validation gates) is open:

- Merge script: `merge_m25_m27.py`
- Per-tensor schedule: `merge_core/schedule.py`
- FP8 primitives: `merge_core/fp8_io.py`
- SLERP: `merge_core/slerp.py`
- Tensor classifier: `merge_core/tensor_classifier.py`
- Benchmark harness: `bench/run_bench.py`

## Citation

```bibtex
@misc{minimax-slurpy-2026,
  title={MiniMax-SLURPY: Per-tensor empirical SLERP merge of MiniMax-M2.5 and M2.7},
  author={Ex0bit},
  year={2026},
  url={https://huggingface.co/Ex0bit/MiniMax-SLURPY}
}
```

## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- Merge infrastructure adapted from the [PRISM abliteration pipeline](https://github.com/exobit)
- FP8 dequant/requant primitives derived from the MiniMax-M2.5-PRISM project