tags:
- minimax_m2
- code
- reasoning
- agents
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
---

# MiniMax-SLURPY

**A mathematically unique blend of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.**

SLURPY inherits M2.5's architect-first coding style and MIT freedom, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering — without a single training step. It beats both parents on HumanEval pass@5 (89.6% vs M2.5's 85.4%).

Every one of SLURPY's 48,239 weight tensors is an interpolated value: none is copied verbatim from either parent.

---

## What SLURPY inherits

SLURPY's weights are a forensically driven interpolation of two complementary parents. The merge schedule is derived from a full-model scan of all 96,103 tensor pairs, tying each tensor's interpolation ratio to the empirically measured delta between the parents.

### From M2.5 — the architect

M2.5 is the foundation-builder: strong on greenfield engineering, deep reasoning, and research-grade benchmarks.

| Benchmark | M2.5 Published |
|---|---|
| SWE-Bench Verified | **80.2%** |
| BrowseComp (with context mgmt) | **76.3%** |
| Multi-SWE-Bench | 51.3% |
| AIME 2025 | 86.3 |
| GPQA Diamond | 85.2 |
| SciCode | 44.4 |
| IFBench | 70.0 |
| HLE (w/o tools) | 19.4 |
| GDPval-MM (office work) | 59.0% avg win rate |

### From M2.7 — the operator

M2.7 is the execution specialist: RL-tuned for multi-step tool use, terminal ops, agentic scaffolding, and production-grade software engineering.

| Benchmark | M2.7 Published |
|---|---|
| SWE-Pro | **56.2%** (matches GPT-5.3-Codex) |
| SWE Multilingual | **76.5%** |
| Multi-SWE-Bench | 52.7% |
| MLE Bench Lite | **66.6%** medal rate (22 ML competitions) |
| VIBE-Pro | **55.6%** (near Opus 4.6) |
| TerminalBench 2 | **57.0%** |
| NL2Repo | 39.8% |
| GDPval-AA ELO | **1495** (highest open-weight) |
| Toolathon | 46.3% accuracy |
| MM Claw (skill compliance) | **97%** across 40+ skills |
| MM Claw (end-to-end) | 62.7% (near Sonnet 4.6) |

### SLURPY — best of both

SLURPY's merge schedule preserves M2.5's deep reasoning character in the early-to-mid layers (where the two models barely differ) while absorbing M2.7's agentic improvements in the late layers (where M2.7's training signal concentrates). The result is a model that carries both parents' strengths without the training cost of either.

---

## Merge method

**Per-tensor empirical SLERP** — each of the 48,239 mergeable weight tensors gets its own interpolation ratio `t(k)`, derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:

```
delta(k) = 1 - cos(M2.5_k, M2.7_k)
delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
t(k) = 0.50 + 0.35 * delta_norm(k)
```

- **Tensors that barely changed** (cos ~ 1.0): `t ~ 0.50` — a neutral midpoint that preserves both parents
- **Tensors that changed the most** (layer 61 MoE experts): `t = 0.85` — absorbing M2.7's concentrated training signal
- **FP8 weights**: dequantized to BF16 before SLERP, then re-quantized with fresh block-wise scales
- **No scale_inv pass-through**: forensics confirmed 0% bit-identical scales between the parents — all 47,864 FP8 scale tensors are recomputed, not copied

### Forensic highlights

- **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's RL training signal concentrates
- **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary-level improvements

---

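The schedule and interpolation above map straightforwardly to code. A minimal NumPy sketch, not the repository's actual merge implementation (the function names and the example p99 value are illustrative):

```python
import numpy as np

def schedule_t(cos_k: float, delta_p99: float) -> float:
    """Map a tensor's parent-vs-parent cosine similarity to its blend ratio t(k)."""
    delta = 1.0 - cos_k
    delta_norm = min(max(delta / delta_p99, 0.0), 1.0)
    return 0.50 + 0.35 * delta_norm

def slerp(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical interpolation between two weight tensors of the same shape."""
    af = a.ravel().astype(np.float64)
    bf = b.ravel().astype(np.float64)
    cos = float(af @ bf / (np.linalg.norm(af) * np.linalg.norm(bf) + eps))
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    if omega < eps:  # nearly parallel tensors: plain LERP is numerically safer
        return (1.0 - t) * a + t * b
    so = np.sin(omega)
    out = (np.sin((1.0 - t) * omega) / so) * af + (np.sin(t * omega) / so) * bf
    return out.reshape(a.shape)
```

Tensors whose parents nearly agree land at the neutral `t = 0.50` midpoint; only the high-delta outliers are pushed toward M2.7.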
## Architecture

Identical to MiniMax-M2.5 / M2.7 — weight merge only, no architecture changes:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

---

## Serving with vLLM

For 4x GPU (no expert parallel):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
  Ex0bit/MiniMax-SLURPY --trust-remote-code \
  ...
```

MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.

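A practical pattern is to keep the raw output, `<think>` blocks included, in the history you send back, and strip the blocks only for display. A sketch with hypothetical helper names:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def display_text(raw_output: str) -> str:
    """Strip <think>...</think> blocks for user-facing display only."""
    return THINK_RE.sub("", raw_output).strip()

def append_turn(history: list, raw_output: str) -> list:
    """The history must keep the assistant output verbatim, think blocks and all."""
    history.append({"role": "assistant", "content": raw_output})
    return history
```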
---

## Tool calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers.

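For post-processing outside vLLM, the wrapper tags can be pulled out with a regex. A sketch with an illustrative helper (the payload format inside the wrappers is not shown here, so this only extracts the raw contents):

```python
import re

TOOL_CALL_RE = re.compile(
    r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL
)

def extract_tool_calls(text: str) -> list:
    """Return the raw contents of each <minimax:tool_call> wrapper."""
    return [m.strip() for m in TOOL_CALL_RE.findall(text)]
```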
Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.

---

## Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        top_k=40,
    )

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

---

## Config notes

- `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint)
- `quantization_config` is preserved — native FP8
- Chat template and tokenizer are sourced from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code

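The block-wise `[128, 128]` scale layout can be illustrated with a small sketch. This is not the repository's FP8 code; the helper name is made up, and 448 is the largest finite value representable in `float8_e4m3fn`:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def blockwise_scales(w: np.ndarray, block: int = 128) -> np.ndarray:
    """Compute one scale per [block, block] tile so each tile fits the FP8 range."""
    n_rows = -(-w.shape[0] // block)  # ceiling division
    n_cols = -(-w.shape[1] // block)
    scales = np.ones((n_rows, n_cols), dtype=np.float32)
    for i in range(n_rows):
        for j in range(n_cols):
            tile = w[i * block:(i + 1) * block, j * block:(j + 1) * block]
            amax = float(np.abs(tile).max())
            if amax > 0.0:
                scales[i, j] = amax / E4M3_MAX
    return scales
```

Recomputing these scales after the merge (rather than copying either parent's `scale_inv`) is what keeps the re-quantized FP8 tensors consistent with the blended weights.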
---

## License

Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.

The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.

---

## Citation

## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- Merge infrastructure adapted from the PRISM abliteration pipeline