Ex0bit committed on
Commit f30905c · verified · 1 Parent(s): 54db681

Upload README.md with huggingface_hub

Files changed (1): README.md +104 -49

README.md CHANGED
@@ -14,6 +14,7 @@ tags:
- minimax_m2
- code
- reasoning
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
@@ -21,23 +22,84 @@ library_name: transformers

# MiniMax-SLURPY

- **A per-tensor empirical SLERP merge of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) in native FP8.**

- MiniMax-SLURPY combines M2.5's logic precision with M2.7's improved code generation and instruction following — without any additional training, fine-tuning, or RL. The merge is driven entirely by a full-model forensic analysis of the 96,103 tensor pairs between the two parent models.

- ## Results

- | Model | HumanEval pass@1 |
|---|---|
- | MiniMax-M2.7 | 89.0% (146/164) |
- | **MiniMax-SLURPY** | **86.6% (142/164)** |
- | MiniMax-M2.5 | 85.4% (140/164) |

- SLURPY beats M2.5 by 2 problems while preserving coherent thinking-mode output, tool calling support, and the full MiniMax-M2 architecture.

## Architecture

- Identical to MiniMax-M2.5 / M2.7 — this is a weight merge, not an architecture change:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
@@ -51,29 +113,7 @@ Identical to MiniMax-M2.5 / M2.7 — this is a weight merge, not an architecture
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

- ## Merge Method
-
- **Per-tensor empirical SLERP** — each of the 96,103 checkpoint tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:
-
- ```
- delta(k) = 1 - cos(M2.5_k, M2.7_k)
- delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
- t(k) = 0.50 + 0.35 * delta_norm(k)
- ```
-
- - **Tensors that barely changed** (cos ≈ 1.0) get `t ≈ 0.50` — neutral midpoint blend
- - **Tensors that changed the most** (cos < 0.993, concentrated in layer 61 MoE experts) get `t = 0.85` — strong M2.7 bias
- - **FP8 weights** are dequantized to BF16 before SLERP, then re-quantized to FP8 with fresh block-wise scales
- - **Norms, gates, biases** use LERP in an fp32 accumulator
- - **model.norm.weight** passes through from M2.7 unchanged
-
- ### Forensic findings that drove the schedule
-
- A full-model forensic scan of all 96,103 tensor pairs revealed:
- - **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- - **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's training signal concentrates
- - **scale_inv is 0% bit-identical** between M2.5 and M2.7 — the original merge plan's pass-through assumption would have silently corrupted every FP8 tensor. All scale_inv tensors are recomputed after merging.
- - **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary habits, including improved import discipline

## Serving with vLLM

@@ -89,6 +129,7 @@ SAFETENSORS_FAST_GPU=1 vllm serve \
```

For 4x GPU (no expert parallel):
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
  Ex0bit/MiniMax-SLURPY --trust-remote-code \
@@ -114,7 +155,9 @@ If you encounter CUDA memory errors, add:

MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.

- ## Tool Calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

@@ -128,10 +171,13 @@ Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:t

Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.

## Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
@@ -145,37 +191,47 @@ tokenizer = AutoTokenizer.from_pretrained(
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
- input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
- output = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=1.0, top_p=0.95, top_k=40)

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

## Config notes

- - `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint despite the original config declaring them)
- - `quantization_config` is preserved — this model is native FP8, not dequantized
- - Chat template and tokenizer are copied from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
- - `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code (requires `trust_remote_code=True`)

- ## Merge code

- The full merge pipeline (forensics scan, per-tensor SLERP, FP8 dequant/requant, validation gates) is open:

- - Merge script: `merge_m25_m27.py`
- - Per-tensor schedule: `merge_core/schedule.py`
- - FP8 primitives: `merge_core/fp8_io.py`
- - SLERP: `merge_core/slerp.py`
- - Tensor classifier: `merge_core/tensor_classifier.py`
- - Benchmark harness: `bench/run_bench.py`

## Citation

@@ -191,5 +247,4 @@ The full merge pipeline (forensics scan, per-tensor SLERP, FP8 dequant/requant,
## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- - Merge infrastructure adapted from the [PRISM abliteration pipeline](https://github.com/exobit)
- - FP8 dequant/requant primitives derived from the MiniMax-M2.5-PRISM project
- minimax_m2
- code
- reasoning
+ - agents
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers


# MiniMax-SLURPY

+ **A mathematically unique blend of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.**

+ SLURPY inherits M2.5's architect-first coding style and MIT freedom, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering — without a single training step. It beats both parents on HumanEval pass@5 (89.6% vs. M2.5's 85.4%).

+ Every one of SLURPY's 48,239 weight tensors is a mathematically unique blend — copied from neither M2.5 nor M2.7, belonging entirely to neither parent.
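HumanEval pass@k numbers like those quoted here are conventionally computed with the unbiased combinatorial estimator; a minimal sketch (the 164/142 counts below are the SLURPY pass@1 figures from the previous revision of this card, used purely as an example):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that passed, k: budget.
    """
    if n - c < k:
        return 1.0  # fewer failures than the budget: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the plain pass rate,
# e.g. 142 of 164 HumanEval problems solved:
print(round(pass_at_k(164, 142, 1), 3))
```

The estimator averages this quantity over all problems when multiple samples per problem are drawn.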

+ ---
+
+ ## What SLURPY inherits
+
+ SLURPY's weights are a forensically driven interpolation of two complementary parents. The merge schedule is derived from a full-model scan of all 96,103 tensor pairs, tying each tensor's interpolation ratio to the empirically measured delta between the parents.
+
+ ### From M2.5 — the architect
+
+ M2.5 is the foundation-builder: strong on greenfield engineering, deep reasoning, and research-grade benchmarks.
+
+ | Benchmark | M2.5 Published |
|---|---|
+ | SWE-Bench Verified | **80.2%** |
+ | BrowseComp (with context mgmt) | **76.3%** |
+ | Multi-SWE-Bench | 51.3% |
+ | AIME 2025 | 86.3 |
+ | GPQA Diamond | 85.2 |
+ | SciCode | 44.4 |
+ | IFBench | 70.0 |
+ | HLE (w/o tools) | 19.4 |
+ | GDPval-MM (office work) | 59.0% avg win rate |
+
+ ### From M2.7 — the operator
+
+ M2.7 is the execution specialist: RL-tuned for multi-step tool use, terminal ops, agentic scaffolding, and production-grade software engineering.
+
+ | Benchmark | M2.7 Published |
+ |---|---|
+ | SWE-Pro | **56.2%** (matches GPT-5.3-Codex) |
+ | SWE Multilingual | **76.5%** |
+ | Multi-SWE-Bench | 52.7% |
+ | MLE Bench Lite | **66.6%** medal rate (22 ML competitions) |
+ | VIBE-Pro | **55.6%** (near Opus 4.6) |
+ | TerminalBench 2 | **57.0%** |
+ | NL2Repo | 39.8% |
+ | GDPval-AA ELO | **1495** (highest open-weight) |
+ | Toolathon | 46.3% accuracy |
+ | MM Claw (skill compliance) | **97%** across 40+ skills |
+ | MM Claw (end-to-end) | 62.7% (near Sonnet 4.6) |
+
+ ### SLURPY — best of both
+
+ SLURPY's merge schedule preserves M2.5's deep reasoning character in the early-to-mid layers (where the two models barely differ) while absorbing M2.7's agentic improvements in the late layers (where M2.7's training signal concentrates). The result is a model that carries both parents' strengths without the training cost of either.
+
+ ---
+
+ ## Merge method
+
+ **Per-tensor empirical SLERP** — each of the 48,239 mergeable weight tensors gets its own interpolation ratio `t(k)`, derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:
+
+ ```
+ delta(k) = 1 - cos(M2.5_k, M2.7_k)
+ delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
+ t(k) = 0.50 + 0.35 * delta_norm(k)
+ ```
+
+ - **Tensors that barely changed** (cos ≈ 1.0): `t ≈ 0.50` — neutral midpoint, preserving both parents
+ - **Tensors that changed the most** (layer 61 MoE experts): `t = 0.85` — absorbing M2.7's concentrated training signal
+ - **FP8 weights**: dequantized to BF16 before SLERP, re-quantized with fresh block-wise scales
+ - **No scale_inv pass-through**: forensics confirmed 0% bit-identical scales between parents — all 47,864 FP8 scale tensors are recomputed, not copied
+
+ ### Forensic highlights
+
+ - **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
+ - **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's RL training signal concentrates
+ - **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary-level improvements
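The schedule above can be sketched end to end. A minimal illustration, assuming SLERP is applied to each tensor flattened to a single vector; the `0.007` percentile value below is a stand-in, not the real measured `delta_p99`:

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation between two weight tensors, treated as flat vectors."""
    a_f, b_f = a.ravel(), b.ravel()
    cos = np.dot(a_f, b_f) / (np.linalg.norm(a_f) * np.linalg.norm(b_f))
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    if omega < 1e-7:
        return (1 - t) * a + t * b  # nearly parallel: fall back to LERP
    so = np.sin(omega)
    return (np.sin((1 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

def schedule_t(delta: float, delta_p99: float) -> float:
    """t(k) = 0.50 + 0.35 * clip(delta / delta_p99, 0, 1), per the formula above."""
    return 0.50 + 0.35 * float(np.clip(delta / delta_p99, 0.0, 1.0))

# A tensor whose delta sits at or above the 99th percentile hits the 0.85 ceiling:
print(schedule_t(0.020, 0.007))
```

An unchanged tensor (`delta = 0`) gets the neutral `t = 0.50` midpoint, matching the bullets above.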

+ ---

## Architecture

+ Identical to MiniMax-M2.5 / M2.7 — weight merge only, no architecture changes:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)

- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

+ ---

## Serving with vLLM

```

For 4x GPU (no expert parallel):
+
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
  Ex0bit/MiniMax-SLURPY --trust-remote-code \


MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.
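A minimal sketch of what "pass them back verbatim" means in practice (the message contents are illustrative): append the assistant's raw completion, `<think>` blocks included, when extending the history:

```python
# Illustrative only: keep <think> blocks verbatim in the running history.
messages = [{"role": "user", "content": "What is 2 + 2?"}]

# Suppose the model returned this raw completion:
raw = "<think>Trivial arithmetic.</think>2 + 2 = 4."

# Correct: append the raw text untouched, then continue the conversation.
messages.append({"role": "assistant", "content": raw})
messages.append({"role": "user", "content": "And times 3?"})

assert "<think>" in messages[1]["content"]  # thinking preserved for the next turn
```

Stripping the `<think>...</think>` span before re-sending is the failure mode this section warns against.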

+ ---
+
+ ## Tool calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
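If you are not running vLLM's parser, the wrapper format lends itself to simple client-side extraction; a sketch, assuming a JSON payload inside the wrapper (the payload shape is illustrative, not specified by this card):

```python
import json
import re

# Assumed raw model output: the wrapper tags come from this card,
# the JSON payload inside is illustrative only.
raw = (
    "Let me check the weather."
    "<minimax:tool_call>"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}'
    "</minimax:tool_call>"
)

# Pull out every wrapped tool-call payload, then parse it.
calls = re.findall(r"<minimax:tool_call>(.*?)</minimax:tool_call>", raw, re.DOTALL)
call = json.loads(calls[0])
print(call["name"])
```

With vLLM's `--tool-call-parser minimax_m2`, this extraction is done server-side and surfaced through the OpenAI-compatible API instead.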

+ ---
+
## Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",

)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)

with torch.no_grad():
+     output = model.generate(
+         input_ids,
+         max_new_tokens=2048,
+         do_sample=True,
+         temperature=1.0,
+         top_p=0.95,
+         top_k=40,
+     )

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

+ ---
+
## Config notes

+ - `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint)
+ - `quantization_config` is preserved — native FP8
+ - Chat template and tokenizer are sourced from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
+ - `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code
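The block-wise scale layout can be illustrated with a small NumPy sketch. This only mimics the scale computation (one scale per `[block, block]` tile, using 448 as the largest finite `float8_e4m3fn` value); the real pipeline casts to the FP8 dtype and stores companion `scale_inv` tensors, and the tiny tensor and reduced block size here are for illustration only:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def blockwise_scales(w: np.ndarray, block: int = 128) -> np.ndarray:
    """One scale per [block, block] tile so each tile's max |w| maps to E4M3_MAX."""
    rows = -(-w.shape[0] // block)  # ceiling division for ragged edges
    cols = -(-w.shape[1] // block)
    scales = np.empty((rows, cols), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            blk = w[i * block:(i + 1) * block, j * block:(j + 1) * block]
            amax = float(np.max(np.abs(blk)))
            scales[i, j] = amax / E4M3_MAX if amax > 0 else 1.0
    return scales

# Toy 2x2 tensor, one 2x2 block: max |w| = 4.0, so the scale is 4/448.
w = np.array([[1.0, -2.0], [0.5, 4.0]], dtype=np.float32)
print(blockwise_scales(w, block=2))
```

Recomputing these scales after the merge, rather than copying either parent's `scale_inv`, is the "no pass-through" rule described in the merge-method section.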
+
+ ---
+
+ ## License

+ Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.

+ The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.

+ ---

## Citation

## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
+ - Merge infrastructure adapted from the PRISM abliteration pipeline