---
license: apache-2.0
base_model:
- MiniMaxAI/MiniMax-M2.5
- MiniMaxAI/MiniMax-M2.7
tags:
- merge
- slerp
- moe
- fp8
- minimax
- minimax_m2
- code
- reasoning
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
---

# MiniMax-SLURPY

**A per-tensor empirical SLERP merge of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) in native FP8.**

MiniMax-SLURPY combines M2.5's logical precision with M2.7's improved code generation and instruction following, with no additional training, fine-tuning, or RL. The merge is driven entirely by a full-model forensic analysis of the 96,103 tensor pairs between the two parent models.

## Results

| Model | HumanEval pass@1 |
|---|---|
| MiniMax-M2.7 | 89.0% (146/164) |
| **MiniMax-SLURPY** | **86.6% (142/164)** |
| MiniMax-M2.5 | 85.4% (140/164) |

SLURPY beats M2.5 by 2 problems while preserving coherent thinking-mode output, tool-calling support, and the full MiniMax-M2 architecture.

## Architecture

Identical to MiniMax-M2.5 / M2.7 — this is a weight merge, not an architecture change:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: FP8 (`float8_e4m3fn`), block size [128, 128]
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

## Merge Method

**Per-tensor empirical SLERP** — each of the 96,103 checkpoint tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:

```
delta(k)      = 1 - cos(M2.5_k, M2.7_k)
delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
t(k)          = 0.50 + 0.35 * delta_norm(k)
```
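
Equivalently, as a minimal PyTorch sketch (illustrative only: the published schedule lives in `merge_core/schedule.py`, and `delta_p99` is the 99th-percentile delta taken from the full forensic scan):

```python
import torch

def slerp_ratio(w_a: torch.Tensor, w_b: torch.Tensor, delta_p99: float) -> float:
    """Per-tensor interpolation ratio t(k), derived from measured cosine similarity."""
    a = w_a.flatten().float()
    b = w_b.flatten().float()
    cos = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    delta = 1.0 - cos                                    # delta(k)
    delta_norm = min(max(delta / delta_p99, 0.0), 1.0)   # clip(delta / delta_p99, 0, 1)
    return 0.50 + 0.35 * delta_norm                      # 0.50 neutral .. 0.85 strong M2.7 bias
```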

- **Tensors that barely changed** (cos ≈ 1.0) get `t ≈ 0.50` — neutral midpoint blend
- **Tensors that changed the most** (cos < 0.993, concentrated in layer 61 MoE experts) get `t = 0.85` — strong M2.7 bias
- **FP8 weights** are dequantized to BF16 before SLERP, then re-quantized to FP8 with fresh block-wise scales (see the sketch after this list)
- **Norms, gates, biases** use LERP in an fp32 accumulator
- **model.norm.weight** passes through from M2.7 unchanged
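
A minimal sketch of the spherical interpolation step itself, assuming both inputs have already been dequantized to BF16. The function name and fallback threshold are illustrative; the production code in `merge_core/slerp.py` additionally re-quantizes the result to FP8 with fresh `[128, 128]` block scales:

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical interpolation between two weight tensors, accumulated in fp32."""
    a = w_a.float().flatten()
    b = w_b.float().flatten()
    # Angle between the two weight vectors.
    cos_omega = torch.clamp(torch.dot(a, b) / ((a.norm() + eps) * (b.norm() + eps)), -1.0, 1.0)
    omega = torch.acos(cos_omega)
    if omega.abs().item() < 1e-4:
        # Nearly parallel tensors: plain LERP avoids dividing by sin(omega) ~ 0.
        out = (1.0 - t) * a + t * b
    else:
        sin_omega = torch.sin(omega)
        out = (torch.sin((1.0 - t) * omega) / sin_omega) * a \
            + (torch.sin(t * omega) / sin_omega) * b
    return out.reshape(w_a.shape).to(w_a.dtype)
```

The LERP fallback for nearly parallel tensors matters here because the vast majority of tensor pairs sit very close to cos = 1 (see the forensic findings below), where the spherical form is numerically unstable.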

### Forensic findings that drove the schedule

A full-model forensic scan of all 96,103 tensor pairs revealed (a minimal version of the scan is sketched after this list):
- **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's training signal concentrates
- **scale_inv is 0% bit-identical** between M2.5 and M2.7 — the original merge plan's pass-through assumption would have silently corrupted every FP8 tensor. All scale_inv tensors are recomputed after merging.
- **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary habits, including improved import discipline
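
A stripped-down version of the scan, for reference. The real pipeline walks all 43 shards of both checkpoints and dequantizes FP8 weights with their `scale_inv` before comparing; the paths and function name here are placeholders:

```python
import torch
from safetensors import safe_open

def cosine_report(path_a: str, path_b: str) -> dict:
    """Per-tensor cosine similarity between two shards that share the same tensor names."""
    report = {}
    with safe_open(path_a, framework="pt") as fa, safe_open(path_b, framework="pt") as fb:
        for name in fa.keys():
            a = fa.get_tensor(name).float().flatten()
            b = fb.get_tensor(name).float().flatten()
            report[name] = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    return report
```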

## Serving with vLLM

Recommended command (8x H100 80GB):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    Ex0bit/MiniMax-SLURPY --trust-remote-code \
    --enable-expert-parallel --tensor-parallel-size 8 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enforce-eager
```

For 4x GPU (no expert parallel):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    Ex0bit/MiniMax-SLURPY --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```

If you encounter CUDA memory errors, add:

```bash
--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```

### Recommended sampling parameters

| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
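
For example, against the OpenAI-compatible endpoint that `vllm serve` exposes (base URL and prompt are illustrative and assume the serve command above; `top_k` is a vLLM extension passed via `extra_body`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Ex0bit/MiniMax-SLURPY",
    messages=[{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}],
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # top_k is not part of the OpenAI schema, so it goes in extra_body
)
print(response.choices[0].message.content)
```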

### Important: preserve thinking in conversation history

MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.
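
In practice that means the assistant turn you append to the message list keeps the reasoning block exactly as the model produced it. A minimal sketch (the content strings are abbreviated examples):

```python
history = [
    {"role": "user", "content": "Plan the refactor first, then write the code."},
    {
        "role": "assistant",
        # Keep the <think>...</think> block verbatim when feeding the turn back in.
        "content": "<think>The user wants a plan before any code, so ...</think>Here is the plan: ...",
    },
    {"role": "user", "content": "Looks good, now implement step 1."},
]
```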

## Tool Calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```

Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
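
With those flags enabled, tools can be declared through the standard OpenAI schema and vLLM parses the `<minimax:tool_call>` output into structured tool calls. A minimal sketch (the `get_weather` tool is just an example, matching the snippet above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Ex0bit/MiniMax-SLURPY",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```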

## Using with Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids, max_new_tokens=2048, do_sample=True, temperature=1.0, top_p=0.95, top_k=40
    )

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

## Config notes

- `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint despite the original config declaring them)
- `quantization_config` is preserved — this model is native FP8, not dequantized
- Chat template and tokenizer are copied from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code (requires `trust_remote_code=True`)
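
To sanity-check the layout of a downloaded shard, a small inspection sketch. The shard filename follows the usual HF naming pattern and the `scale_inv` name match is an assumption based on the notes above:

```python
import torch
from safetensors import safe_open

# Illustrative shard name; point this at any of the 43 downloaded shards.
with safe_open("model-00001-of-00043.safetensors", framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        if "scale_inv" in name:
            print(name, tuple(t.shape), t.dtype)          # one scale per [128, 128] block
        elif t.dtype == torch.float8_e4m3fn:
            print(name, tuple(t.shape), "float8_e4m3fn")  # raw FP8 weight blocks
```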

## Merge code

The full merge pipeline (forensic scan, per-tensor SLERP, FP8 dequant/requant, validation gates) is open:

- Merge script: `merge_m25_m27.py`
- Per-tensor schedule: `merge_core/schedule.py`
- FP8 primitives: `merge_core/fp8_io.py`
- SLERP: `merge_core/slerp.py`
- Tensor classifier: `merge_core/tensor_classifier.py`
- Benchmark harness: `bench/run_bench.py`

## Citation

```bibtex
@misc{minimax-slurpy-2026,
  title={MiniMax-SLURPY: Per-tensor empirical SLERP merge of MiniMax-M2.5 and M2.7},
  author={Ex0bit},
  year={2026},
  url={https://huggingface.co/Ex0bit/MiniMax-SLURPY}
}
```

## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- Merge infrastructure adapted from the [PRISM abliteration pipeline](https://github.com/exobit)
- FP8 dequant/requant primitives derived from the MiniMax-M2.5-PRISM project