Ex0bit committed on
Commit f30905c · verified · 1 Parent(s): 54db681

Upload README.md with huggingface_hub

Files changed (1): README.md +104 -49

README.md CHANGED
@@ -14,6 +14,7 @@ tags:
- minimax_m2
- code
- reasoning
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
@@ -21,23 +22,84 @@ library_name: transformers

# MiniMax-SLURPY

- **A per-tensor empirical SLERP merge of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) in native FP8.**

- MiniMax-SLURPY combines M2.5's logic precision with M2.7's improved code generation and instruction following — without any additional training, fine-tuning, or RL. The merge is driven entirely by a full-model forensic analysis of the 96,103 tensor pairs between the two parent models.

- ## Results

- | Model | HumanEval pass@1 |
|---|---|
- | MiniMax-M2.7 | 89.0% (146/164) |
- | **MiniMax-SLURPY** | **86.6% (142/164)** |
- | MiniMax-M2.5 | 85.4% (140/164) |

- SLURPY beats M2.5 by 2 problems while preserving coherent thinking-mode output, tool calling support, and the full MiniMax-M2 architecture.

## Architecture

- Identical to MiniMax-M2.5 / M2.7 — this is a weight merge, not an architecture change:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
@@ -51,29 +113,7 @@ Identical to MiniMax-M2.5 / M2.7 — this is a weight merge, not an architecture
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

- ## Merge Method
-
- **Per-tensor empirical SLERP** — each of the 96,103 checkpoint tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:
-
- ```
- delta(k) = 1 - cos(M2.5_k, M2.7_k)
- delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
- t(k) = 0.50 + 0.35 * delta_norm(k)
- ```
-
- - **Tensors that barely changed** (cos ≈ 1.0) get `t ≈ 0.50` — neutral midpoint blend
- - **Tensors that changed the most** (cos < 0.993, concentrated in layer 61 MoE experts) get `t = 0.85` — strong M2.7 bias
- - **FP8 weights** are dequantized to BF16 before SLERP, then re-quantized to FP8 with fresh block-wise scales
- - **Norms, gates, biases** use LERP in an fp32 accumulator
- - **model.norm.weight** passes through from M2.7 unchanged
-
- ### Forensic findings that drove the schedule
-
- A full-model forensic scan of all 96,103 tensor pairs revealed:
- - **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- - **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's training signal concentrates
- - **scale_inv is 0% bit-identical** between M2.5 and M2.7 — the original merge plan's pass-through assumption would have silently corrupted every FP8 tensor. All scale_inv tensors are recomputed after merging.
- - **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary habits, including improved import discipline

## Serving with vLLM

@@ -89,6 +129,7 @@ SAFETENSORS_FAST_GPU=1 vllm serve \
```

For 4x GPU (no expert parallel):
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
  Ex0bit/MiniMax-SLURPY --trust-remote-code \
@@ -114,7 +155,9 @@ If you encounter CUDA memory errors, add:

MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.

- ## Tool Calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

@@ -128,10 +171,13 @@ Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:t

Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.

## Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
@@ -145,37 +191,47 @@ tokenizer = AutoTokenizer.from_pretrained(
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
- input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
- output = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=1.0, top_p=0.95, top_k=40)

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

## Config notes

- - `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint despite the original config declaring them)
- - `quantization_config` is preserved — this model is native FP8, not dequantized
- - Chat template and tokenizer are copied from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
- - `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code (requires `trust_remote_code=True`)

- ## Merge code

- The full merge pipeline (forensics scan, per-tensor SLERP, FP8 dequant/requant, validation gates) is open:

- - Merge script: `merge_m25_m27.py`
- - Per-tensor schedule: `merge_core/schedule.py`
- - FP8 primitives: `merge_core/fp8_io.py`
- - SLERP: `merge_core/slerp.py`
- - Tensor classifier: `merge_core/tensor_classifier.py`
- - Benchmark harness: `bench/run_bench.py`

## Citation

@@ -191,5 +247,4 @@ The full merge pipeline (forensics scan, per-tensor SLERP, FP8 dequant/requant,
## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- - Merge infrastructure adapted from the [PRISM abliteration pipeline](https://github.com/exobit)
- - FP8 dequant/requant primitives derived from the MiniMax-M2.5-PRISM project
- minimax_m2
- code
- reasoning
+ - agents
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers


# MiniMax-SLURPY

+ **A mathematically unique blend of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.**

+ SLURPY inherits M2.5's architect-first coding style and MIT freedom, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering — without a single training step. It beats both parents on HumanEval pass@5 (89.6% vs. M2.5's 85.4%).

+ Every one of SLURPY's 48,239 weight tensors is a mathematically unique blend — copied from neither M2.5 nor M2.7, belonging entirely to neither parent.
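HumanEval pass@k numbers like those quoted here are conventionally computed with the unbiased combinatorial estimator; a minimal sketch (the 164/142 counts below are the SLURPY pass@1 figures from the previous revision of this card, used purely as an example):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that passed, k: budget.
    """
    if n - c < k:
        return 1.0  # fewer failures than the budget: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the plain pass rate,
# e.g. 142 of 164 HumanEval problems solved:
print(round(pass_at_k(164, 142, 1), 3))
```

The estimator averages this quantity over all problems when multiple samples per problem are drawn.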

+ ---
+
+ ## What SLURPY inherits
+
+ SLURPY's weights are a forensically driven interpolation of two complementary parents. The merge schedule is derived from a full-model scan of all 96,103 tensor pairs, tying each tensor's interpolation ratio to the empirically measured delta between the parents.
+
+ ### From M2.5 — the architect
+
+ M2.5 is the foundation-builder: strong on greenfield engineering, deep reasoning, and research-grade benchmarks.
+
+ | Benchmark | M2.5 Published |
|---|---|
+ | SWE-Bench Verified | **80.2%** |
+ | BrowseComp (with context mgmt) | **76.3%** |
+ | Multi-SWE-Bench | 51.3% |
+ | AIME 2025 | 86.3 |
+ | GPQA Diamond | 85.2 |
+ | SciCode | 44.4 |
+ | IFBench | 70.0 |
+ | HLE (w/o tools) | 19.4 |
+ | GDPval-MM (office work) | 59.0% avg win rate |
+
+ ### From M2.7 — the operator
+
+ M2.7 is the execution specialist: RL-tuned for multi-step tool use, terminal ops, agentic scaffolding, and production-grade software engineering.
+
+ | Benchmark | M2.7 Published |
+ |---|---|
+ | SWE-Pro | **56.2%** (matches GPT-5.3-Codex) |
+ | SWE Multilingual | **76.5%** |
+ | Multi-SWE-Bench | 52.7% |
+ | MLE Bench Lite | **66.6%** medal rate (22 ML competitions) |
+ | VIBE-Pro | **55.6%** (near Opus 4.6) |
+ | TerminalBench 2 | **57.0%** |
+ | NL2Repo | 39.8% |
+ | GDPval-AA ELO | **1495** (highest open-weight) |
+ | Toolathon | 46.3% accuracy |
+ | MM Claw (skill compliance) | **97%** across 40+ skills |
+ | MM Claw (end-to-end) | 62.7% (near Sonnet 4.6) |
+
+ ### SLURPY — best of both
+
+ SLURPY's merge schedule preserves M2.5's deep reasoning character in the early-to-mid layers (where the two models barely differ) while absorbing M2.7's agentic improvements in the late layers (where M2.7's training signal concentrates). The result is a model that carries both parents' strengths without the training cost of either.
+
+ ---
+
+ ## Merge method
+
+ **Per-tensor empirical SLERP** — each of the 48,239 mergeable weight tensors gets its own interpolation ratio `t(k)`, derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:
+
+ ```
+ delta(k) = 1 - cos(M2.5_k, M2.7_k)
+ delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
+ t(k) = 0.50 + 0.35 * delta_norm(k)
+ ```
+
+ - **Tensors that barely changed** (cos ≈ 1.0): `t ≈ 0.50` — neutral midpoint, preserving both parents
+ - **Tensors that changed the most** (layer 61 MoE experts): `t = 0.85` — absorbing M2.7's concentrated training signal
+ - **FP8 weights**: dequantized to BF16 before SLERP, re-quantized with fresh block-wise scales
+ - **No scale_inv pass-through**: forensics confirmed 0% bit-identical scales between parents — all 47,864 FP8 scale tensors are recomputed, not copied
+
+ ### Forensic highlights
+
+ - **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
+ - **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's RL training signal concentrates
+ - **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary-level improvements
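The schedule above can be sketched end to end. A minimal illustration, assuming SLERP is applied to each tensor flattened to a single vector; the `0.007` percentile value below is a stand-in, not the real measured `delta_p99`:

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation between two weight tensors, treated as flat vectors."""
    a_f, b_f = a.ravel(), b.ravel()
    cos = np.dot(a_f, b_f) / (np.linalg.norm(a_f) * np.linalg.norm(b_f))
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    if omega < 1e-7:
        return (1 - t) * a + t * b  # nearly parallel: fall back to LERP
    so = np.sin(omega)
    return (np.sin((1 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

def schedule_t(delta: float, delta_p99: float) -> float:
    """t(k) = 0.50 + 0.35 * clip(delta / delta_p99, 0, 1), per the formula above."""
    return 0.50 + 0.35 * float(np.clip(delta / delta_p99, 0.0, 1.0))

# A tensor whose delta sits at or above the 99th percentile hits the 0.85 ceiling:
print(schedule_t(0.020, 0.007))
```

An unchanged tensor (`delta = 0`) gets the neutral `t = 0.50` midpoint, matching the bullets above.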

+ ---

## Architecture

+ Identical to MiniMax-M2.5 / M2.7 — weight merge only, no architecture changes:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)

- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

+ ---

## Serving with vLLM

```

For 4x GPU (no expert parallel):
+
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
  Ex0bit/MiniMax-SLURPY --trust-remote-code \


MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.
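A minimal sketch of what "pass them back verbatim" means in practice (the message contents are illustrative): append the assistant's raw completion, `<think>` blocks included, when extending the history:

```python
# Illustrative only: keep <think> blocks verbatim in the running history.
messages = [{"role": "user", "content": "What is 2 + 2?"}]

# Suppose the model returned this raw completion:
raw = "<think>Trivial arithmetic.</think>2 + 2 = 4."

# Correct: append the raw text untouched, then continue the conversation.
messages.append({"role": "assistant", "content": raw})
messages.append({"role": "user", "content": "And times 3?"})

assert "<think>" in messages[1]["content"]  # thinking preserved for the next turn
```

Stripping the `<think>...</think>` span before re-sending is the failure mode this section warns against.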

+ ---
+
+ ## Tool calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
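If you are not running vLLM's parser, the wrapper format lends itself to simple client-side extraction; a sketch, assuming a JSON payload inside the wrapper (the payload shape is illustrative, not specified by this card):

```python
import json
import re

# Assumed raw model output: the wrapper tags come from this card,
# the JSON payload inside is illustrative only.
raw = (
    "Let me check the weather."
    "<minimax:tool_call>"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}'
    "</minimax:tool_call>"
)

# Pull out every wrapped tool-call payload, then parse it.
calls = re.findall(r"<minimax:tool_call>(.*?)</minimax:tool_call>", raw, re.DOTALL)
call = json.loads(calls[0])
print(call["name"])
```

With vLLM's `--tool-call-parser minimax_m2`, this extraction is done server-side and surfaced through the OpenAI-compatible API instead.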

+ ---
+
## Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",

)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)

with torch.no_grad():
+     output = model.generate(
+         input_ids,
+         max_new_tokens=2048,
+         do_sample=True,
+         temperature=1.0,
+         top_p=0.95,
+         top_k=40,
+     )

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

+ ---
+
## Config notes

+ - `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint)
+ - `quantization_config` is preserved — native FP8
+ - Chat template and tokenizer are sourced from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
+ - `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code
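The block-wise scale layout can be illustrated with a small NumPy sketch. This only mimics the scale computation (one scale per `[block, block]` tile, using 448 as the largest finite `float8_e4m3fn` value); the real pipeline casts to the FP8 dtype and stores companion `scale_inv` tensors, and the tiny tensor and reduced block size here are for illustration only:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def blockwise_scales(w: np.ndarray, block: int = 128) -> np.ndarray:
    """One scale per [block, block] tile so each tile's max |w| maps to E4M3_MAX."""
    rows = -(-w.shape[0] // block)  # ceiling division for ragged edges
    cols = -(-w.shape[1] // block)
    scales = np.empty((rows, cols), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            blk = w[i * block:(i + 1) * block, j * block:(j + 1) * block]
            amax = float(np.max(np.abs(blk)))
            scales[i, j] = amax / E4M3_MAX if amax > 0 else 1.0
    return scales

# Toy 2x2 tensor, one 2x2 block: max |w| = 4.0, so the scale is 4/448.
w = np.array([[1.0, -2.0], [0.5, 4.0]], dtype=np.float32)
print(blockwise_scales(w, block=2))
```

Recomputing these scales after the merge, rather than copying either parent's `scale_inv`, is the "no pass-through" rule described in the merge-method section.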
+
+ ---
+
+ ## License

+ Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.

+ The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.

+ ---

## Citation

## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
+ - Merge infrastructure adapted from the PRISM abliteration pipeline