---
license: apache-2.0
base_model:
- MiniMaxAI/MiniMax-M2.5
- MiniMaxAI/MiniMax-M2.7
tags:
- merge
- slerp
- moe
- fp8
- minimax
- minimax_m2
- code
- reasoning
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
---

# MiniMax-SLURPY

**A per-tensor empirical SLERP merge of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) in native FP8.**

MiniMax-SLURPY combines M2.5's logical precision with M2.7's improved code generation and instruction following, with no additional training, fine-tuning, or RL. The merge is driven entirely by a full-model forensic analysis of the 96,103 tensor pairs between the two parent models.

## Results

| Model | HumanEval pass@1 |
|---|---|
| MiniMax-M2.7 | 89.0% (146/164) |
| **MiniMax-SLURPY** | **86.6% (142/164)** |
| MiniMax-M2.5 | 85.4% (140/164) |

SLURPY beats M2.5 by 2 problems while preserving coherent thinking-mode output, tool-calling support, and the full MiniMax-M2 architecture.

## Architecture

Identical to MiniMax-M2.5 / M2.7 — this is a weight merge, not an architecture change:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: FP8 (`float8_e4m3fn`), block size [128, 128]
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

## Merge Method

**Per-tensor empirical SLERP** — each of the 96,103 checkpoint tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:

```
delta(k)      = 1 - cos(M2.5_k, M2.7_k)
delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
t(k)          = 0.50 + 0.35 * delta_norm(k)
```
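
Equivalently, as a minimal PyTorch sketch (illustrative only: the published schedule lives in `merge_core/schedule.py`, and `delta_p99` is the 99th-percentile delta taken from the full forensic scan):

```python
import torch

def slerp_ratio(w_a: torch.Tensor, w_b: torch.Tensor, delta_p99: float) -> float:
    """Per-tensor interpolation ratio t(k), derived from measured cosine similarity."""
    a = w_a.flatten().float()
    b = w_b.flatten().float()
    cos = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    delta = 1.0 - cos                                    # delta(k)
    delta_norm = min(max(delta / delta_p99, 0.0), 1.0)   # clip(delta / delta_p99, 0, 1)
    return 0.50 + 0.35 * delta_norm                      # 0.50 neutral .. 0.85 strong M2.7 bias
```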

- **Tensors that barely changed** (cos ≈ 1.0) get `t ≈ 0.50` — neutral midpoint blend
- **Tensors that changed the most** (cos < 0.993, concentrated in layer 61 MoE experts) get `t = 0.85` — strong M2.7 bias
- **FP8 weights** are dequantized to BF16 before SLERP, then re-quantized to FP8 with fresh block-wise scales (see the sketch after this list)
- **Norms, gates, biases** use LERP in an fp32 accumulator
- **model.norm.weight** passes through from M2.7 unchanged
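
A minimal sketch of the spherical interpolation step itself, assuming both inputs have already been dequantized to BF16. The function name and fallback threshold are illustrative; the production code in `merge_core/slerp.py` additionally re-quantizes the result to FP8 with fresh `[128, 128]` block scales:

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical interpolation between two weight tensors, accumulated in fp32."""
    a = w_a.float().flatten()
    b = w_b.float().flatten()
    # Angle between the two weight vectors.
    cos_omega = torch.clamp(torch.dot(a, b) / ((a.norm() + eps) * (b.norm() + eps)), -1.0, 1.0)
    omega = torch.acos(cos_omega)
    if omega.abs().item() < 1e-4:
        # Nearly parallel tensors: plain LERP avoids dividing by sin(omega) ~ 0.
        out = (1.0 - t) * a + t * b
    else:
        sin_omega = torch.sin(omega)
        out = (torch.sin((1.0 - t) * omega) / sin_omega) * a \
            + (torch.sin(t * omega) / sin_omega) * b
    return out.reshape(w_a.shape).to(w_a.dtype)
```

The LERP fallback for nearly parallel tensors matters here because the vast majority of tensor pairs sit very close to cos = 1 (see the forensic findings below), where the spherical form is numerically unstable.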

### Forensic findings that drove the schedule

A full-model forensic scan of all 96,103 tensor pairs revealed (a minimal version of the scan is sketched after this list):
- **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's training signal concentrates
- **scale_inv is 0% bit-identical** between M2.5 and M2.7 — the original merge plan's pass-through assumption would have silently corrupted every FP8 tensor. All scale_inv tensors are recomputed after merging.
- **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary habits, including improved import discipline
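
A stripped-down version of the scan, for reference. The real pipeline walks all 43 shards of both checkpoints and dequantizes FP8 weights with their `scale_inv` before comparing; the paths and function name here are placeholders:

```python
import torch
from safetensors import safe_open

def cosine_report(path_a: str, path_b: str) -> dict:
    """Per-tensor cosine similarity between two shards that share the same tensor names."""
    report = {}
    with safe_open(path_a, framework="pt") as fa, safe_open(path_b, framework="pt") as fb:
        for name in fa.keys():
            a = fa.get_tensor(name).float().flatten()
            b = fb.get_tensor(name).float().flatten()
            report[name] = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    return report
```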

## Serving with vLLM

Recommended command (8x H100 80GB):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    Ex0bit/MiniMax-SLURPY --trust-remote-code \
    --enable-expert-parallel --tensor-parallel-size 8 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enforce-eager
```

For 4x GPU (no expert parallel):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    Ex0bit/MiniMax-SLURPY --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```

If you encounter CUDA memory errors, add:

```bash
--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```

### Recommended sampling parameters

| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
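
For example, against the OpenAI-compatible endpoint that `vllm serve` exposes (base URL and prompt are illustrative and assume the serve command above; `top_k` is a vLLM extension passed via `extra_body`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Ex0bit/MiniMax-SLURPY",
    messages=[{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}],
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # top_k is not part of the OpenAI schema, so it goes in extra_body
)
print(response.choices[0].message.content)
```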

### Important: preserve thinking in conversation history

MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.
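
In practice that means the assistant turn you append to the message list keeps the reasoning block exactly as the model produced it. A minimal sketch (the content strings are abbreviated examples):

```python
history = [
    {"role": "user", "content": "Plan the refactor first, then write the code."},
    {
        "role": "assistant",
        # Keep the <think>...</think> block verbatim when feeding the turn back in.
        "content": "<think>The user wants a plan before any code, so ...</think>Here is the plan: ...",
    },
    {"role": "user", "content": "Looks good, now implement step 1."},
]
```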

## Tool Calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```

Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
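
With those flags enabled, tools can be declared through the standard OpenAI schema and vLLM parses the `<minimax:tool_call>` output into structured tool calls. A minimal sketch (the `get_weather` tool is just an example, matching the snippet above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Ex0bit/MiniMax-SLURPY",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```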

## Using with Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids, max_new_tokens=2048, do_sample=True, temperature=1.0, top_p=0.95, top_k=40
    )

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

## Config notes

- `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint despite the original config declaring them)
- `quantization_config` is preserved — this model is native FP8, not dequantized
- Chat template and tokenizer are copied from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code (requires `trust_remote_code=True`)
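
To sanity-check the layout of a downloaded shard, a small inspection sketch. The shard filename follows the usual HF naming pattern and the `scale_inv` name match is an assumption based on the notes above:

```python
import torch
from safetensors import safe_open

# Illustrative shard name; point this at any of the 43 downloaded shards.
with safe_open("model-00001-of-00043.safetensors", framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        if "scale_inv" in name:
            print(name, tuple(t.shape), t.dtype)          # one scale per [128, 128] block
        elif t.dtype == torch.float8_e4m3fn:
            print(name, tuple(t.shape), "float8_e4m3fn")  # raw FP8 weight blocks
```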

## Merge code

The full merge pipeline (forensic scan, per-tensor SLERP, FP8 dequant/requant, validation gates) is open:

- Merge script: `merge_m25_m27.py`
- Per-tensor schedule: `merge_core/schedule.py`
- FP8 primitives: `merge_core/fp8_io.py`
- SLERP: `merge_core/slerp.py`
- Tensor classifier: `merge_core/tensor_classifier.py`
- Benchmark harness: `bench/run_bench.py`

## Citation

```bibtex
@misc{minimax-slurpy-2026,
  title={MiniMax-SLURPY: Per-tensor empirical SLERP merge of MiniMax-M2.5 and M2.7},
  author={Ex0bit},
  year={2026},
  url={https://huggingface.co/Ex0bit/MiniMax-SLURPY}
}
```

## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- Merge infrastructure adapted from the [PRISM abliteration pipeline](https://github.com/exobit)
- FP8 dequant/requant primitives derived from the MiniMax-M2.5-PRISM project