---
base_model: MiniMaxAI/MiniMax-M2.7
library_name: transformers
pipeline_tag: text-generation
license: other
license_name: minimax-model-license
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
tags:
- minimax
- m2.7
- moe
- quantized
- rotorquant
- kv-cache-quantization
---

# MiniMax-M2.7-RotorQuant

**KV-cache quantized variant of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) using RotorQuant compression.**

## Overview

MiniMax-M2.7 is a 256-expert Mixture-of-Experts (MoE) model with 8 experts active per token, totaling approximately 456 billion parameters. This variant applies **RotorQuant** KV-cache quantization, which uses Hadamard rotations to spread outlier magnitudes across channels before quantizing the KV cache.

RotorQuant applies a rotation matrix (a Hadamard transform) to keys and values before quantization, smoothing the activation distribution. Because the rotation is orthogonal, it can be undone exactly after dequantization, and it yields better quality retention than naive per-channel methods, especially at aggressive bit-widths.

| Property | Value |
|---|---|
| Architecture | MoE (256 experts, 8 active/token) |
| Total Parameters | ~456B |
| Layers | 62 |
| Hidden Size | 3072 |
| Attention Heads | 48 |
| Quantization | RotorQuant (KV-cache) |
| Base Model | MiniMaxAI/MiniMax-M2.7 |

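To make the rotation intuition concrete, here is a minimal NumPy sketch of the general rotate-then-quantize idea. It is illustrative only, not the model's actual kernels: the `hadamard` helper, the per-tensor int4 rounding, and the injected outlier channel are all assumptions for the demo.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix via Sylvester construction (n = 2^k)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # rows are orthonormal, so H @ H.T == I

def quantize_int4(x: np.ndarray):
    """Symmetric per-tensor int4: a single scale, levels in [-8, 7]."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

rng = np.random.default_rng(0)
keys = rng.normal(size=(1024, 64))  # toy keys; head_dim = 3072 / 48 = 64
keys[:, 3] *= 30.0                  # one outlier channel, typical of KV activations

H = hadamard(64)

# Naive: the outlier channel inflates the single scale for everything.
q, s = quantize_int4(keys)
err_naive = np.abs(q * s - keys).mean()

# Rotated: quantize keys @ H, dequantize, rotate back with H.T (exact inverse).
q_rot, s_rot = quantize_int4(keys @ H)
err_rotated = np.abs((q_rot * s_rot) @ H.T - keys).mean()

print(f"naive int4 error:   {err_naive:.4f}")
print(f"rotated int4 error: {err_rotated:.4f}")  # noticeably smaller
```

Whatever the production details, the mechanism is the same: the orthogonal rotation spreads the outlier's energy across all channels, so the quantizer's dynamic range is no longer hostage to a single channel.
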
## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "majentik/MiniMax-M2.7-RotorQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Enable RotorQuant (IsoQuant) KV-cache quantization
from transformers import IsoQuantCache

past_key_values = IsoQuantCache(model.config)

messages = [{"role": "user", "content": "What is a Comprehensive Geriatric Assessment?"}]
# add_generation_prompt appends the assistant turn header so the model
# answers the question instead of continuing the user message.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    past_key_values=past_key_values,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## RotorQuant vs TurboQuant

| Feature | RotorQuant | TurboQuant |
|---|---|---|
| Technique | Rotation-based KV quantization (Hadamard transform) | Asymmetric per-channel KV quantization |
| Throughput | Slightly lower (rotation overhead) | Higher throughput, lower latency |
| Quality | Better retention at low bit-widths | Good retention |
| Best For | Quality-sensitive tasks, research | High-throughput serving, long contexts |

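For contrast, here is an equally minimal sketch of asymmetric per-channel quantization, the approach the table attributes to TurboQuant (again illustrative NumPy under the same toy setup, not the shipped implementation). Each channel carries its own scale and zero-point, so an outlier channel is tolerated without any rotation matmul in the hot path.

```python
import numpy as np

def quantize_asym_per_channel(x: np.ndarray, bits: int = 4):
    """Asymmetric per-channel quantization: per-channel scale and zero-point."""
    levels = 2**bits - 1                         # 15 levels for int4
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / levels  # guard constant channels
    q = np.round((x - lo) / scale)               # codes in [0, levels]
    return q, scale, lo

rng = np.random.default_rng(0)
values = rng.normal(size=(1024, 64))
values[:, 3] *= 30.0  # same outlier channel as in the rotation sketch

q, scale, lo = quantize_asym_per_channel(values)
err = np.abs((q * scale + lo) - values).mean()
print(f"asymmetric per-channel int4 error: {err:.4f}")
```

Design-wise, the per-channel metadata is cheap and there is no rotation to fuse into the attention kernels, which is consistent with the throughput and quality columns above.
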
## Memory Estimates (Apple Silicon)

At roughly 456B parameters, the model weights dominate memory usage even with KV-cache quantization; quantizing the KV cache primarily reduces the incremental cost of long-context inference, not the baseline footprint.

| Configuration | Estimated Memory |
|---|---|
| FP16 weights + RotorQuant KV | ~912 GB |
| 8-bit weights + RotorQuant KV | ~456 GB |
| 4-bit weights + RotorQuant KV | ~228 GB |

> **Note**: This model requires substantial hardware. For Apple Silicon deployment with reduced memory, see the MLX quantized variants.

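To see what KV-cache quantization buys at long context, a back-of-envelope sizing sketch (hedged arithmetic: it assumes full multi-head attention over the 3072-wide hidden state in every layer; the card does not say whether M2.7 uses grouped-query attention, which would shrink every number below by the grouping factor):

```python
# Per-token KV-cache size from the property table: 62 layers, hidden size 3072.
layers, hidden = 62, 3072
bytes_fp16, bytes_int4 = 2.0, 0.5

kv_fp16 = 2 * layers * hidden * bytes_fp16  # K and V per token
kv_int4 = 2 * layers * hidden * bytes_int4

print(f"fp16 KV cache: {kv_fp16 / 2**20:.2f} MiB/token")            # ~0.73
print(f"int4 KV cache: {kv_int4 / 2**20:.2f} MiB/token")            # ~0.18
print(f"128k context at int4: {kv_int4 * 131072 / 2**30:.1f} GiB")  # ~23.2
```

Even under these assumptions, the point above holds: until contexts get very long, the weights, not the cache, are the gating cost.
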
## See Also

- [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) -- Base model
- [majentik/MiniMax-M2.7-TurboQuant](https://huggingface.co/majentik/MiniMax-M2.7-TurboQuant) -- TurboQuant KV-cache variant
- [majentik/MiniMax-M2.7-RotorQuant-MLX-8bit](https://huggingface.co/majentik/MiniMax-M2.7-RotorQuant-MLX-8bit) -- MLX 8-bit
- [majentik/MiniMax-M2.7-RotorQuant-MLX-4bit](https://huggingface.co/majentik/MiniMax-M2.7-RotorQuant-MLX-4bit) -- MLX 4-bit
- [majentik/MiniMax-M2.7-RotorQuant-MLX-3bit](https://huggingface.co/majentik/MiniMax-M2.7-RotorQuant-MLX-3bit) -- MLX 3-bit