---
base_model: MiniMaxAI/MiniMax-M2.7
library_name: transformers
pipeline_tag: text-generation
license: other
license_name: minimax-model-license
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
tags:
- minimax
- m2.7
- moe
- quantized
- rotorquant
- kv-cache-quantization
---

# MiniMax-M2.7-RotorQuant

**KV-cache quantized variant of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) using RotorQuant compression.**

## Overview

MiniMax-M2.7 is a 256-expert Mixture-of-Experts (MoE) model with 8 experts active per token, totaling approximately 456 billion parameters. This variant applies **RotorQuant** KV-cache quantization, which uses Hadamard rotation transforms to spread outlier magnitudes across channels before quantizing the KV cache.

Rotating keys and values with a Hadamard transform before quantization smooths the activation distribution, so no single outlier channel dominates the quantization scale. This retains quality better than naive per-channel methods, especially at aggressive bit-widths.

| Property | Value |
|---|---|
| Architecture | MoE (256 experts, 8 active/token) |
| Total Parameters | ~456B |
| Layers | 62 |
| Hidden Size | 3072 |
| Attention Heads | 48 |
| Quantization | RotorQuant (KV-cache) |
| Base Model | MiniMaxAI/MiniMax-M2.7 |

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "majentik/MiniMax-M2.7-RotorQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Enable RotorQuant (IsoQuant) KV-cache quantization
from transformers import IsoQuantCache

past_key_values = IsoQuantCache(model.config)

messages = [{"role": "user", "content": "What is a Comprehensive Geriatric Assessment?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    past_key_values=past_key_values,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## RotorQuant vs TurboQuant

| Feature | RotorQuant | TurboQuant |
|---|---|---|
| Technique | Rotation-based KV quantization (Hadamard transform) | Asymmetric per-channel KV quantization |
| Throughput | Slightly lower throughput (rotation overhead) | Higher throughput, lower latency |
| Quality | Better quality preservation at low bit-widths | Good quality preservation |
| Best For | Quality-sensitive tasks, research | High-throughput serving, long contexts |

## Memory Estimates (Apple Silicon)

At ~456B parameters, the model weights dominate memory usage even with KV-cache quantization; RotorQuant primarily reduces long-context inference overhead. The figures below cover weights only.

| Configuration | Estimated Memory |
|---|---|
| FP16 weights + RotorQuant KV | ~912 GB |
| 8-bit weights + RotorQuant KV | ~456 GB |
| 4-bit weights + RotorQuant KV | ~228 GB |

> **Note**: This model requires substantial hardware. For Apple Silicon deployment with reduced memory, see the MLX quantized variants.

## See Also

- [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) -- Base model
- [majentik/MiniMax-M2.7-TurboQuant](https://huggingface.co/majentik/MiniMax-M2.7-TurboQuant) -- TurboQuant KV-cache variant
- [majentik/MiniMax-M2.7-RotorQuant-MLX-8bit](https://huggingface.co/majentik/MiniMax-M2.7-RotorQuant-MLX-8bit) -- MLX 8-bit
- [majentik/MiniMax-M2.7-RotorQuant-MLX-4bit](https://huggingface.co/majentik/MiniMax-M2.7-RotorQuant-MLX-4bit) -- MLX 4-bit
- [majentik/MiniMax-M2.7-RotorQuant-MLX-3bit](https://huggingface.co/majentik/MiniMax-M2.7-RotorQuant-MLX-3bit) -- MLX 3-bit