harshithsaiv committed on
Commit 1b38495 · 1 Parent(s): abfc070

Update Readme

Files changed (1)
  1. README.md +102 -34
README.md CHANGED
@@ -1,59 +1,127 @@
- This is why the Triton kernel is essential – not just faster, but the only way
- to realize the theoretical memory savings from 4-bit quantization.
 
  ---
 
  ## How It Works
 
- **Step 1 – Calibrate (once, ~20 minutes)**
 
  Run 256 WikiText samples through the model. For each attention head measure
- reconstruction error at 4-bit and 8-bit. Save optimal bit allocation to JSON (~1KB).
 
- **Step 2 – Compress (every inference)**
 
- Load the bit allocation. Use Triton kernel to truly pack 4-bit heads (2 values per byte).
- Keep 8-bit heads at full precision. Result: genuine memory reduction.
 
- **Step 3 – Results**
 
- - 2.30x memory reduction on Mistral-7B (vs 2.00x for naive/uniform)
  - 2.04x memory reduction on Llama-3-8B
  - Zero perplexity degradation on both models
- - Same decode speed at 37 tokens/sec
- - Triton kernel is 10-12% faster than naive PyTorch implementation
 
  ---
 
  ## Quick Start
 
- Clone and install:
-
  git clone https://github.com/YOURUSERNAME/kv-cache-compression
  cd kv-cache-compression
  pip install -r requirements.txt
 
- Download Mistral (no approval needed):
-
  hf download mistralai/Mistral-7B-Instruct-v0.3 --local-dir ./mistral-model
 
- Download Llama (requires HuggingFace approval):
-
  hf download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./llama-model
 
- Run full pipeline:
-
  make run-mistral
  make run-llama
  make run-both
 
- Run step by step:
-
- make baseline MODEL=mistral-7b
- make calibrate MODEL=mistral-7b
- make integrate MODEL=mistral-7b
- make benchmark MODEL=mistral-7b
- make benchmark-long MODEL=mistral-7b
  make visualize
 
  ---
@@ -99,14 +167,14 @@ Run step by step:
 
  ## Limitations
 
- - Tested on 7-8B models only. Larger models (70B+) need validation.
- - Calibration uses WikiText-2. Domain-specific calibration may improve results.
  - Integration is HuggingFace only. vLLM integration is planned.
- - Llama-3-8B Triton compression (2.04x) is modest due to high head sensitivity.
 
  ---
 
- ## What's Next
 
  - vLLM PagedAttention integration
  - 32K and 128K context experiments
@@ -114,9 +182,9 @@ Run step by step:
  - Dynamic per-token bit allocation at decode time
  - ArXiv paper with full evaluation
 
- ---
 
- <!-- ## Citation
 
  @misc{kvcache-perhead-2026,
  title = {Per-Head Mixed-Precision KV Cache Compression with True Triton Bit-Packing},
@@ -129,6 +197,6 @@ Run step by step:
 
  ## License
 
- MIT – free to use, modify, and distribute.
 
- Built in one weekend on an A100 SXM4 40GB. Questions, issues, and PRs welcome.
 
+ # Per-Head Mixed-Precision KV Cache Compression
+
+ Calibrate once. Pack truly. Same quality.
+
+ Most KV cache quantization treats every attention head equally. This is wrong.
+ Some heads are 26x more sensitive to quantization than others. We measure this,
+ allocate bits per head, and use a Triton kernel to truly pack 4-bit values,
+ achieving better compression than uniform 8-bit with zero quality loss.
+
+ ---
+
+ ## Key Finding
+
+ Simply storing 4-bit values in uint8 wastes the compression benefit entirely.
+ True bit-packing via our Triton kernel is required to realize the theoretical savings.
+
+ - Naive uint8 storage: same memory as uniform 8-bit (2.0x), so no benefit
+ - Triton true packing: genuine 2.3x compression, with real savings in actual GPU memory
+
+ ---
+
+ ## Results
+
+ ![Memory vs Context](figures/memory_vs_context_4methods.png)
+
+ ![Compression](figures/compression_bar_4methods.png)
+
+ | Model | Method | KV Cache @ 8K | vs FP16 | vs 8-bit | Perplexity | Speed |
+ |-------|--------|---------------|---------|----------|------------|-------|
+ | Mistral-7B | FP16 Baseline | 1073 MB | 1.00x | – | 14.23 | 37.4 t/s |
+ | Mistral-7B | Uniform 8-bit | 537 MB | 2.00x | 1.00x | ~same | ~same |
+ | Mistral-7B | Naive Per-Head (uint8) | 537 MB | 2.00x | 1.00x | ~same | ~same |
+ | Mistral-7B | **Triton True 4-bit (Ours)** | **467 MB** | **2.30x** | **1.15x** | **14.23** | **37.4 t/s** |
+ | Llama-3-8B | FP16 Baseline | 1073 MB | 1.00x | – | 20.70 | 36.8 t/s |
+ | Llama-3-8B | Uniform 8-bit | 537 MB | 2.00x | 1.00x | ~same | ~same |
+ | Llama-3-8B | Naive Per-Head (uint8) | 537 MB | 2.00x | 1.00x | ~same | ~same |
+ | Llama-3-8B | **Triton True 4-bit (Ours)** | **526 MB** | **2.04x** | **1.02x** | **20.70** | **36.8 t/s** |
+
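+ As a sanity check, the FP16 baseline row can be reproduced from the model shapes.
+ The snippet below assumes the published GQA configuration (32 layers, 8 KV heads,
+ head dim 128) shared by Mistral-7B and Llama-3-8B; it is back-of-envelope arithmetic,
+ not part of the repo.
+
+ ```python
+ # FP16 KV cache size at 8K context, assuming 32 layers, 8 KV heads, head_dim 128.
+ layers, kv_heads, head_dim, seq_len, fp16_bytes = 32, 8, 128, 8192, 2
+ kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * fp16_bytes  # K and V
+ print(kv_bytes / 1e6)  # ~1073.7 -> the 1073 MB FP16 row above
+ ```
+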
+ ---
+
+ ## Long Context Results
+
+ ![Long Context](figures/long_context_4methods.png)
+
+ ![32K Memory](figures/memory_32k_4methods.png)
+
+ | Context | FP16 | Naive (uint8) | Triton True 4-bit |
+ |---------|------|---------------|-------------------|
+ | 8K | 1,074 MB | 537 MB (2.0x) | 467 MB (2.3x) |
+ | 16K | 2,147 MB | 1,074 MB (2.0x) | 933 MB (2.3x) |
+ | 32K | 4,295 MB | 2,147 MB (2.0x) | 1,866 MB (2.3x) |
+
+ Llama-3-8B FP16 runs out of memory at 32K context. Our Triton method fits.
+
+ ---
+
+ ## The Key Insight
+
+ ![Sensitivity Heatmap](figures/mistral-7b_sensitivity_heatmap.png)
+
+ Each cell is one attention head. Darker means more sensitive, so that head needs
+ higher precision. The variance is massive: heads in the same layer need completely
+ different treatment. Uniform quantization ignores this entirely.
+
+ ---
+
+ ## Why True Bit-Packing Matters
+
+ Naive implementations store 4-bit values in uint8, one full byte per value:
+ 65,536 values still occupy 65,536 bytes, the same footprint as 8-bit, so there is
+ no additional benefit.
+
+ Our Triton kernel truly packs two 4-bit values per byte: 65,536 values occupy
+ 32,768 bytes. Combined with the heads kept at 8-bit, this is what yields the
+ genuine 2.3x compression measured on actual GPU memory.
+
+ The Triton kernel is not just faster; it is the only way to realize
+ the theoretical memory savings from 4-bit quantization.
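+
+ For illustration, here is a minimal PyTorch sketch of the same packing idea. The
+ repo's kernel does this inside Triton on the GPU; `pack_int4` and `unpack_int4` are
+ hypothetical helper names, not the project's API.
+
+ ```python
+ import torch
+
+ def pack_int4(q: torch.Tensor) -> torch.Tensor:
+     """Pack pairs of 4-bit codes (values 0..15 stored as uint8) into single bytes."""
+     q = q.reshape(-1, 2)                       # two 4-bit codes per output byte
+     return (q[:, 0] | (q[:, 1] << 4)).to(torch.uint8)
+
+ def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
+     """Recover the 4-bit codes from the packed bytes."""
+     lo = packed & 0x0F
+     hi = (packed >> 4) & 0x0F
+     return torch.stack([lo, hi], dim=-1).reshape(-1)
+
+ codes = torch.randint(0, 16, (65536,), dtype=torch.uint8)
+ packed = pack_int4(codes)
+ assert packed.numel() == 32768                 # half the bytes of naive uint8 storage
+ assert torch.equal(unpack_int4(packed), codes)
+ ```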
 
  ---
 
  ## How It Works
 
+ **Step 1: Calibrate (once, about 20 minutes)**
 
  Run 256 WikiText samples through the model. For each attention head, measure
+ reconstruction error at 4-bit and 8-bit. Save the optimal bit allocation to JSON (about 1 KB).
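+
+ In pseudocode, the calibration step boils down to the sketch below. It is a
+ simplified illustration with assumed names (`quantize_dequant`, a fixed error
+ threshold); the repo's actual calibration script may differ.
+
+ ```python
+ import json
+ import torch
+
+ def quantize_dequant(x: torch.Tensor, bits: int) -> torch.Tensor:
+     """Symmetric round-trip quantization, used only to measure reconstruction error."""
+     qmax = 2 ** (bits - 1) - 1
+     scale = x.abs().amax().clamp(min=1e-8) / qmax
+     return (x / scale).round().clamp(-qmax - 1, qmax) * scale
+
+ def allocate_bits(kv_per_head: dict, threshold: float = 1e-3) -> dict:
+     """kv_per_head maps a head id to K/V activations captured from calibration samples."""
+     allocation = {}
+     for head, kv in kv_per_head.items():
+         err_4bit = (quantize_dequant(kv, 4) - kv).pow(2).mean().item()
+         allocation[head] = 4 if err_4bit < threshold else 8  # sensitive heads stay at 8-bit
+     return allocation
+
+ # json.dump(allocate_bits(captured_kv), open("bit_allocation.json", "w"))  # ~1 KB of metadata
+ ```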
 
+ **Step 2: Compress (every inference)**
 
+ Load the bit allocation. Use the Triton kernel to truly pack the 4-bit heads
+ at two values per byte, and keep the remaining heads at 8-bit (one byte per value).
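+
+ Loading the allocation is just reading that JSON back and splitting heads into the
+ two groups; the key format shown here is an assumption for illustration only.
+
+ ```python
+ import json
+
+ allocation = json.load(open("bit_allocation.json"))              # e.g. {"layer0.head3": 4, ...}
+ heads_4bit = [h for h, bits in allocation.items() if bits == 4]  # truly packed by the Triton kernel
+ heads_8bit = [h for h, bits in allocation.items() if bits == 8]  # stored one byte per value
+ print(f"{len(heads_4bit)} heads at 4-bit, {len(heads_8bit)} heads at 8-bit")
+ ```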
 
+ **Step 3: Results**
 
+ - 2.30x memory reduction on Mistral-7B, vs 2.00x for the naive and uniform methods
  - 2.04x memory reduction on Llama-3-8B
  - Zero perplexity degradation on both models
+ - Same decode speed at 37 tokens per second
+ - Triton kernel is 10-12% faster than the naive PyTorch implementation
 
  ---
 
  ## Quick Start
 
  git clone https://github.com/YOURUSERNAME/kv-cache-compression
  cd kv-cache-compression
  pip install -r requirements.txt
 
+ **Download Mistral (no approval needed):**
 
  hf download mistralai/Mistral-7B-Instruct-v0.3 --local-dir ./mistral-model
 
+ **Download Llama (requires HuggingFace approval):**
 
  hf download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./llama-model
 
+ **Run the full pipeline:**
 
  make run-mistral
  make run-llama
  make run-both
 
+ **Or step by step:**
+
+ make baseline MODEL=mistral-7b
+ make calibrate MODEL=mistral-7b
+ make integrate MODEL=mistral-7b
+ make benchmark MODEL=mistral-7b
+ make benchmark-long MODEL=mistral-7b
 
  make visualize
 
  ---
 
 
  ## Limitations
 
+ - Tested on 7-8B models only. Larger models need validation.
+ - Calibration uses WikiText-2. Domain-specific data may improve results.
  - Integration is HuggingFace only. vLLM integration is planned.
+ - Llama-3-8B compression (2.04x) is modest due to higher head sensitivity.
 
  ---
 
+ ## What's Next
 
  - vLLM PagedAttention integration
  - 32K and 128K context experiments
 
  - Dynamic per-token bit allocation at decode time
  - ArXiv paper with full evaluation
 
+ <!-- ---
 
+ ## Citation
 
  @misc{kvcache-perhead-2026,
  title = {Per-Head Mixed-Precision KV Cache Compression with True Triton Bit-Packing},
 
  ## License
 
+ MIT. Free to use, modify, and distribute.
 
+ Built in one week on an A100 SXM4 40GB. Questions, issues, and PRs welcome.