Harshith Sai Veeraiah committed
Commit 38e5430 · verified · 1 Parent(s): 598ef59

Update README.md

Files changed (1): README.md +6 -6
README.md CHANGED

@@ -21,9 +21,9 @@ True bit-packing via our Triton kernel is required to realize theoretical saving
 
 ## Results
 
-![Memory vs Context](figures/memory_vs_context_4methods.png)
+![Memory vs Context](memory_vs_context_4methods.png)
 
-![Compression](figures/compression_bar_4methods.png)
+![Compression](compression_bar_4methods.png)
 
 | Model | Method | KV @ 8K | vs FP16 | vs 8-bit | Perplexity | Speed |
 |-------|--------|---------|---------|---------|------------|-------|
@@ -40,9 +40,9 @@ True bit-packing via our Triton kernel is required to realize theoretical saving
 
 ## Long Context Results
 
-![Long Context](figures/long_context_4methods.png)
+![Long Context](long_context_4methods.png)
 
-![32K Memory](figures/memory_32k_4methods.png)
+![32K Memory](memory_32k_4methods.png)
 
 | Context | FP16 | Naive (uint8) | Triton True 4-bit |
 |---------|------|---------------|-------------------|
@@ -56,7 +56,7 @@ Llama-3-8B FP16 runs out of memory at 32K context. Our Triton method fits.
 
 ## The Key Insight
 
-![Sensitivity Heatmap](figures/mistral-7b_sensitivity_heatmap.png)
+![Sensitivity Heatmap](mistral-7b_sensitivity_heatmap.png)
 
 Each cell is one attention head. Darker means more sensitive — needs higher precision.
 The variance is massive. Heads in the same layer need completely different treatment.
@@ -101,7 +101,7 @@ Step 3 — Results
 
 ## Quick Start
 
-git clone https://github.com/YOURUSERNAME/kv-cache-compression
+git clone https://github.com/harshithsaiv/kv-cache-compression
 cd kv-cache-compression
 pip install -r requirements.txt
 
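The hunk context above notes that true bit-packing via the Triton kernel is required to realize the theoretical savings, and the README's tables contrast "Naive (uint8)" with "Triton True 4-bit" storage. A minimal sketch of that distinction in plain PyTorch (helper names here are hypothetical; this is not the repository's Triton kernel): storing one 4-bit code per uint8 element still costs a full byte per value, the same as 8-bit, while packing two codes per byte halves the footprint.

```python
import torch

# Hypothetical sketch (not the repo's Triton kernel): naive uint8 storage keeps
# one 4-bit code per byte, so it uses exactly as much memory as 8-bit storage.
# True bit-packing stores two 4-bit codes per byte and halves that footprint.

def quantize_4bit(x: torch.Tensor):
    """Asymmetric 4-bit quantization: map values to integer codes in [0, 15]."""
    x = x.float()  # compute the affine mapping in FP32 for stability
    lo, hi = x.min(), x.max()
    scale = ((hi - lo) / 15).clamp_min(1e-8)
    codes = torch.round((x - lo) / scale).clamp(0, 15).to(torch.uint8)
    return codes, scale, lo

def pack_4bit(codes: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit codes into single bytes: [a, b] -> (a << 4) | b."""
    flat = codes.flatten()
    assert flat.numel() % 2 == 0, "pad to an even length before packing"
    return (flat[0::2] << 4) | flat[1::2]

def unpack_4bit(packed: torch.Tensor) -> torch.Tensor:
    """Invert pack_4bit, restoring the original interleaved code order."""
    return torch.stack([packed >> 4, packed & 0x0F], dim=1).flatten()

# Toy KV cache: (batch, heads, seq_len, head_dim) in FP16.
kv = torch.randn(1, 8, 8192, 128, dtype=torch.float16)
codes, scale, zero_point = quantize_4bit(kv)
packed = pack_4bit(codes)

fp16_bytes = kv.numel() * kv.element_size()   # 2 bytes per value
naive_bytes = codes.numel()                   # 1 byte per 4-bit code
packed_bytes = packed.numel()                 # 0.5 bytes per code
print(fp16_bytes, naive_bytes, packed_bytes)  # 4x vs FP16 only when packed

assert torch.equal(unpack_4bit(packed), codes.flatten())
```

The final assert confirms that packing and unpacking are lossless over the codes: all quantization error comes from the 4-bit rounding itself, so the packed layout changes memory use, not accuracy.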