Harshith Sai Veeraiah committed: Update README.md
True bit-packing via our Triton kernel is required to realize theoretical savings.
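"True bit-packing" means two 4-bit codes share a single byte; a naive uint8 layout stores one code per byte and saves nothing beyond 8-bit storage. A minimal NumPy sketch of the idea (illustrative only; the repo's Triton kernel performs the equivalent packing on the GPU):

```python
import numpy as np

def pack_4bit(q: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (values 0..15) into single uint8 bytes."""
    assert q.size % 2 == 0 and q.max() < 16
    q = q.astype(np.uint8).reshape(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Recover the two 4-bit codes stored in each byte."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)

codes = np.array([3, 15, 0, 7], dtype=np.uint8)
packed = pack_4bit(codes)  # 2 bytes instead of 4
assert packed.nbytes == codes.nbytes // 2
assert np.array_equal(unpack_4bit(packed), codes)
```

The round trip is lossless for the codes themselves; the lossy step is the quantization that produces them.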
## Results

| Model | Method | KV @ 8K | vs FP16 | vs 8-bit | Perplexity | Speed |
|-------|--------|---------|---------|----------|------------|-------|

## Long Context Results

| Context | FP16 | Naive (uint8) | Triton True 4-bit |
|---------|------|---------------|-------------------|

Llama-3-8B FP16 runs out of memory at 32K context. Our Triton method fits.
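The memory gap is visible with back-of-the-envelope arithmetic, assuming Llama-3-8B's published shape (32 layers, 8 grouped-query KV heads, head dim 128) and ignoring quantization scale/zero-point overhead, so the table's exact figures may differ slightly:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bits=16):
    """Bytes needed to cache keys and values for every layer and position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return elems * bits // 8

fp16   = kv_cache_bytes(32_768, bits=16)  # FP16 baseline
naive  = kv_cache_bytes(32_768, bits=8)   # 4-bit codes stored one per uint8
packed = kv_cache_bytes(32_768, bits=4)   # true bit-packing
print(fp16 / 2**30, naive / 2**30, packed / 2**30)  # 4.0 2.0 1.0 (GiB)
```

Naive uint8 storage halves the cache but stops there; only true packing reaches the 4x reduction.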
## The Key Insight
Each cell is one attention head. Darker means more sensitive — needs higher precision.
The variance is massive. Heads in the same layer need completely different treatment.
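One way to act on that variance is to measure each head's quantization error and give the most sensitive heads more bits. The sketch below is a hypothetical illustration of such a per-head budget, not the repo's actual criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, seq, head_dim = 8, 256, 64
keys = rng.normal(size=(n_heads, seq, head_dim))  # stand-in for cached keys

def quant_error(x, bits):
    """Mean round-trip error of symmetric uniform quantization."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.abs(x - np.round(x / scale) * scale).mean()

# Sensitivity proxy: the error each head would suffer at 4 bits.
sensitivity = np.array([quant_error(keys[h], 4) for h in range(n_heads)])

# Give the most sensitive quarter of heads 8 bits, the rest 4.
budget = np.full(n_heads, 4)
budget[np.argsort(sensitivity)[-n_heads // 4:]] = 8
```

A real allocator would calibrate on actual KV activations and could also weight heads by their effect on downstream perplexity.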
## Quick Start

git clone https://github.com/harshithsaiv/kv-cache-compression
cd kv-cache-compression
pip install -r requirements.txt