Harshith Sai Veeraiah committed: Update README.md
True bit-packing via our Triton kernel is required to realize theoretical savings.
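"True bit-packing" means two 4-bit codes share a single byte; a naive uint8 layout stores one code per byte and saves nothing beyond 8-bit storage. A minimal NumPy sketch of the idea (illustrative only; the repo's Triton kernel performs the equivalent packing on the GPU):

```python
import numpy as np

def pack_4bit(q: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (values 0..15) into single uint8 bytes."""
    assert q.size % 2 == 0 and q.max() < 16
    q = q.astype(np.uint8).reshape(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Recover the two 4-bit codes stored in each byte."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)

codes = np.array([3, 15, 0, 7], dtype=np.uint8)
packed = pack_4bit(codes)  # 2 bytes instead of 4
assert packed.nbytes == codes.nbytes // 2
assert np.array_equal(unpack_4bit(packed), codes)
```

The round trip is lossless for the codes themselves; the lossy step is the quantization that produces them.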
## Results

| Model | Method | KV @ 8K | vs FP16 | vs 8-bit | Perplexity | Speed |
|-------|--------|---------|---------|----------|------------|-------|

## Long Context Results

| Context | FP16 | Naive (uint8) | Triton True 4-bit |
|---------|------|---------------|-------------------|

Llama-3-8B FP16 runs out of memory at 32K context. Our Triton method fits.
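The memory gap is visible with back-of-the-envelope arithmetic, assuming Llama-3-8B's published shape (32 layers, 8 grouped-query KV heads, head dim 128) and ignoring quantization scale/zero-point overhead, so the table's exact figures may differ slightly:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bits=16):
    """Bytes needed to cache keys and values for every layer and position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return elems * bits // 8

fp16   = kv_cache_bytes(32_768, bits=16)  # FP16 baseline
naive  = kv_cache_bytes(32_768, bits=8)   # 4-bit codes stored one per uint8
packed = kv_cache_bytes(32_768, bits=4)   # true bit-packing
print(fp16 / 2**30, naive / 2**30, packed / 2**30)  # 4.0 2.0 1.0 (GiB)
```

Naive uint8 storage halves the cache but stops there; only true packing reaches the 4x reduction.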
## The Key Insight
Each cell is one attention head. Darker means more sensitive — needs higher precision.
The variance is massive. Heads in the same layer need completely different treatment.
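One way to act on that variance is to measure each head's quantization error and give the most sensitive heads more bits. The sketch below is a hypothetical illustration of such a per-head budget, not the repo's actual criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, seq, head_dim = 8, 256, 64
keys = rng.normal(size=(n_heads, seq, head_dim))  # stand-in for cached keys

def quant_error(x, bits):
    """Mean round-trip error of symmetric uniform quantization."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.abs(x - np.round(x / scale) * scale).mean()

# Sensitivity proxy: the error each head would suffer at 4 bits.
sensitivity = np.array([quant_error(keys[h], 4) for h in range(n_heads)])

# Give the most sensitive quarter of heads 8 bits, the rest 4.
budget = np.full(n_heads, 4)
budget[np.argsort(sensitivity)[-n_heads // 4:]] = 8
```

A real allocator would calibrate on actual KV activations and could also weight heads by their effect on downstream perplexity.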
## Quick Start

git clone https://github.com/harshithsaiv/kv-cache-compression
cd kv-cache-compression
pip install -r requirements.txt