Update README.md

README.md (changed)

app_port: 7860
pinned: false
license: apache-2.0
---

# KakeyaLattice KV-cache compression demo

Side-by-side comparison of **bf16 DynamicCache** vs **KakeyaLattice E8**
compression at three quality levels (Q=10 aggressive, Q=38 balanced,
Q=152 near-lossless) on a small HuggingFace causal LM.

Default model: `Qwen/Qwen2-0.5B` (head_dim=64, E8-compatible, runs on the
free CPU tier). Override the `KAKEYA_DEMO_MODEL` env var to use a larger
model on a GPU Space.
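
For example, a Space's startup code might read the override like this (a
minimal sketch; reading the variable in an `app.py` this way is an
assumption, not something the diff shows):

```python
import os

# Hypothetical startup snippet: fall back to the CPU-friendly default model
# when KAKEYA_DEMO_MODEL is not set.
MODEL_ID = os.environ.get("KAKEYA_DEMO_MODEL", "Qwen/Qwen2-0.5B")
```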

## How it works

`KakeyaLatticeCache` is a drop-in subclass of `transformers.DynamicCache`
that applies a Zamir-Feder nested-lattice codec roundtrip (encode +
decode) to every K and V written into the cache.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kakeyalattice.hf import KakeyaLatticeCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

cache = KakeyaLatticeCache(
    variant="e8", q_range=38,
    # (one constructor argument is not shown in this diff hunk)
    head_dim=model.config.head_dim,
)
out = model.generate(input_ids, max_new_tokens=200, past_key_values=cache)
```
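
The bf16 baseline the demo compares against is just the stock cache; a
minimal sketch, reusing the objects defined above:

```python
from transformers import DynamicCache

# Reference generation with the uncompressed bf16 cache.
baseline = model.generate(
    input_ids, max_new_tokens=200, past_key_values=DynamicCache()
)
print(tokenizer.decode(out[0], skip_special_tokens=True))       # E8 Q=38
print(tokenizer.decode(baseline[0], skip_special_tokens=True))  # bf16 reference
```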

## What you'll see in the demo

For each prompt, the app generates four times:

| Config | Bits/token | Expected quality |
| --- | --- | --- |
| bf16 DynamicCache | 1024 (reference) | identical to reference |
| E8 Q=152 (near-lossless) | ~960 (-6%) | essentially identical |
| E8 Q=38 (balanced) | ~440 (-57%) | ~1% deviation in perplexity |
| E8 Q=10 (aggressive) | ~320 (-69%) | noticeably different but coherent |
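
The percentage savings in the table are straight arithmetic against the
bf16 reference (1024 bits is consistent with head_dim=64 at 16 bits per
element); a quick check:

```python
# Recompute the table's savings column from the bits/token values.
reference_bits = 64 * 16  # bf16 reference: 1024 bits per cached vector
for name, bits in [("E8 Q=152", 960), ("E8 Q=38", 440), ("E8 Q=10", 320)]:
    saving = 1 - bits / reference_bits
    print(f"{name}: ~{bits} bits/token, ~{saving:.0%} smaller")
```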

## Caveats

1. The cache **roundtrips** K/V but stores the reconstructed tensor in
   the model's KV dtype, so the real HBM bytes saved are **nominal**: the
   demo shows reconstruction quality, not memory savings.
2. Decode is ~1.3-2× slower than bf16 because the codec runs as pure
   PyTorch ops.
3. Head-dim must be a power of 2 and divisible by 4 (D4) or 8 (E8); see
   the check sketched below.
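
A minimal sketch of caveat 3's constraint (the helper name is
illustrative, not part of the package):

```python
# Hypothetical helper: head_dim must be a power of 2 and divisible by the
# lattice dimension (4 for the D4 variant, 8 for E8).
def head_dim_compatible(head_dim: int, variant: str = "e8") -> bool:
    lattice_dim = {"d4": 4, "e8": 8}[variant]
    power_of_two = head_dim > 0 and (head_dim & (head_dim - 1)) == 0
    return power_of_two and head_dim % lattice_dim == 0

assert head_dim_compatible(64, "e8")      # Qwen2-0.5B: head_dim=64
assert not head_dim_compatible(96, "e8")  # 96 is not a power of 2
```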

## Links

- Package: https://pypi.org/project/kakeyalattice/
- Repo: https://github.com/FluffyAIcode/LLM-KV--Cache-compress
- Paper: [reports/paper/](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/paper)