---
title: KakeyaLattice KV-cache compression
emoji: 📐
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---

# KakeyaLattice KV-cache compression demo

Side-by-side comparison of **bf16 DynamicCache** vs **KakeyaLattice E8**
compression at three quality levels (Q=10 aggressive, Q=38 balanced,
Q=152 near-lossless) on a small Hugging Face causal LM.

Default model: `Qwen/Qwen2-0.5B` (head_dim=64, E8-compatible, runs on the
free CPU tier). Set the `KAKEYA_DEMO_MODEL` environment variable to use a
larger model on a GPU Space.
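
The demo's actual startup code is not reproduced in this README, so the
variable-reading pattern below is an assumption — a minimal sketch of how
the override would typically be resolved:

```python
import os

# Hypothetical startup snippet: resolve the model ID from the
# KAKEYA_DEMO_MODEL environment variable, falling back to the
# documented CPU-tier default. Not the demo's actual code.
DEFAULT_MODEL = "Qwen/Qwen2-0.5B"
model_id = os.environ.get("KAKEYA_DEMO_MODEL", DEFAULT_MODEL)
print(model_id)
```

On a Docker Space the variable is set under the Space's settings
(Variables and secrets) rather than in code.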

## How it works

`KakeyaLatticeCache` is a drop-in subclass of `transformers.DynamicCache`
that applies a Zamir-Feder nested-lattice codec roundtrip (encode +
decode) to every K and V tensor written into the cache.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kakeyalattice.hf import KakeyaLatticeCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
input_ids = tokenizer("Hello", return_tensors="pt").input_ids

cache = KakeyaLatticeCache(
    variant="e8", q_range=38,
    num_hidden_layers=model.config.num_hidden_layers,
    head_dim=model.config.head_dim,
)
out = model.generate(input_ids, max_new_tokens=200, past_key_values=cache)
```
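
The codec internals are not shown in this README. As an illustration of
the E8 primitive that a nested-lattice scheme builds on, here is a
minimal NumPy sketch of nearest-point quantization onto a scaled E8
lattice; the function names and the step size `beta` are illustrative,
not the package's API:

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8 = {v in Z^8 : sum(v) even} (Conway-Sloane)."""
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        # Parity is odd: re-round the worst coordinate to its
        # next-nearest integer to land back on D8.
        i = np.argmax(np.abs(x - f))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2): pick the closer coset."""
    y0 = nearest_d8(x)
    y1 = nearest_d8(x - 0.5) + 0.5
    return y0 if np.sum((x - y0) ** 2) <= np.sum((x - y1) ** 2) else y1

# Quantize one 8-dim slice of a head: scale down, snap to E8, scale up.
rng = np.random.default_rng(0)
v = rng.normal(size=8)
beta = 0.25                      # step size; smaller = finer quantization
v_hat = beta * nearest_e8(v / beta)
print(np.abs(v - v_hat).max())   # reconstruction error, bounded by beta
```

The error bound follows from E8's covering radius of 1: after scaling,
no input is farther than `beta` from its quantized value.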

## Caveats

1. The cache **roundtrips** K/V but stores the reconstructed tensor in
   the model's KV dtype, so the real HBM bytes saved are nominal. The
   demo's value is in showing reconstruction quality, not memory savings.
2. Decode is ~1.3-2× slower than bf16 because the codec runs as pure
   PyTorch ops; a fused Triton kernel would close this gap.
3. `head_dim` must be a power of 2 and divisible by 4 (D4) or 8 (E8).
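
The last constraint is cheap to verify before building the cache. A
sketch with a hypothetical helper (not part of the package's API):

```python
# Illustrative check mirroring caveat 3; not the package's own validation.
def head_dim_ok(head_dim: int, variant: str) -> bool:
    block = {"d4": 4, "e8": 8}[variant]   # lattice dimension per variant
    is_pow2 = head_dim > 0 and (head_dim & (head_dim - 1)) == 0
    return is_pow2 and head_dim % block == 0

print(head_dim_ok(64, "e8"))   # True: power of 2 and divisible by 8
print(head_dim_ok(96, "e8"))   # False: 96 is not a power of 2
```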

## Links

- Package: https://pypi.org/project/kakeyalattice/
- Repo: https://github.com/FluffyAIcode/LLM-KV--Cache-compress
- Paper: [reports/paper/](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/paper)