---
title: KakeyaLattice KV-cache compression
emoji: 📐
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---

# KakeyaLattice KV-cache compression demo

Side-by-side comparison of **bf16 DynamicCache** vs **KakeyaLattice E8**
compression at three quality levels (Q=10 aggressive, Q=38 balanced,
Q=152 near-lossless) on a small Hugging Face causal LM.

Default model: `Qwen/Qwen2-0.5B` (head_dim=64, E8-compatible, runs on the
free CPU tier). Set the `KAKEYA_DEMO_MODEL` environment variable to use a
larger model on a GPU Space.
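
The demo's actual startup code is not reproduced in this README, so the
variable-reading pattern below is an assumption — a minimal sketch of how
the override would typically be resolved:

```python
import os

# Hypothetical startup snippet: resolve the model ID from the
# KAKEYA_DEMO_MODEL environment variable, falling back to the
# documented CPU-tier default. Not the demo's actual code.
DEFAULT_MODEL = "Qwen/Qwen2-0.5B"
model_id = os.environ.get("KAKEYA_DEMO_MODEL", DEFAULT_MODEL)
print(model_id)
```

On a Docker Space the variable is set under the Space's settings
(Variables and secrets) rather than in code.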

## How it works

`KakeyaLatticeCache` is a drop-in subclass of `transformers.DynamicCache`
that applies a Zamir-Feder nested-lattice codec roundtrip (encode +
decode) to every K and V tensor written into the cache.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kakeyalattice.hf import KakeyaLatticeCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
input_ids = tokenizer("Hello", return_tensors="pt").input_ids

cache = KakeyaLatticeCache(
    variant="e8", q_range=38,
    num_hidden_layers=model.config.num_hidden_layers,
    head_dim=model.config.head_dim,
)
out = model.generate(input_ids, max_new_tokens=200, past_key_values=cache)
```
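
The codec internals are not shown in this README. As an illustration of
the E8 primitive that a nested-lattice scheme builds on, here is a
minimal NumPy sketch of nearest-point quantization onto a scaled E8
lattice; the function names and the step size `beta` are illustrative,
not the package's API:

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8 = {v in Z^8 : sum(v) even} (Conway-Sloane)."""
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        # Parity is odd: re-round the worst coordinate to its
        # next-nearest integer to land back on D8.
        i = np.argmax(np.abs(x - f))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2): pick the closer coset."""
    y0 = nearest_d8(x)
    y1 = nearest_d8(x - 0.5) + 0.5
    return y0 if np.sum((x - y0) ** 2) <= np.sum((x - y1) ** 2) else y1

# Quantize one 8-dim slice of a head: scale down, snap to E8, scale up.
rng = np.random.default_rng(0)
v = rng.normal(size=8)
beta = 0.25                      # step size; smaller = finer quantization
v_hat = beta * nearest_e8(v / beta)
print(np.abs(v - v_hat).max())   # reconstruction error, bounded by beta
```

The error bound follows from E8's covering radius of 1: after scaling,
no input is farther than `beta` from its quantized value.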

## Caveats

1. The cache **roundtrips** K/V but stores the reconstructed tensor in
   the model's KV dtype, so the real HBM bytes saved are nominal. The
   demo's value is in showing reconstruction quality, not memory savings.
2. Decode is ~1.3-2× slower than bf16 because the codec runs as pure
   PyTorch ops; a fused Triton kernel would close this gap.
3. `head_dim` must be a power of 2 and divisible by 4 (D4) or 8 (E8).
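
The last constraint is cheap to verify before building the cache. A
sketch with a hypothetical helper (not part of the package's API):

```python
# Illustrative check mirroring caveat 3; not the package's own validation.
def head_dim_ok(head_dim: int, variant: str) -> bool:
    block = {"d4": 4, "e8": 8}[variant]   # lattice dimension per variant
    is_pow2 = head_dim > 0 and (head_dim & (head_dim - 1)) == 0
    return is_pow2 and head_dim % block == 0

print(head_dim_ok(64, "e8"))   # True: power of 2 and divisible by 8
print(head_dim_ok(96, "e8"))   # False: 96 is not a power of 2
```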

## Links

- Package: https://pypi.org/project/kakeyalattice/
- Repo: https://github.com/FluffyAIcode/LLM-KV--Cache-compress
- Paper: [reports/paper/](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/paper)