Commit 3f31c94 (verified) by cryptobiosis, parent 49d3638: Update README.md (+49 −7)
---
title: KakeyaLattice KV-cache compression
emoji: 📐
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
# KakeyaLattice KV-cache compression demo

Side-by-side comparison of **bf16 DynamicCache** vs. **KakeyaLattice E8**
compression at three quality levels (Q=10 aggressive, Q=38 balanced,
Q=152 near-lossless) on a small Hugging Face causal LM.

Default model: `Qwen/Qwen2-0.5B` (head_dim=64, E8-compatible, runs on the
free CPU tier). Set the `KAKEYA_DEMO_MODEL` env var to use a larger model
on a GPU Space.
## How it works

`KakeyaLatticeCache` is a drop-in subclass of `transformers.DynamicCache`
that applies a Zamir-Feder nested-lattice codec roundtrip (encode +
decode) to every K and V tensor written into the cache.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kakeyalattice.hf import KakeyaLatticeCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Tokenize a prompt so `input_ids` is defined for generate().
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
input_ids = tokenizer("Hello", return_tensors="pt").input_ids

cache = KakeyaLatticeCache(
    variant="e8", q_range=38,
    num_hidden_layers=model.config.num_hidden_layers,
    head_dim=model.config.head_dim,
)
out = model.generate(input_ids, max_new_tokens=200, past_key_values=cache)
```
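The package's codec is not reproduced here, but its core geometric step, snapping a scaled vector to the nearest E8 lattice point, can be sketched with the classic Conway-Sloane search (E8 is the union of D8 and D8 shifted by 1/2). Everything below, including the function names and the toy `roundtrip` scaling, is an illustrative assumption, not the `kakeyalattice` implementation:

```python
import numpy as np

def _nearest_d8(x):
    # Nearest point of D8: integer vector with even coordinate sum.
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity is odd: re-round the coordinate with the largest
        # rounding error in the other direction to restore even parity.
        k = int(np.argmax(np.abs(x - f)))
        f[k] += 1.0 if x[k] > f[k] else -1.0
    return f

def nearest_e8(x):
    # E8 = D8 ∪ (D8 + 1/2): take the closer of the two coset candidates.
    a = _nearest_d8(x)
    b = _nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

def roundtrip(v, q):
    # Toy "encode + decode": scale, snap to the nearest E8 point, unscale.
    return nearest_e8(v * q) / q
```

Because E8 has covering radius 1, the roundtrip error of an 8-dim block is bounded by `1/q`, which is why larger Q values in the demo are closer to lossless.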
40
+
41
+ ## Caveats
42
+
43
+ 1. The cache **roundtrips** K/V but stores the reconstructed tensor in
44
+ the model's KV dtype. Real HBM bytes saved are nominal — the demo's
45
+ value is showing reconstruction quality, not memory savings.
46
+ 2. Decode is ~1.3-2× slower than bf16 because the codec runs as pure
47
+ PyTorch ops. A fused Triton kernel would close this gap.
48
+ 3. Head-dim must be a power of 2 and divisible by 4 (D4) or 8 (E8).
49
+
## Links

- Package: https://pypi.org/project/kakeyalattice/
- Repo: https://github.com/FluffyAIcode/LLM-KV--Cache-compress
- Paper: [reports/paper/](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/paper)