cryptobiosis committed
Commit 3ef5450 · verified · 1 Parent(s): a80df25

Update README.md

Files changed (1)
  1. README.md +27 -17
README.md CHANGED
@@ -8,27 +8,27 @@ app_port: 7860
 pinned: false
 license: apache-2.0
 ---
-
+
 # KakeyaLattice KV-cache compression demo
-
+
 Side-by-side comparison of **bf16 DynamicCache** vs **KakeyaLattice E8**
 compression at three quality levels (Q=10 aggressive, Q=38 balanced,
 Q=152 near-lossless) on a small HuggingFace causal LM.
-
+
 Default model: `Qwen/Qwen2-0.5B` (head_dim=64, E8-compatible, runs on
 free CPU tier). Override `KAKEYA_DEMO_MODEL` env var to use a larger
 model on a GPU Space.
-
+
 ## How it works
-
+
 `KakeyaLatticeCache` is a drop-in subclass of `transformers.DynamicCache`
 that applies a Zamir-Feder nested-lattice codec roundtrip (encode +
 decode) to every K and V written into the cache.
-
+
-python
+```python
 from transformers import AutoModelForCausalLM
 from kakeyalattice.hf import KakeyaLatticeCache
-
+
 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
 cache = KakeyaLatticeCache(
 variant="e8", q_range=38,
@@ -36,19 +36,29 @@ cache = KakeyaLatticeCache(
 head_dim=model.config.head_dim,
 )
 out = model.generate(input_ids, max_new_tokens=200, past_key_values=cache)
-
-
+```
+
+## What you'll see in the demo
+
+For each prompt, the app generates four times:
+
+| config | bits/token | expected quality |
+| --- | --- | --- |
+| bf16 DynamicCache | 1024 (reference) | identical to reference |
+| E8 Q=152 near-lossless | ~960 (-6%) | essentially identical |
+| E8 Q=38 balanced | ~440 (-57%) | ~1% deviation in ppl |
+| E8 Q=10 aggressive | ~320 (-69%) | noticeably different but coherent |
+
 ## Caveats
-
+
 1. The cache **roundtrips** K/V but stores the reconstructed tensor in
-the model's KV dtype. Real HBM bytes saved are nominal — the demo's
-value is showing reconstruction quality, not memory savings.
+the model's KV dtype. Real HBM bytes saved are **nominal**.
 2. Decode is ~1.3-2× slower than bf16 because the codec runs as pure
-PyTorch ops. A fused Triton kernel would close this gap.
+PyTorch ops.
 3. Head-dim must be a power of 2 and divisible by 4 (D4) or 8 (E8).
-
+
 ## Links
-
+
 - Package: https://pypi.org/project/kakeyalattice/
 - Repo: https://github.com/FluffyAIcode/LLM-KV--Cache-compress
-- Paper: [reports/paper/](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/paper)
+- Paper: [reports/paper/](https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/paper)
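On the `KAKEYA_DEMO_MODEL` override the updated README mentions: a minimal sketch of how the Space presumably resolves its model id. The variable name and the `Qwen/Qwen2-0.5B` default come from the README; reading them via `os.environ` is an assumption about the app's code.

```python
# Sketch only: the README documents KAKEYA_DEMO_MODEL and the
# Qwen/Qwen2-0.5B default; resolving them with os.environ is an
# assumption about how the Space's app code picks its model.
import os

model_id = os.environ.get("KAKEYA_DEMO_MODEL", "Qwen/Qwen2-0.5B")
```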
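The added "What you'll see in the demo" section says the app generates four times per prompt. Below is a minimal sketch of that comparison using only the API the README shows (`KakeyaLatticeCache` from `kakeyalattice.hf` with `variant`, `q_range`, and `head_dim`); it is not the Space's actual app code, and since the diff skips one unchanged constructor line between its two hunks, the real constructor may take an argument not shown here.

```python
# Sketch of the demo's four-way comparison: one bf16 reference run,
# then three KakeyaLattice E8 runs at the README's quality levels.
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

from kakeyalattice.hf import KakeyaLatticeCache

model_id = os.environ.get("KAKEYA_DEMO_MODEL", "Qwen/Qwen2-0.5B")
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
input_ids = tok("Lattices pack spheres", return_tensors="pt").input_ids

def generate_with(cache):
    out = model.generate(input_ids, max_new_tokens=50,
                         past_key_values=cache, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)

print("bf16 reference:", generate_with(DynamicCache()))
for q in (152, 38, 10):  # near-lossless, balanced, aggressive
    # The diff elides one unchanged constructor line, so the real call
    # may need an additional argument beyond the three shown.
    cache = KakeyaLatticeCache(variant="e8", q_range=q,
                               head_dim=model.config.head_dim)
    print(f"E8 Q={q}:", generate_with(cache))
```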
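The table's bits/token column is consistent with the reference being one `head_dim=64` vector in bf16, i.e. 64 × 16 = 1024 bits. Under that (assumed) reading, the savings percentages check out:

```python
# Sanity check of the README table's savings column, assuming
# "bits/token" counts one head_dim=64 vector in bf16 (16 bits/dim).
reference = 64 * 16  # 1024 bits/token

for name, bits in [("E8 Q=152", 960), ("E8 Q=38", 440), ("E8 Q=10", 320)]:
    print(f"{name}: ~{bits} bits/token, -{1 - bits / reference:.0%}")
# -> -6%, -57%, -69%, matching the table
```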
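Caveat 3's compatibility rule, restated as a runnable check; `head_dim_ok` is a hypothetical helper for illustration, not part of the `kakeyalattice` API.

```python
# Caveat 3: head_dim must be a power of 2 and divisible by the lattice
# block size (4 for D4, 8 for E8). Hypothetical helper for illustration.
def head_dim_ok(head_dim: int, variant: str) -> bool:
    power_of_two = head_dim > 0 and head_dim & (head_dim - 1) == 0
    return power_of_two and head_dim % {"d4": 4, "e8": 8}[variant] == 0

assert head_dim_ok(64, "e8")      # Qwen2-0.5B: head_dim=64, E8-compatible
assert not head_dim_ok(96, "e8")  # 96 = 32*3 is not a power of 2
```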