Switch default model to Qwen3-0.6B (head_dim=128, GQA); bump transformers>=4.51
- README.md +19 -11
- app.py +17 -11
- requirements.txt +1 -1
README.md
CHANGED
@@ -15,9 +15,12 @@ Side-by-side comparison of **bf16 DynamicCache** vs **KakeyaLattice E8**
 compression at three quality levels (Q=10 aggressive, Q=38 balanced,
 Q=152 near-lossless) on a small HuggingFace causal LM.
 
-Default model: `Qwen/…
-…
-…
+Default model: `Qwen/Qwen3-0.6B` (head_dim=128, GQA 16/8 – the same
+attention shape as modern production LLMs, so the codec numbers are
+representative). Runs on the free CPU tier (each "Run comparison"
+click takes ~4–8 minutes on 2 cores). Set the `KAKEYA_DEMO_MODEL`
+env var to use a larger model on a GPU Space (`Qwen/Qwen3-1.7B`,
+`Qwen/Qwen3-4B`).
 
 ## How it works
 
@@ -29,7 +32,7 @@ decode) to every K and V written into the cache.
 from transformers import AutoModelForCausalLM
 from kakeyalattice.hf import KakeyaLatticeCache
 
-model = AutoModelForCausalLM.from_pretrained("Qwen/…
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
 cache = KakeyaLatticeCache(
     variant="e8", q_range=38,
     num_hidden_layers=model.config.num_hidden_layers,
@@ -40,14 +43,19 @@ out = model.generate(input_ids, max_new_tokens=200, past_key_values=cache)
 
 ## What you'll see in the demo
 
-For each prompt, the app generates four times…
+For each prompt, the app generates four times (bits/vec here assume
+head_dim=128 – the bf16 baseline is 2048 bits/vec; exact numbers for
+other head_dims scale proportionally):
 
-| config | bits/…
-| ------------------------ | -----------------…
-| bf16 DynamicCache | …
-| E8 Q=152 near-lossless | ~…
-| E8 Q=38 balanced | ~…
-| E8 Q=10 aggressive | ~…
+| config                   | bits/vec (head_dim=128) | expected quality                  |
+| ------------------------ | ----------------------- | --------------------------------- |
+| bf16 DynamicCache        | 2048 (reference)        | identical to reference            |
+| E8 Q=152 near-lossless   | ~1920 (-6%)             | essentially identical             |
+| E8 Q=38 balanced         | ~880 (-57%)             | ~1% deviation in ppl              |
+| E8 Q=10 aggressive       | ~640 (-69%)             | noticeably different but coherent |
+
+(The percentage savings `-6% / -57% / -69%` are what matter – they are
+fixed by the E8 codec design and do not depend on head_dim.)
 
 Wall-clock latency per config is also reported.
 
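The bits/vec column is plain arithmetic over head_dim: a bf16 vector costs head_dim × 16 bits, so 128 × 16 = 2048. A quick sanity check of the savings column (the ~1920/~880/~640 figures are the README's own numbers, not fresh measurements):

```python
# Back-of-envelope check of the bits/vec table added in this commit.
HEAD_DIM = 128
BF16_BITS = HEAD_DIM * 16  # 2048 bits per K or V vector in bf16

e8_bits = {"Q=152": 1920, "Q=38": 880, "Q=10": 640}  # figures from the table
for q, bits in e8_bits.items():
    saving = 1 - bits / BF16_BITS
    print(f"E8 {q}: {bits} bits/vec, {saving:.0%} smaller than bf16")
# E8 Q=152: 1920 bits/vec, 6% smaller than bf16
# E8 Q=38: 880 bits/vec, 57% smaller than bf16
# E8 Q=10: 640 bits/vec, 69% smaller than bf16
```

Because the saving is a ratio of bit counts, head_dim cancels out, which is why the -6% / -57% / -69% column carries over unchanged to models with other head dimensions.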
app.py
CHANGED
@@ -5,9 +5,10 @@ Run locally:
 python app.py
 
 Deploy to HF Spaces: see ./SPACE_README.md and ./HF_SPACE_DEPLOY.md.
-By default uses …
-free HF Space CPU …
-…
+By default uses Qwen3-0.6B (head_dim=128, GQA 16/8, E8-compatible) –
+it fits on a free HF Space CPU and is architecturally closer to
+production LLMs than Qwen2-0.5B. Swap to Qwen/Qwen3-1.7B or
+Qwen/Qwen3-4B (GPU Space) for faster / longer comparisons.
 
 The demo shows, side-by-side, the same prompt generated under:
 (a) bf16 DynamicCache – reference
@@ -33,7 +34,7 @@ except ImportError as e:
 from kakeyalattice.hf import KakeyaLatticeCache
 
 
-DEFAULT_MODEL = os.environ.get("KAKEYA_DEMO_MODEL", "Qwen/…
+DEFAULT_MODEL = os.environ.get("KAKEYA_DEMO_MODEL", "Qwen/Qwen3-0.6B")
 DEFAULT_PROMPT = "List five countries in Africa:"
 _model_cache: dict = {}
 
@@ -174,13 +175,18 @@ with gr.Blocks(title="KakeyaLattice KV-cache compression") as demo:
 
     gr.Markdown(
         "### About the default model\n\n"
-        f"The default model is **{DEFAULT_MODEL}** (0.…
-        "free HF Space CPU…
-        "…
-        "…
-        "…
-        "…
-        "…
+        f"The default model is **{DEFAULT_MODEL}** (0.6B params, head_dim=128, "
+        "GQA 16/8). It runs on a free HF Space CPU in roughly 4–8 minutes per "
+        "'Run comparison' click (four generations × ~128 tokens each on 2 "
+        "cores). That is slow but deliberate: Qwen3's head_dim=128 + GQA is "
+        "the same shape used by most production LLMs, so the E8 codec numbers "
+        "you see here are representative.\n\n"
+        "Small models can still fall into greedy-decode repetition loops on "
+        "open-ended prompts – that is a property of the **model**, not the "
+        "codec. If you see all four outputs repeating the same phrase, try a "
+        "short, fact-shaped prompt (e.g. \"List five countries in Africa:\"). "
+        "For faster decode / larger context, switch to a GPU Space and set "
+        "`KAKEYA_DEMO_MODEL=Qwen/Qwen3-1.7B` or `Qwen/Qwen3-4B`."
     )
 
     header_out = gr.Markdown("")
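The diff changes `DEFAULT_MODEL` and keeps the `_model_cache` dict, but the loader that consumes them sits outside the hunks. For orientation, a minimal sketch of an env-overridable, memoized loader in the same style (the `_get_model` name, the float32 choice, and the `eval()` call are assumptions for illustration, not code from this repo):

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DEFAULT_MODEL = os.environ.get("KAKEYA_DEMO_MODEL", "Qwen/Qwen3-0.6B")
_model_cache: dict = {}


def _get_model(name: str = DEFAULT_MODEL):
    """Hypothetical helper: load (model, tokenizer) once per model name."""
    if name not in _model_cache:
        tokenizer = AutoTokenizer.from_pretrained(name)
        # float32 is the safe default on the free CPU tier; fast bf16
        # kernels are not a given on 2 shared cores.
        model = AutoModelForCausalLM.from_pretrained(
            name, torch_dtype=torch.float32
        )
        model.eval()  # inference only; disable dropout
        _model_cache[name] = (model, tokenizer)
    return _model_cache[name]
```

With this shape, `KAKEYA_DEMO_MODEL=Qwen/Qwen3-1.7B python app.py` swaps the model without touching code, and repeated "Run comparison" clicks reuse the already-loaded weights.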
requirements.txt
CHANGED
@@ -2,7 +2,7 @@
 # Loose-pinned (>=) so security patches land automatically.
 kakeyalattice[hf]>=1.5.0
 gradio>=4.44
-transformers>=4.…
+transformers>=4.51  # Qwen3ForCausalLM requires 4.51+
 # CPU torch (via --extra-index-url https://download.pytorch.org/whl/cpu in Dockerfile)
 torch>=2.1
 # Deps that transformers pulls but we want explicit for the free-CPU Space
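The `>=4.51` pin only helps at install time; a Space image that already has an older transformers will fail at model load with an unrecognized `qwen3` model type. A fail-fast guard like the following (illustrative, not part of this commit) surfaces the requirement explicitly at startup:

```python
# Illustrative startup guard (not in this commit): surface the
# transformers>=4.51 requirement before Qwen3 loading fails obscurely.
from importlib.metadata import version

from packaging.version import Version  # packaging is already a transformers dep

if Version(version("transformers")) < Version("4.51"):
    raise RuntimeError(
        f"transformers {version('transformers')} is installed, but Qwen3 "
        "models need >=4.51. Run: pip install -U 'transformers>=4.51'"
    )
```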