datasysdev committed 92ec570 · verified · 1 Parent(s): 8f87cd2

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +20 -5
README.md CHANGED
@@ -59,11 +59,26 @@ on positions 30-128 is on near-zero-weight tail and doesn't affect output.
  **ANN at K ≥ 256 produces lower perplexity than full attention** — the
  sparse-attention denoising effect. Full softmax is forced to spread small
  amounts of weight over a long tail of irrelevant keys; truncating to top-K
- and renormalizing puts the weight where it matters. The smooth monotonic
- trend (no discontinuous jumps) is consistent with this explanation, and the
- sanity checks (same input sequences for `ppl_full` vs `ppl_ann`, intact
- causal mask in retrieval, single-softmax renormalization with no wrapper
- leakage between iterations) confirm the result is real.
+ and renormalizing puts the weight where it matters.
+
+ The novelty is not "ANN beats attention" (denoising is a property of any
+ hard top-K selection over softmax; see Top-k Attention / Reformer). What's
+ distinctive is that this method produces the denoised top-K at *sub-linear
+ cost* via off-the-shelf FAISS HNSW, vs. computing all `O(L²)` scores first.
+
+ Sanity checks pass: same input sequences for `ppl_full` vs `ppl_ann`,
+ intact causal mask in retrieval, single-softmax renormalization with no
+ wrapper leakage between iterations.
+
+ ### Deployment knobs (L = 4096)
+
+ | Use case | K | PPL gap | Attention compute reduction |
+ |---|---|---|---|
+ | Quality-improving | 256 | −0.79% | ~16× |
+ | Quality-improving | 512 | −2.89% | ~8× |
+ | Quality-preserving | 128 | +0.82% | ~32× |
+ | Aggressive | 64 | +2.42% | ~64× |
+ | Speed-only | 32 | +4.51% | ~128× |

  Note: the K-sweep recall numbers (24–41%) are not directly comparable to the
  in-training `evaluate()` recall (50.9% at K=128). Same checkpoint, same K,
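
For readers who want the mechanics behind the diff, here is a minimal NumPy sketch of the truncate-and-renormalize step. It is illustrative only, not this repo's code. It also shows why the "single-softmax renormalization" sanity check matters: masking every non-top-K score to −inf before one softmax is exactly truncation plus renormalization, with no second normalization pass that could leak weight between iterations.

```python
import numpy as np

def full_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Ordinary softmax over all causally valid key scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def topk_attention_weights(scores: np.ndarray, k: int) -> np.ndarray:
    """Mask all but the K largest scores, then a single softmax.

    Equivalent to truncating the full distribution to its top-K entries
    and renormalizing them to sum to 1.
    """
    keep = np.argpartition(scores, -k)[-k:]    # indices of the K largest scores
    masked = np.full_like(scores, -np.inf)
    masked[keep] = scores[keep]
    e = np.exp(masked - scores[keep].max())    # exp(-inf) == 0 for dropped keys
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=4096)                 # one query row, L = 4096
w_full = full_attention_weights(scores)
w_topk = topk_attention_weights(scores, k=256)

# The mass full softmax spreads over the 4096 - 256 tail keys is what
# truncation reassigns to the 256 kept keys.
tail_mass = 1.0 - w_full[np.argsort(scores)[-256:]].sum()
print(f"tail mass under full softmax: {tail_mass:.4f}")
print(f"top-K weights sum to: {w_topk.sum():.4f}")
```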
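And a hedged sketch of the retrieval side. The index type, metric, and parameters below (`IndexHNSWFlat`, inner-product metric, `M = 32`) are plausible choices, not confirmed details of this repo. The point is that an off-the-shelf FAISS HNSW index returns an approximate top-K per query at sub-linear cost, and growing the index one position at a time keeps the causal mask intact, since position t can only ever retrieve keys 0..t.

```python
# Illustrative sketch, not this repo's code: causal top-K key retrieval with
# an off-the-shelf FAISS HNSW index (assumes faiss-cpu; inner-product metric
# so retrieved neighbours are the largest q·k attention logits).
import faiss
import numpy as np

d, L, K, M = 64, 4096, 256, 32      # head dim, seq len, top-K, HNSW fanout (illustrative)
rng = np.random.default_rng(0)
queries = rng.normal(size=(L, d)).astype("float32")
keys = rng.normal(size=(L, d)).astype("float32")

index = faiss.IndexHNSWFlat(d, M, faiss.METRIC_INNER_PRODUCT)

topk_ids = []
for t in range(L):
    index.add(keys[t : t + 1])      # key t becomes searchable at step t
    n = min(K, t + 1)               # fewer than K candidates early in the sequence
    _, ids = index.search(queries[t : t + 1], n)
    topk_ids.append(ids[0])         # approximate top-n causal keys for position t

# Attention then applies one softmax over just these retrieved scores
# instead of materializing all O(L^2) query-key products.
```

Each HNSW search visits a small neighbourhood of the graph rather than all t stored keys, which is what makes the top-K genuinely sub-linear instead of a by-product of computing every score first.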