Upload README.md with huggingface_hub
README.md CHANGED

@@ -59,11 +59,26 @@ on positions 30-128 is on near-zero-weight tail and doesn't affect output.
 **ANN at K ≥ 256 produces lower perplexity than full attention** — the
 sparse-attention denoising effect. Full softmax is forced to spread small
 amounts of weight over a long tail of irrelevant keys; truncating to top-K
-and renormalizing puts the weight where it matters.
-
-
-
-
+and renormalizing puts the weight where it matters.
+
+The novelty is not "ANN beats attention" (denoising is a property of any
+hard top-K selection over softmax, see Top-k Attention / Reformer). What's
+distinctive is that this method produces the denoised top-K at *sub-linear
+cost* via off-the-shelf FAISS HNSW, vs. computing all `O(L²)` scores first.
+
+Sanity checks pass: same input sequences for `ppl_full` vs `ppl_ann`,
+intact causal mask in retrieval, single-softmax renormalization with no
+wrapper leakage between iterations.
+
+### Deployment knobs (L = 4096)
+
+| Use case | K | PPL gap | Attention compute reduction |
+|---|---|---|---|
+| Quality-improving | 256 | −0.79% | ~16× |
+| Quality-improving | 512 | −2.89% | ~8× |
+| Quality-preserving | 128 | +0.82% | ~32× |
+| Aggressive | 64 | +2.42% | ~64× |
+| Speed-only | 32 | +4.51% | ~128× |
 
 Note: the K-sweep recall numbers (24–41%) are not directly comparable to the
 in-training `evaluate()` recall (50.9% at K=128). Same checkpoint, same K,