datasysdev committed on
Commit 720eddc · verified · 1 Parent(s): b42f744

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +30 -13
README.md CHANGED
@@ -19,6 +19,24 @@ Search projections for ANN-substituted attention on

  Code: [github.com/unixsysdev/ann-sparseattention](https://github.com/unixsysdev/ann-sparseattention)

+ ## Current status
+
+ Research prototype: the trained projections work, the runtime is a correctness
+ prototype, and the eval envelope is narrow. Treat reported numbers as preliminary.
+
+ **Validated:** 6-layer pilot on Qwen3-4B-Instruct-2507; WikiText-103 PPL
+ preserved at K=128 (gap ≈ +0.7%); learned projections retrieve
+ attention-relevant keys.
+
+ **Not yet validated:** 34-layer / whole-model substitution; long-context
+ tasks (LongBench, RULER, needle); wall-clock speedup vs FlashAttention/SDPA;
+ KV-cache decode-mode integration; GPU-resident ANN kernel.
+
+ **Runtime caveat:** the FAISS path here builds CPU indexes per batch and
+ the gather step uses dense-style tensor expansion. The compute-reduction
+ numbers below are *algorithmic scoring reductions, not measured wall-clock
+ speedups* (see the sketch after this hunk).
+
  ## What's in this repo

  Per-layer linear search projections `(W_Qs, W_Ks)` of shape `[2560, 64]`,
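For concreteness, a minimal sketch of how the shipped `(W_Qs, W_Ks)` projections and the per-batch CPU FAISS path flagged in the runtime caveat fit together. The metric choice, HNSW parameters, and over-fetch causal filtering here are illustrative assumptions, not the repo's code:

```python
import faiss
import numpy as np

d_model, d_search, L, K = 2560, 64, 1024, 128

rng = np.random.default_rng(0)
# Stand-ins for one layer's shipped (W_Qs, W_Ks); the real matrices in
# this repo have shape [2560, 64].
W_Qs = rng.standard_normal((d_model, d_search), dtype=np.float32)
W_Ks = rng.standard_normal((d_model, d_search), dtype=np.float32)
hidden = rng.standard_normal((L, d_model), dtype=np.float32)  # one sequence

q_s = hidden @ W_Qs   # [L, 64] query-side search vectors
k_s = hidden @ W_Ks   # [L, 64] key-side search vectors

# Per-batch CPU index build: the part the runtime caveat flags, since
# construction lands on the critical path of every forward pass.
index = faiss.IndexHNSWFlat(d_search, 32, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efSearch = 64
index.add(k_s)

# Over-fetch, then drop non-causal hits so the causal mask stays intact
# (one possible strategy; the repo's exact mechanism is not shown here).
# Early positions simply get fewer than K hits.
_, ids = index.search(q_s, min(2 * K, L))
topk_ids = [row[(row >= 0) & (row <= i)][:K] for i, row in enumerate(ids)]
```

Rebuilding the index inside the loop like this is exactly why the numbers below are algorithmic scoring reductions rather than wall-clock wins.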
@@ -56,19 +74,18 @@ on positions 30-128 is on near-zero-weight tail and doesn't affect output.
  | 256 | 31.6% | 9.88 | **−0.79%** |
  | 512 | 40.8% | 9.67 | **−2.89%** |

- **ANN at K ≥ 256 produces lower perplexity than full attention** — the
- sparse-attention denoising effect. Full softmax is forced to spread small
- amounts of weight over a long tail of irrelevant keys; truncating to top-K
- and renormalizing puts the weight where it matters.
-
- The novelty is not "ANN beats attention" (denoising is a property of any
- hard top-K selection over softmax, see Top-k Attention / Reformer). What's
- distinctive is that this method produces the denoised top-K at *sub-linear
- cost* via off-the-shelf FAISS HNSW, vs. computing all `O(L²)` scores first.
-
- Sanity checks pass: same input sequences for `ppl_full` vs `ppl_ann`,
- intact causal mask in retrieval, single-softmax renormalization with no
- wrapper leakage between iterations.
+ On this small WikiText slice, K ≥ 256 produced lower measured PPL than the
+ full-attention reference. Sparse-softmax denoising is a plausible cause, but
+ sample noise (only 12 eval batches), packed-boundary artifacts (the pilot
+ trained with packing on; the repo default is now off), and partial-layer
+ substitution acting as a regularizer are also candidates. Treat it as a
+ hypothesis, to be confirmed with an exact-top-K oracle (full QK^T, attention
+ restricted to the top-K scores) at the same K, which separates "denoising
+ from any sparsity" from "denoising from learned projections" (sketch below).
+
+ Code-level sanity checks pass: same input sequences for `ppl_full` vs
+ `ppl_ann`, intact causal mask in retrieval, single softmax over the
+ retrieved K with no wrapper leakage between iterations.
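A minimal sketch of that exact-top-K oracle: full `QK^T` under the causal mask, attention restricted to the K largest scores per query, and a single softmax over the survivors. Scaling, tie handling, and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def topk_oracle_attention(q, k, v, K):
    """Full-QK^T, top-K-restricted causal attention: the oracle baseline."""
    L, d = q.shape
    scores = (q @ k.T) / d ** 0.5                        # all O(L^2) scores
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    # Keep the K largest scores per row; masking the rest to -inf means the
    # single softmax renormalizes over the survivors only.
    kth = scores.topk(min(K, L), dim=-1).values[:, -1:]  # K-th largest per row
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(256, 64)
out = topk_oracle_attention(q, k, v, K=128)              # [256, 64]
```

Running this at the same K as the ANN path attributes any PPL gain to sparsity itself rather than to the learned projections.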
 
  ### Compute / quality knobs (FLOP-counted)

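As a hedge on what "FLOP-counted" covers, one back-of-envelope version of the accounting; the head dimension, the multiply-add-equals-2-FLOPs convention, and leaving HNSW traversal uncounted are all assumptions, so this need not reproduce the table above:

```python
# Illustrative scoring-FLOP accounting, not the repo's numbers.
d_model, d_head, d_search = 2560, 128, 64   # d_head is an assumption

def scoring_flops_full(L: int) -> int:
    # Every query scores every key at full head width.
    return 2 * L * L * d_head

def scoring_flops_ann(L: int, K: int) -> int:
    proj = 2 * 2 * L * d_model * d_search   # apply W_Qs and W_Ks to L positions
    topk = 2 * L * K * d_head               # score only the retrieved K keys
    return proj + topk                      # HNSW traversal left uncounted

L = 4096
for K in (128, 256, 512):
    frac = scoring_flops_ann(L, K) / scoring_flops_full(L)
    print(f"K={K}: {frac:.1%} of full scoring FLOPs")
```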
 
91