AbstractPhil commited on
Commit
d8a6cd0
Β·
verified Β·
1 Parent(s): fa015c1

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +94 -25
README.md CHANGED
@@ -26,7 +26,7 @@ The band exists as a function of embedding dimension only. Vocabulary size is ir
26
  |-----------|--------|-------------|
27
  | D=8 | 0.605 | Above band β€” volatile |
28
  | D=16 | 0.383 | Above band β€” entering |
29
- | D=24 | 0.304 | Upper edge |
30
  | **D=32** | **0.257** | **Center of band** |
31
  | **D=40** | **0.229** | **Center of band** |
32
  | **D=48** | **0.207** | **Center of band** |
@@ -143,6 +143,33 @@ if __name__ == "__main__":
143
  print(f" No exact decompositions β€” consider padding or truncating")
144
  ```
145
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
146
  ## Rescale and Sort
147
 
148
  ```python
@@ -233,35 +260,75 @@ if __name__ == "__main__":
233
  print(f"{row['D']:5d} {row['avg_cv']:8.4f} {row['in_band_pct']:5.1f}% {row['status']}")
234
  ```
235
 
236
- ## Large Vocabulary Ablation
237
 
238
- The CV is consistent with the findings and deterministically sample-capable for validity and conjunctive utility.
239
 
240
- ```
241
- D=32 fixed. CV across vocab sizes.
242
- Pool capped at 512 for fair comparison.
243
- ============================================================
244
- V= 32 D=32 CV=0.2578 0.1s 0MB
245
- V= 512 D=32 CV=0.2615 0.0s 0MB
246
- V= 8,192 D=32 CV=0.2578 0.0s 1MB
247
- V= 65,536 D=32 CV=0.2663 0.0s 8MB
248
- V= 131,072 D=32 CV=0.2590 0.0s 17MB
249
- V= 500,000 D=32 CV=0.2745 0.1s 64MB
250
- V= 1,000,000 D=32 CV=0.2645 0.2s 128MB
251
- V= 4,000,000 D=32 CV=0.2541 0.9s 512MB
252
- V=13,000,000 D=32 CV=0.2681 2.9s 1664MB
253
-
254
- ============================================================
255
- Now uncapped pool (sample from ALL embeddings):
256
- ============================================================
257
- V= 512 D=32 CV=0.2591 pool=512
258
- V= 8,192 D=32 CV=0.2427 pool=8192
259
- V= 65,536 D=32 CV=0.2684 pool=65536
260
- V= 500,000 D=32 CV=0.2562 pool=500000
261
- ```
 
262
 
 
263
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
264
 
 
265
 
266
  ## Implications for Architecture Design
267
 
@@ -271,6 +338,8 @@ The band is not a training outcome. It is a geometric property of dimensionality
271
  2. **A 768-dim model** should decompose into 24Γ—32 or 12Γ—64 compartments, not operate as a monolithic vector
272
  3. **The standard 64-dim attention head** may exist precisely because it sits inside this geometric band
273
  4. **Scaling** comes from composing band-valid units with geometric linkages, not from widening dimensions beyond the band
 
 
274
 
275
  ## Reproducing
276
 
 
26
  |-----------|--------|-------------|
27
  | D=8 | 0.605 | Above band β€” volatile |
28
  | D=16 | 0.383 | Above band β€” entering |
29
+ | D=24 | 0.304 | **Phase boundary β€” binding constant 0.29154** |
30
  | **D=32** | **0.257** | **Center of band** |
31
  | **D=40** | **0.229** | **Center of band** |
32
  | **D=48** | **0.207** | **Center of band** |
 
143
  print(f" No exact decompositions β€” consider padding or truncating")
144
  ```
145
 
146
+ ## Parse and Filter
147
+
148
+ ```python
149
+ import json
150
+
151
+ with open("cv_sweep.json") as f:
152
+ data = json.load(f)
153
+
154
+ # Filter for any CV range β€” example: binding constant region
155
+ lo, hi = 0.290, 0.292
156
+ hits = [e for e in data["band_results"] if lo <= e["CV"] <= hi]
157
+ hits.sort(key=lambda x: x["CV"])
158
+
159
+ print(f"CV in [{lo}, {hi}]: {len(hits)} entries")
160
+ for h in hits:
161
+ print(f" V={h['V']:6d} D={h['D']:4d} CV={h['CV']:.4f}")
162
+
163
+ # Group by D
164
+ dims = {}
165
+ for h in hits:
166
+ dims.setdefault(h["D"], []).append(h)
167
+ for d in sorted(dims):
168
+ entries = dims[d]
169
+ print(f" D={d:3d}: {len(entries)} entries "
170
+ f"CV={min(e['CV'] for e in entries):.4f}-{max(e['CV'] for e in entries):.4f}")
171
+ ```
172
+
173
  ## Rescale and Sort
174
 
175
  ```python
 
260
  print(f"{row['D']:5d} {row['avg_cv']:8.4f} {row['in_band_pct']:5.1f}% {row['status']}")
261
  ```
262
 
263
+ ## The Binding Constant is D=24
264
 
265
+ Filtering the sweep for CV in [0.290, 0.292] β€” the region around the empirically observed binding constant 0.29154 β€” returns 12 entries:
266
 
267
+ | V | D | CV |
268
+ |---|---|-----|
269
+ | 24 | 16 | 0.2900 |
270
+ | 368 | 32 | 0.2903 |
271
+ | 1632 | 24 | 0.2906 |
272
+ | 208 | 24 | 0.2908 |
273
+ | 1096 | 24 | 0.2911 |
274
+ | 1992 | 24 | 0.2911 |
275
+ | 200 | 24 | 0.2914 |
276
+ | 1024 | 24 | 0.2916 |
277
+ | 760 | 24 | 0.2917 |
278
+ | 1232 | 24 | 0.2917 |
279
+ | 776 | 24 | 0.2919 |
280
+ | 904 | 24 | 0.2920 |
281
+
282
+ 10 of 12 entries are D=24. The binding constant 0.29154 is the native CV of a 24-dimensional embedding space. It is not a learned value. It is not an empirical coincidence. It is the geometric fingerprint of D=24.
283
+
284
+ ## The Computational Boundary
285
+
286
+ D=24 is also the exact dimension where custom SVD kernels hit an 8x performance cliff and eigendecomposition (eigh) collapses. The binding constant marks a dual boundary:
287
+
288
+ - **Geometric**: the phase transition between volatile simplex volumes (above 0.30) and discriminative geometry (below 0.30)
289
+ - **Computational**: the resolution limit of compact spectral decomposition kernels
290
 
291
+ Every time the constant 0.29154 appeared across 17+ pretrained models, the system was measuring the dimensional fingerprint of its own computational ceiling. The constellation encoded this ceiling as a structural constant because it could not compute past it.
292
 
293
+ D=32 is the first dimension past this wall that remains in band (CV ~0.257). Operating there requires `torch.linalg.det` on a 6Γ—6 CM matrix β€” which compiles regardless of embedding dimension, because the CM matrix is always 6Γ—6 for five-point simplices. The pairwise distances are computed via gram matrix (batched matmul, compiles perfectly). Only the `det` call touches linalg, and 6Γ—6 is well within kernel range.
294
+
295
+ ## MHA Activation Geometry
296
+
297
+ Measuring CV on per-head Q/K/V **activations** (not weights) after training reveals head_dim-dependent geometric behavior:
298
+
299
+ | head_dim | Q activation CV | K activation CV | V activation CV |
300
+ |----------|----------------|----------------|----------------|
301
+ | 64 | ~0.32 | ~0.42 | ~0.41 |
302
+ | 32 | ~0.38 | ~0.45 | ~0.43 |
303
+ | 16 | ~0.48 | ~0.70 | ~0.53 |
304
+ | 8 | ~0.65 | ~0.77 | ~0.63 |
305
+
306
+ Key observations:
307
+
308
+ - **Embedding activations are always in band** (CV 0.19–0.30) regardless of nominal D β€” training compresses effective dimensionality into band
309
+ - **K activations are asymmetrically volatile** β€” keys spread further than queries to make attention discriminative
310
+ - **Q activations track head_dim** following the same curve as the embedding sweep β€” the 64-dim convention keeps Q near band edge
311
+ - **The Q/K ratio** measures selectivity pressure: too high = brittle attention, too close to 1.0 = uniform attention
312
+
313
+ These ratios can be used as a zero-cost diagnostic on any pretrained transformer: forward one batch, measure per-head activation CV, and immediately identify which heads are geometrically healthy vs collapsing.
314
+
315
+ ## Vocabulary Independence
316
+
317
+ CV at D=32 was verified from V=32 to V=13,000,000. The result is invariant:
318
+
319
+ ```
320
+ V= 32 D=32 CV=0.2578
321
+ V= 512 D=32 CV=0.2615
322
+ V= 8,192 D=32 CV=0.2578
323
+ V= 65,536 D=32 CV=0.2663
324
+ V= 131,072 D=32 CV=0.2590
325
+ V= 500,000 D=32 CV=0.2745
326
+ V= 1,000,000 D=32 CV=0.2645
327
+ V= 4,000,000 D=32 CV=0.2541
328
+ V=13,000,000 D=32 CV=0.2681
329
+ ```
330
 
331
+ Vocabulary size does not gate band membership. The CM determinant samples 5 points β€” the distribution of simplex volumes depends on ambient dimensionality, not on the number of points in the space.
332
 
333
  ## Implications for Architecture Design
334
 
 
338
  2. **A 768-dim model** should decompose into 24Γ—32 or 12Γ—64 compartments, not operate as a monolithic vector
339
  3. **The standard 64-dim attention head** may exist precisely because it sits inside this geometric band
340
  4. **Scaling** comes from composing band-valid units with geometric linkages, not from widening dimensions beyond the band
341
+ 5. **D=24 (CV=0.29154)** is the phase boundary β€” any component pushed above this threshold has crossed from structured into volatile geometry
342
+ 6. **The 6Γ—6 CM determinant compiles** at any embedding dimension β€” the computational bottleneck was in spectral decomposition, not in the geometric measurement itself
343
 
344
  ## Reproducing
345