AbstractPhil
/

geolip-deep-embedding-analysis

Model card Files Files and versions

xet

Community

AbstractPhil commited on Apr 5

Commit

d8a6cd0

verified ·

1 Parent(s): fa015c1

Update README.md

Browse files

Files changed (1) hide show

README.md +94 -25

README.md CHANGED Viewed

@@ -26,7 +26,7 @@ The band exists as a function of embedding dimension only. Vocabulary size is ir
 |-----------|--------|-------------|
 | D=8 | 0.605 | Above band — volatile |
 | D=16 | 0.383 | Above band — entering |
-| D=24 | 0.304 | Upper edge |
 | **D=32** | **0.257** | **Center of band** |
 | **D=40** | **0.229** | **Center of band** |
 | **D=48** | **0.207** | **Center of band** |
@@ -143,6 +143,33 @@ if __name__ == "__main__":
             print(f"  No exact decompositions — consider padding or truncating")
 ```
 ## Rescale and Sort
 ```python
@@ -233,35 +260,75 @@ if __name__ == "__main__":
             print(f"{row['D']:5d}  {row['avg_cv']:8.4f}  {row['in_band_pct']:5.1f}%  {row['status']}")
 ```
-## Large Vocabulary Ablation
-The CV is consistent with the findings and deterministically sample-capable for validity and conjunctive utility.
-```
-D=32 fixed. CV across vocab sizes.
-Pool capped at 512 for fair comparison.
-============================================================
-  V=        32  D=32  CV=0.2578  0.1s  0MB
-  V=       512  D=32  CV=0.2615  0.0s  0MB
-  V=     8,192  D=32  CV=0.2578  0.0s  1MB
-  V=    65,536  D=32  CV=0.2663  0.0s  8MB
-  V=   131,072  D=32  CV=0.2590  0.0s  17MB
-  V=   500,000  D=32  CV=0.2745  0.1s  64MB
-  V= 1,000,000  D=32  CV=0.2645  0.2s  128MB
-  V= 4,000,000  D=32  CV=0.2541  0.9s  512MB
-  V=13,000,000  D=32  CV=0.2681  2.9s  1664MB
-============================================================
-Now uncapped pool (sample from ALL embeddings):
-============================================================
-  V=       512  D=32  CV=0.2591  pool=512
-  V=     8,192  D=32  CV=0.2427  pool=8192
-  V=    65,536  D=32  CV=0.2684  pool=65536
-  V=   500,000  D=32  CV=0.2562  pool=500000
-```
 ## Implications for Architecture Design
@@ -271,6 +338,8 @@ The band is not a training outcome. It is a geometric property of dimensionality
 2. **A 768-dim model** should decompose into 24×32 or 12×64 compartments, not operate as a monolithic vector
 3. **The standard 64-dim attention head** may exist precisely because it sits inside this geometric band
 4. **Scaling** comes from composing band-valid units with geometric linkages, not from widening dimensions beyond the band
 ## Reproducing

 |-----------|--------|-------------|
 | D=8 | 0.605 | Above band — volatile |
 | D=16 | 0.383 | Above band — entering |
+| D=24 | 0.304 | **Phase boundary — binding constant 0.29154** |
 | **D=32** | **0.257** | **Center of band** |
 | **D=40** | **0.229** | **Center of band** |
 | **D=48** | **0.207** | **Center of band** |
             print(f"  No exact decompositions — consider padding or truncating")
 ```
+## Parse and Filter
+```python
+import json
+with open("cv_sweep.json") as f:
+    data = json.load(f)
+# Filter for any CV range — example: binding constant region
+lo, hi = 0.290, 0.292
+hits = [e for e in data["band_results"] if lo <= e["CV"] <= hi]
+hits.sort(key=lambda x: x["CV"])
+print(f"CV in [{lo}, {hi}]: {len(hits)} entries")
+for h in hits:
+    print(f"  V={h['V']:6d}  D={h['D']:4d}  CV={h['CV']:.4f}")
+# Group by D
+dims = {}
+for h in hits:
+    dims.setdefault(h["D"], []).append(h)
+for d in sorted(dims):
+    entries = dims[d]
+    print(f"  D={d:3d}: {len(entries)} entries  "
+          f"CV={min(e['CV'] for e in entries):.4f}-{max(e['CV'] for e in entries):.4f}")
+```
 ## Rescale and Sort
 ```python
             print(f"{row['D']:5d}  {row['avg_cv']:8.4f}  {row['in_band_pct']:5.1f}%  {row['status']}")
 ```
+## The Binding Constant is D=24
+Filtering the sweep for CV in [0.290, 0.292] — the region around the empirically observed binding constant 0.29154 — returns 12 entries:
+| V | D | CV |
+|---|---|-----|
+| 24 | 16 | 0.2900 |
+| 368 | 32 | 0.2903 |
+| 1632 | 24 | 0.2906 |
+| 208 | 24 | 0.2908 |
+| 1096 | 24 | 0.2911 |
+| 1992 | 24 | 0.2911 |
+| 200 | 24 | 0.2914 |
+| 1024 | 24 | 0.2916 |
+| 760 | 24 | 0.2917 |
+| 1232 | 24 | 0.2917 |
+| 776 | 24 | 0.2919 |
+| 904 | 24 | 0.2920 |
+10 of 12 entries are D=24. The binding constant 0.29154 is the native CV of a 24-dimensional embedding space. It is not a learned value. It is not an empirical coincidence. It is the geometric fingerprint of D=24.
+## The Computational Boundary
+D=24 is also the exact dimension where custom SVD kernels hit an 8x performance cliff and eigendecomposition (eigh) collapses. The binding constant marks a dual boundary:
+- **Geometric**: the phase transition between volatile simplex volumes (above 0.30) and discriminative geometry (below 0.30)
+- **Computational**: the resolution limit of compact spectral decomposition kernels
+Every time the constant 0.29154 appeared across 17+ pretrained models, the system was measuring the dimensional fingerprint of its own computational ceiling. The constellation encoded this ceiling as a structural constant because it could not compute past it.
+D=32 is the first dimension past this wall that remains in band (CV ~0.257). Operating there requires `torch.linalg.det` on a 6×6 CM matrix — which compiles regardless of embedding dimension, because the CM matrix is always 6×6 for five-point simplices. The pairwise distances are computed via gram matrix (batched matmul, compiles perfectly). Only the `det` call touches linalg, and 6×6 is well within kernel range.
+## MHA Activation Geometry
+Measuring CV on per-head Q/K/V **activations** (not weights) after training reveals head_dim-dependent geometric behavior:
+| head_dim | Q activation CV | K activation CV | V activation CV |
+|----------|----------------|----------------|----------------|
+| 64 | ~0.32 | ~0.42 | ~0.41 |
+| 32 | ~0.38 | ~0.45 | ~0.43 |
+| 16 | ~0.48 | ~0.70 | ~0.53 |
+| 8 | ~0.65 | ~0.77 | ~0.63 |
+Key observations:
+- **Embedding activations are always in band** (CV 0.19–0.30) regardless of nominal D — training compresses effective dimensionality into band
+- **K activations are asymmetrically volatile** — keys spread further than queries to make attention discriminative
+- **Q activations track head_dim** following the same curve as the embedding sweep — the 64-dim convention keeps Q near band edge
+- **The Q/K ratio** measures selectivity pressure: too high = brittle attention, too close to 1.0 = uniform attention
+These ratios can be used as a zero-cost diagnostic on any pretrained transformer: forward one batch, measure per-head activation CV, and immediately identify which heads are geometrically healthy vs collapsing.
+## Vocabulary Independence
+CV at D=32 was verified from V=32 to V=13,000,000. The result is invariant:
+```
+V=        32  D=32  CV=0.2578
+V=       512  D=32  CV=0.2615
+V=     8,192  D=32  CV=0.2578
+V=    65,536  D=32  CV=0.2663
+V=   131,072  D=32  CV=0.2590
+V=   500,000  D=32  CV=0.2745
+V= 1,000,000  D=32  CV=0.2645
+V= 4,000,000  D=32  CV=0.2541
+V=13,000,000  D=32  CV=0.2681
+```
+Vocabulary size does not gate band membership. The CM determinant samples 5 points — the distribution of simplex volumes depends on ambient dimensionality, not on the number of points in the space.
 ## Implications for Architecture Design
 2. **A 768-dim model** should decompose into 24×32 or 12×64 compartments, not operate as a monolithic vector
 3. **The standard 64-dim attention head** may exist precisely because it sits inside this geometric band
 4. **Scaling** comes from composing band-valid units with geometric linkages, not from widening dimensions beyond the band
+5. **D=24 (CV=0.29154)** is the phase boundary — any component pushed above this threshold has crossed from structured into volatile geometry
+6. **The 6×6 CM determinant compiles** at any embedding dimension — the computational bottleneck was in spectral decomposition, not in the geometric measurement itself
 ## Reproducing