Update README.md
Browse files
README.md
CHANGED
|
@@ -26,7 +26,7 @@ The band exists as a function of embedding dimension only. Vocabulary size is ir
|
|
| 26 |
|-----------|--------|-------------|
|
| 27 |
| D=8 | 0.605 | Above band β volatile |
|
| 28 |
| D=16 | 0.383 | Above band β entering |
|
| 29 |
-
| D=24 | 0.304 |
|
| 30 |
| **D=32** | **0.257** | **Center of band** |
|
| 31 |
| **D=40** | **0.229** | **Center of band** |
|
| 32 |
| **D=48** | **0.207** | **Center of band** |
|
|
@@ -143,6 +143,33 @@ if __name__ == "__main__":
|
|
| 143 |
print(f" No exact decompositions β consider padding or truncating")
|
| 144 |
```
|
| 145 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 146 |
## Rescale and Sort
|
| 147 |
|
| 148 |
```python
|
|
@@ -233,35 +260,75 @@ if __name__ == "__main__":
|
|
| 233 |
print(f"{row['D']:5d} {row['avg_cv']:8.4f} {row['in_band_pct']:5.1f}% {row['status']}")
|
| 234 |
```
|
| 235 |
|
| 236 |
-
##
|
| 237 |
|
| 238 |
-
|
| 239 |
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
|
| 261 |
-
|
|
|
|
| 262 |
|
|
|
|
| 263 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 264 |
|
|
|
|
| 265 |
|
| 266 |
## Implications for Architecture Design
|
| 267 |
|
|
@@ -271,6 +338,8 @@ The band is not a training outcome. It is a geometric property of dimensionality
|
|
| 271 |
2. **A 768-dim model** should decompose into 24Γ32 or 12Γ64 compartments, not operate as a monolithic vector
|
| 272 |
3. **The standard 64-dim attention head** may exist precisely because it sits inside this geometric band
|
| 273 |
4. **Scaling** comes from composing band-valid units with geometric linkages, not from widening dimensions beyond the band
|
|
|
|
|
|
|
| 274 |
|
| 275 |
## Reproducing
|
| 276 |
|
|
|
|
| 26 |
|-----------|--------|-------------|
|
| 27 |
| D=8 | 0.605 | Above band β volatile |
|
| 28 |
| D=16 | 0.383 | Above band β entering |
|
| 29 |
+
| D=24 | 0.304 | **Phase boundary β binding constant 0.29154** |
|
| 30 |
| **D=32** | **0.257** | **Center of band** |
|
| 31 |
| **D=40** | **0.229** | **Center of band** |
|
| 32 |
| **D=48** | **0.207** | **Center of band** |
|
|
|
|
| 143 |
print(f" No exact decompositions β consider padding or truncating")
|
| 144 |
```
|
| 145 |
|
| 146 |
+
## Parse and Filter
|
| 147 |
+
|
| 148 |
+
```python
|
| 149 |
+
import json
|
| 150 |
+
|
| 151 |
+
with open("cv_sweep.json") as f:
|
| 152 |
+
data = json.load(f)
|
| 153 |
+
|
| 154 |
+
# Filter for any CV range β example: binding constant region
|
| 155 |
+
lo, hi = 0.290, 0.292
|
| 156 |
+
hits = [e for e in data["band_results"] if lo <= e["CV"] <= hi]
|
| 157 |
+
hits.sort(key=lambda x: x["CV"])
|
| 158 |
+
|
| 159 |
+
print(f"CV in [{lo}, {hi}]: {len(hits)} entries")
|
| 160 |
+
for h in hits:
|
| 161 |
+
print(f" V={h['V']:6d} D={h['D']:4d} CV={h['CV']:.4f}")
|
| 162 |
+
|
| 163 |
+
# Group by D
|
| 164 |
+
dims = {}
|
| 165 |
+
for h in hits:
|
| 166 |
+
dims.setdefault(h["D"], []).append(h)
|
| 167 |
+
for d in sorted(dims):
|
| 168 |
+
entries = dims[d]
|
| 169 |
+
print(f" D={d:3d}: {len(entries)} entries "
|
| 170 |
+
f"CV={min(e['CV'] for e in entries):.4f}-{max(e['CV'] for e in entries):.4f}")
|
| 171 |
+
```
|
| 172 |
+
|
| 173 |
## Rescale and Sort
|
| 174 |
|
| 175 |
```python
|
|
|
|
| 260 |
print(f"{row['D']:5d} {row['avg_cv']:8.4f} {row['in_band_pct']:5.1f}% {row['status']}")
|
| 261 |
```
|
| 262 |
|
| 263 |
+
## The Binding Constant is D=24
|
| 264 |
|
| 265 |
+
Filtering the sweep for CV in [0.290, 0.292] β the region around the empirically observed binding constant 0.29154 β returns 12 entries:
|
| 266 |
|
| 267 |
+
| V | D | CV |
|
| 268 |
+
|---|---|-----|
|
| 269 |
+
| 24 | 16 | 0.2900 |
|
| 270 |
+
| 368 | 32 | 0.2903 |
|
| 271 |
+
| 1632 | 24 | 0.2906 |
|
| 272 |
+
| 208 | 24 | 0.2908 |
|
| 273 |
+
| 1096 | 24 | 0.2911 |
|
| 274 |
+
| 1992 | 24 | 0.2911 |
|
| 275 |
+
| 200 | 24 | 0.2914 |
|
| 276 |
+
| 1024 | 24 | 0.2916 |
|
| 277 |
+
| 760 | 24 | 0.2917 |
|
| 278 |
+
| 1232 | 24 | 0.2917 |
|
| 279 |
+
| 776 | 24 | 0.2919 |
|
| 280 |
+
| 904 | 24 | 0.2920 |
|
| 281 |
+
|
| 282 |
+
10 of 12 entries are D=24. The binding constant 0.29154 is the native CV of a 24-dimensional embedding space. It is not a learned value. It is not an empirical coincidence. It is the geometric fingerprint of D=24.
|
| 283 |
+
|
| 284 |
+
## The Computational Boundary
|
| 285 |
+
|
| 286 |
+
D=24 is also the exact dimension where custom SVD kernels hit an 8x performance cliff and eigendecomposition (eigh) collapses. The binding constant marks a dual boundary:
|
| 287 |
+
|
| 288 |
+
- **Geometric**: the phase transition between volatile simplex volumes (above 0.30) and discriminative geometry (below 0.30)
|
| 289 |
+
- **Computational**: the resolution limit of compact spectral decomposition kernels
|
| 290 |
|
| 291 |
+
Every time the constant 0.29154 appeared across 17+ pretrained models, the system was measuring the dimensional fingerprint of its own computational ceiling. The constellation encoded this ceiling as a structural constant because it could not compute past it.
|
| 292 |
|
| 293 |
+
D=32 is the first dimension past this wall that remains in band (CV ~0.257). Operating there requires `torch.linalg.det` on a 6Γ6 CM matrix β which compiles regardless of embedding dimension, because the CM matrix is always 6Γ6 for five-point simplices. The pairwise distances are computed via gram matrix (batched matmul, compiles perfectly). Only the `det` call touches linalg, and 6Γ6 is well within kernel range.
|
| 294 |
+
|
| 295 |
+
## MHA Activation Geometry
|
| 296 |
+
|
| 297 |
+
Measuring CV on per-head Q/K/V **activations** (not weights) after training reveals head_dim-dependent geometric behavior:
|
| 298 |
+
|
| 299 |
+
| head_dim | Q activation CV | K activation CV | V activation CV |
|
| 300 |
+
|----------|----------------|----------------|----------------|
|
| 301 |
+
| 64 | ~0.32 | ~0.42 | ~0.41 |
|
| 302 |
+
| 32 | ~0.38 | ~0.45 | ~0.43 |
|
| 303 |
+
| 16 | ~0.48 | ~0.70 | ~0.53 |
|
| 304 |
+
| 8 | ~0.65 | ~0.77 | ~0.63 |
|
| 305 |
+
|
| 306 |
+
Key observations:
|
| 307 |
+
|
| 308 |
+
- **Embedding activations are always in band** (CV 0.19β0.30) regardless of nominal D β training compresses effective dimensionality into band
|
| 309 |
+
- **K activations are asymmetrically volatile** β keys spread further than queries to make attention discriminative
|
| 310 |
+
- **Q activations track head_dim** following the same curve as the embedding sweep β the 64-dim convention keeps Q near band edge
|
| 311 |
+
- **The Q/K ratio** measures selectivity pressure: too high = brittle attention, too close to 1.0 = uniform attention
|
| 312 |
+
|
| 313 |
+
These ratios can be used as a zero-cost diagnostic on any pretrained transformer: forward one batch, measure per-head activation CV, and immediately identify which heads are geometrically healthy vs collapsing.
|
| 314 |
+
|
| 315 |
+
## Vocabulary Independence
|
| 316 |
+
|
| 317 |
+
CV at D=32 was verified from V=32 to V=13,000,000. The result is invariant:
|
| 318 |
+
|
| 319 |
+
```
|
| 320 |
+
V= 32 D=32 CV=0.2578
|
| 321 |
+
V= 512 D=32 CV=0.2615
|
| 322 |
+
V= 8,192 D=32 CV=0.2578
|
| 323 |
+
V= 65,536 D=32 CV=0.2663
|
| 324 |
+
V= 131,072 D=32 CV=0.2590
|
| 325 |
+
V= 500,000 D=32 CV=0.2745
|
| 326 |
+
V= 1,000,000 D=32 CV=0.2645
|
| 327 |
+
V= 4,000,000 D=32 CV=0.2541
|
| 328 |
+
V=13,000,000 D=32 CV=0.2681
|
| 329 |
+
```
|
| 330 |
|
| 331 |
+
Vocabulary size does not gate band membership. The CM determinant samples 5 points β the distribution of simplex volumes depends on ambient dimensionality, not on the number of points in the space.
|
| 332 |
|
| 333 |
## Implications for Architecture Design
|
| 334 |
|
|
|
|
| 338 |
2. **A 768-dim model** should decompose into 24Γ32 or 12Γ64 compartments, not operate as a monolithic vector
|
| 339 |
3. **The standard 64-dim attention head** may exist precisely because it sits inside this geometric band
|
| 340 |
4. **Scaling** comes from composing band-valid units with geometric linkages, not from widening dimensions beyond the band
|
| 341 |
+
5. **D=24 (CV=0.29154)** is the phase boundary β any component pushed above this threshold has crossed from structured into volatile geometry
|
| 342 |
+
6. **The 6Γ6 CM determinant compiles** at any embedding dimension β the computational bottleneck was in spectral decomposition, not in the geometric measurement itself
|
| 343 |
|
| 344 |
## Reproducing
|
| 345 |
|