geolip-aleph-lm

Aleph-routed attention: a signed-projective geometric codebook serving as both the attention feature map and the prediction substrate of a byte-level language model.

This repository holds the code, checkpoints, and experiment ledger for the aleph line of the GeoLIP geometric program. It is a research instrument, not a product model — the artifacts here are language models in the strict density-estimation sense (they predict the next bytes of English), and they are studied for what their geometry does, not deployed for what they say.

Part of the GeoLIP geometric ecosystem. Companion repositories: procrustes-analysis (the 17-model cross-architecture survey), geolip-deep-embedding-analysis (the 65,536-config CV-band study), and the GeoLIP constellation packages.

What an "aleph" is

An aleph is a learned codebook of K unit directions in a low-dimensional address space (D-dimensional, default D=4). At each position the model computes a signed antipodal address over those directions,

p±  =  exp(±u − m) / Z

a pair of probability vectors (positive and negative hemisphere) on the 2K-simplex, with a guaranteed-positive denominator that makes the whole operation a valid linear-attention feature map and exactly streamable. The same codebook can serve two roles:

Routing — the address is the attention kernel feature map, giving O(n·K) attention instead of O(n²) softmax. Every configuration in this repo is aleph-routed.
Prediction — when the output head scores candidates through the codebook (the kernel and apmix heads), prediction error flows back into the codebook geometry, closing a control loop. This is aleph prediction, and it is the load-bearing claim of the program.

The distinction between routing and prediction is not cosmetic — it is the single most important finding here, measured directly (see The controller contrast).

Architecture

AlephLM is a byte-level trigram model: it predicts the next three bytes jointly through a gate × bank + byte-tail mixture.

Backbone — a stack of aleph-routed attention layers (signed-projective addressing, linear in sequence length, constant recurrent state). The 23M configuration carries ~532K floats of recurrent state and an effective context of ~12 KB through streaming, at constant memory.
Heads (selectable, this is what the experiments compare):
- byte — plain byte softmaxes (the control).
- kernel — π·κ log-kernel over the bank, through the codebook (the original aleph predictor; rank-bounded on dense banks by construction).
- pmix — a mixture-of-pointers over an independent learned sphere (escapes the rank bound, but bypasses the codebook — see the audit).
- apmix — mixture-of-pointers aiming in the Hellinger embedding of the signed codebook address (√κ on S^(2K−1), Bhattacharyya affinity). Keeps the rank escape and restores full prediction pressure through the codebook. This is the synthesis head.
Bank — an external tensor of candidate trigrams drawn from a stratified lexical atlas (12.9M n-grams across char/word/unicode strata). The bank is data, not parameters — swappable without retraining (demonstrated below).
Statute instrumentation — a live gauge reports the codebook's geometric state every log step: deviation on a degenerate↔uniform↔polytope axis, effective rank, minimum-angle spread, routing concentration. Collapse and health are visible in real time.

The two-hemisphere law

The codebook is governed by two opposing forces, both observed directly:

Routing pressure is rank-destructive. Under a head that does not employ the codebook for prediction, streamed long-context routing drives the codebook toward low rank (a 23M streamed run fell from effective rank 3.66 to 1.89; see the calibration run below).
Prediction pressure is rank-constructive. Under a head that scores through the codebook, discrimination demand builds rank (kernel runs climb monotonically, effective rank 3.82 → 3.93 over 10K steps, four-way replicated).

A codebook collapses not because something attacks it but because its only demanding employer was removed. This reframes "preservation" as employment.

The controller contrast

The decisive measurement. Two trained 23M checkpoints — one with a codebook-employing kernel head, one with a codebook-bypassing pmix head — subjected to the identical codebook ablation (permute the codebook rows, re-measure bits-per-byte):

model	head	permute Δbpb	randomize Δbpb
kernel (closed loop)	`kernel`	+2.33	+2.26
pmix (open loop)	`pmix`	+0.009	+0.020

Same formulas, same K=64/D=4 codebook, same corpus. A ~250× dependence ratio. The aleph is a genuine structural controller — load-bearing to the tune of 2.3 bits per byte — exactly where prediction closes the loop, and vestigial where it does not. The bypassing model had learned to route around its own codebook.

This is why the program's verdict is no champion without the aleph doing work: a strong bits-per-byte number from a bypassing head is a remix of mixture-of-softmaxes and linear attention, not a test of the geometric thesis.

What's in the repo

checkpoints/                  every LM checkpoint, flat (full AlephLM state dicts)
  exp_008_tier_a.pt           the 23M calibration model
  d4.pt … d1024.pt            exp_004 multiscale-lens ladder (D=4…1024)
  {byte,kernel_*,pmix*_*}_s*.pt   exp_007 seed battery (18 runs, two seeds)
  {free,div_*,frozen,slowlane}.pt exp_009 preservation arms
  {ctx,kaux,head,bld,apx,pmx,bk,wr,gr}*.pt   exp_010 component survey
CHECKPOINTS.md                load-and-purpose quick reference
EXPERIMENTAL_CHECKPOINTS.md   per-checkpoint statistics (bpb, dev, rank, params)
RESEARCH_HISTORY.md           the full discovery ledger and theorem table

Checkpoints were migrated flat from the experiment tree of the source research repo; each is a complete AlephLM state dict (model_state_dict, config, bank). The two manifest files document purpose and statistics respectively. Codebook snapshots and result tables remain in the source experiment repo.

Headline results

The 23M calibration run (exp_008)

A 23,114,451-parameter model, 8000 steps on WikiText-103-raw, streamed to ~4096 effective trigrams, pmix head. Reached 2.16 bits/byte (train-stream), bracketing English's trigram branching factor at the measured mixture entropy. Honestly positioned: this is 2013–2015 LSTM-class statistics — a better compressor than bzip2, behind modern same-parameter models, and beatable by a classical Kneser-Ney 5-gram on this corpus. It proved the scaffolding scales; it also collapsed its (unemployed) codebook, which is what motivated the identity audit and the restoration head.

The identity audit and `apmix` (exp_009 + the audit)

A code trace plus the ablation above established that the pmix champion was not performing aleph prediction. The apmix head was designed to re-close the loop, and a five-arm preservation battery confirmed the open-loop neutrality prediction sharply: free, diversity-regularized, frozen, and slow-lane codebooks all landed within 0.001 bpb of each other under pmix — the codebook is pure ornament when the loop is open.

Earlier experiments present in the repo

Alongside the runs detailed below, checkpoints/ includes two earlier lines: the exp_004 multiscale-lens ladder (single-aleph models at D = 4 … 1024, the empirical basis for choosing D=4 as the working address dimension) and the exp_007 seed battery (18 two-seed runs establishing the statute dose-response and the scorer×bank crossover that exposed the T7 rank bound; champion arm pmix8_dense at 2.323 bpb). Statistics for these are in their source experiment directories; the per-checkpoint table in EXPERIMENTAL_CHECKPOINTS.md focuses on the exp_008–010 runs profiled in detail.

The component-potential survey (exp_010, 44 arms)

A small-backbone (6.75M) sweep across eight families, ~4 minutes per arm:

Hemisphere dial is free. A kernel-auxiliary loss through the codebook moves deviation −0.004 → +0.007 and effective rank 3.79 → 3.90 dose-monotone, while in-bank NLL, coverage, and gate accuracy stay identical to three decimals. Codebook employment at zero LM-quality cost.
Statute-by-construction earns its keep. A frozen, repulsion-optimized maximal-spread codebook (the "spread" construction) beat the free codebook under both employed heads (apmix 2.785 < 2.796; kernel 2.877 < 2.889), while a frozen-random control beat neither — so the gain is the geometry, not the freezing.

Grafting works — three ways. Warm-starting an expanded model from a trained donor (2000 steps) versus cold training:

graft	bpb @2k	reference
depth 4→6 (layer interpolation)	2.484	beats its own 5k donor (2.577) and 5k cold (2.718)
mixture J→8 (tile + jitter)	2.468	beats donor
head `pmix`→`apmix` (identity retrofit)	2.950	beats cold same-arch (3.093)

The depth graft beat the model it was grown from, in 40% of the steps. None collapsed on transplant.

Two clean negatives. The write-head auxiliary is bits-per-byte-neutral at this scale (drop candidate). The bank survives a full vocabulary swap with continuous geometry (vocabulary is data).

The practical upshot: employment + construction + grafting is the cohesive shape. A pretrained model can be retrofitted to the closed-loop apmix head, deepened, or widened — each as a cheap warm-start rather than a retrain.

How the aleph relates to known work

No major model uses the aleph as constructed. Organ-level cousins exist and the program names them honestly: MoE routers (an unsigned, single-hemisphere address that routes FFN compute, with router-collapse mirroring codebook degeneracy and balance losses mirroring the diversity term); Performer/FAVOR+ (positive feature maps for linear attention, but random and high-dimensional rather than a learned low-D codebook); product-key memory; Perceiver inducing points; VQ-VAE codebooks (hard quantization vs. the aleph's soft addressing). What is unoccupied: the signed antipodal address serving attention and prediction through one geometry, the statute/basin theory treating that geometry as a first-class governed object, and the controller-contrast methodology itself. A formal prior-art sweep is recommended before any formal publication.

Connections to the wider GeoLIP program

The aleph's home dimension D=4 sits deliberately in the volatile regime: the CV-band physics measured across 65,536 configurations and 17+ models place the phase boundary at the binding constant 0.29154 (with the trained-activation CV converging to the uniform-sphere value for each ambient dimension). D=4 extrapolates to CV ≈ 0.9, far above the stable band — which is exactly why the aleph's defense stack (sphere-normalization, bounded scaling, the statute gauge) is load-bearing rather than decorative. The aleph line is the language-modeling probe of that same geometric substrate.

Status and honesty notes

These models complete text; they do not converse or follow instructions. Nearest legitimate-LLM-fluency tier (TinyStories-class) is ~10× the calibration run's compute — roughly one GPU-day — and remains future work.
Bits-per-byte figures from training streams are not held-out validation and run slightly optimistic; WikiText-103 is cleaner than enwik8.
The geometric findings (controller contrast, two-hemisphere law, construction advantage) are the contribution. The bits-per-byte ladder is where the rest of the field lives; the geometry is where this program lives.
Reliable as of the experiments dated in RESEARCH_HISTORY.md (2026-06). Claims there carry their evidence basis and date.

Citation

@misc{geolip_aleph_lm_2026,
  title  = {geolip-aleph-lm: Aleph-Routed Attention as a Structural Controller},
  author = {Coelho, Philip},
  year   = {2026},
  note   = {GeoLIP geometric program},
  url    = {https://huggingface.co/AbstractPhil/geolip-aleph-lm}
}

License

MIT.

Downloads last month: -; Downloads are not tracked for this model. How to track