geolip-hypersphere-experiments / OMEGA_PROGRESSION.md

Potential Downstream Utilities Clause

Status: Forward-looking. Each utility takes the Omega substrate as a load-bearing assumption — regime-independence of reconstruction quality across input scale, the projective-axis codebook as a deterministic property of trained sphere-solvers, and hardware-determined throughput limits independent of model behavior. Utilities that would work equivalently on any encoder are excluded; this is a list of capabilities that are enabled by Omega, not capabilities incidentally compatible with it.

Methodology. Per the post-000108 research stage, every utility section ends with a falsifiable prediction — what would have to be true for the utility to NOT work. Construction precedes proof. The first build that fails its prediction tells us where the substrate's boundary actually is.


1. Classification

The utility. A projective codebook of n_axes directions on ℝP^(D-1) is a vocabulary of feature primitives. Image → patch grid → M tensor → per-patch projection onto codebook axes → activation pattern of shape [B, n_patches, V, n_axes]. A linear or shallow head over this representation performs classification.

Why Omega. The codebook is model-intrinsic and regime-flat. A classifier trained on activation patterns at 64×64 should generalize to 512×512 inputs at inference without retraining, because the codebook itself doesn't change with input size. Standard CLIP-style models do not give this property — their representations drift with input resolution; their pooling operations bake in a particular spatial extent.

Specific construction. Train a classifier head on per-patch axis activations averaged across patches (or attended over). For fine-grained tasks, retain the spatial structure: the classifier sees the full [n_patches, n_axes] matrix as a 2D feature map. Per-patch aggregation was already validated in scratchpad 000104 — patch_idx=0 fails because it discards spatial signal; patch-mean recovers most of the gap.
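
The aggregation step can be sketched in plain NumPy. This is a hypothetical helper, not the repo's API: `axis_activations`, the shapes, and the sizes (27 axes, D=64) are illustrative assumptions. The point it demonstrates is that the pooled feature length depends only on n_axes, never on how many patches the input resolution produced.

```python
import numpy as np

def axis_activations(patches, codebook):
    """Project unit-norm patch vectors onto codebook axes.

    patches:  [n_patches, D], rows assumed unit-norm
    codebook: [n_axes, D], unit-norm projective axes
    Returns [n_patches, n_axes] absolute cosine activations
    (absolute value because antipodal axes are equivalent on RP^(D-1)).
    """
    return np.abs(patches @ codebook.T)

def pooled_features(patches, codebook):
    """Patch-mean aggregation: the regime-flat input to a linear head."""
    return axis_activations(patches, codebook).mean(axis=0)

# Toy demo: pooled feature length is fixed by n_axes regardless of patch count.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(27, 64))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

for n_patches in (16, 256):  # e.g. a small vs. a large input resolution
    patches = rng.normal(size=(n_patches, 64))
    patches /= np.linalg.norm(patches, axis=1, keepdims=True)
    feats = pooled_features(patches, codebook)
    assert feats.shape == (27,)
```

A linear head trained on `feats` at one resolution can then be applied unchanged at another, which is exactly the claim the falsifiable prediction below tests.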

Falsifiable prediction. A classifier trained on 64×64 activation patterns achieves comparable accuracy on 512×512 test inputs (within 2 percentage points) without any architectural adaptation. If accuracy drops sharply with input resolution, the codebook activations are not in fact regime-invariant in the way reconstruction is, and Omega covers reconstruction but not classification — a meaningful boundary.


2. Diffusion

The utility. Discrete diffusion in axis-index space. Each patch's M-tensor row gets quantized to its nearest codebook axis (or top-k mixture). The "noise" process is gradual randomization of axis assignments; the "denoise" process is a transformer that predicts axis indices from corrupted sequences. Sampling = run denoiser to clean axis sequence → reconstruct image via codebook → decoder.

Why Omega. Three properties combine here. The codebook is a finite, deterministic vocabulary, so discrete diffusion is well-defined without extra quantizer training. The decoder is regime-flat, so a diffusion model trained on 64×64 axis sequences can sample at any resolution by predicting longer sequences and decoding at the target size. The codebook's projective structure means antipodal axes carry equivalent information, which meaningfully reduces the effective vocabulary size for the diffusion target.

Specific construction. Diffusion target: [n_patches, top_k] discrete indices into codebook. Loss: cross-entropy over axis indices. Backbone: any transformer that handles variable-length token sequences (patch count varies with target resolution). Conditioning: optional class label or text embedding via cross-attention.
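
The forward "noise" process on the diffusion target can be sketched in a few lines. This is a hypothetical helper under one simple choice of discrete noise kernel (independent uniform resampling with probability t); the actual corruption schedule is an open design choice.

```python
import numpy as np

def corrupt_axis_indices(indices, t, n_axes, rng):
    """Forward corruption for discrete diffusion in axis-index space:
    each index is independently resampled uniformly with probability t.

    indices: [n_patches, top_k] integer codebook indices
    t:       corruption level in [0, 1] (t=1 -> fully random sequence)
    """
    mask = rng.random(indices.shape) < t
    random_ids = rng.integers(0, n_axes, size=indices.shape)
    return np.where(mask, random_ids, indices)

rng = np.random.default_rng(1)
clean = rng.integers(0, 27, size=(64, 4))   # [n_patches, top_k]
assert np.array_equal(corrupt_axis_indices(clean, 0.0, 27, rng), clean)
noisy = corrupt_axis_indices(clean, 1.0, 27, rng)
assert noisy.shape == clean.shape
```

The denoiser is then trained with cross-entropy to invert this process at sampled corruption levels, matching the loss described above.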

Falsifiable prediction. A diffusion model trained on 64×64 axis sequences from h2-64 produces coherent samples at 256×256 by sampling longer sequences and decoding at the target size, without retraining. If samples at non-native resolution show mode collapse or boundary artifacts beyond what the encoder-decoder pair produces directly, the codebook's discreteness is interfering with the regime-flat reconstruction — the substrate is narrower than expected.


3. Processing (image-to-image edits in axis space)

The utility. Operations applied to codebook activations rather than pixels. Image → encode → edit activations → decode. Style transfer, denoising, inpainting, semantic editing all become manipulations of the [n_patches, V, n_axes] activation tensor, followed by reconstruction.

Why Omega. Edits made at one resolution are coherent when decoded at another, because the codebook is the same vocabulary at every scale. A 64×64 inpaint mask can produce a 512×512 inpainted output by upsampling the edited activations and decoding at the target size. Critically, the activation edits respect the geometric constraints that produced the codebook — operations that move activations off the codebook produce reconstruction artifacts that are themselves a useful signal.

Specific construction. Define edit operations as activation-tensor transformations: zero-out (denoise), substitute axis-set (style transfer), spatial-gather + redistribute (inpaint), interpolate between two images' activations (semantic morph). Provide a process_at_scale API mirroring reconstruct_at_scale.
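
Two of these edit operations sketched as plain activation-tensor transformations. The names and shapes are illustrative, not the proposed process_at_scale API, and linear interpolation is only one possible morph (a geodesic interpolation on the sphere may be more faithful to the geometry).

```python
import numpy as np

def zero_out(acts, axis_ids):
    """Denoise-style edit: suppress selected codebook axes everywhere.

    acts: [n_patches, V, n_axes] activation tensor (edited copy returned).
    """
    out = acts.copy()
    out[..., axis_ids] = 0.0
    return out

def morph(acts_a, acts_b, alpha):
    """Semantic morph: linear interpolation between two images' activations."""
    return (1.0 - alpha) * acts_a + alpha * acts_b

rng = np.random.default_rng(2)
a = rng.random((16, 1, 27))   # [n_patches, V, n_axes]
b = rng.random((16, 1, 27))
assert np.allclose(morph(a, b, 0.0), a)          # alpha=0 returns image A
assert np.all(zero_out(a, [0, 3])[..., 3] == 0.0)
```

Decoding the edited tensor at a different target size is what makes these edits resolution-portable in the sense claimed above.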

Falsifiable prediction. Style transfer applied to 64×64 activations and decoded at 512×512 produces output indistinguishable in style consistency from the same operation applied directly to a 512×512 encoding. If the upsampled-edit path produces worse style transfer than the direct-encode path, the activation upsampling is losing geometric structure that the encoder captures — and Omega's regime-flatness has a stricter envelope than reconstruction MSE alone reveals.


4. Solving

The utility. The most direct framing: use the trained sphere-solver to solve geometric problems on its native manifold. Given a set of points in ℝ^D, encode them via the model's projection path to get their representation on ℝP^(D-1). Given a set of vectors, solve for the codebook axes that span them. Given two sets of points, find the optimal projective alignment via Procrustes on their codebooks.

Why Omega. This is the closest utility to the model's identity claim. The model is named "sphere-solver" because that's what it is — a parametric solver for "what's the best projective representation of this data on the unit sphere?" The Omega finding is that this solver is regime-independent: the same machinery handles 64 input points or 65,536 input points and produces structurally consistent answers.

Specific construction. Expose three solver primitives:

  • project(points, model) → axes: encode arbitrary point clouds via the model's encoder to get their codebook representation
  • align(codebook_a, codebook_b) → rotation: Procrustes-align two codebooks (already implemented in tests/framework.py)
  • solve_basis(target_vectors, model) → axis_indices: given target vectors, find the codebook axes that best span them
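
A minimal version of the align primitive, assuming the standard orthogonal-Procrustes solution; the implementation in tests/framework.py may use a different convention, so treat this as a stand-in sketch.

```python
import numpy as np

def align(codebook_a, codebook_b):
    """Orthogonal Procrustes: the rotation R minimizing ||A @ R - B||_F.

    codebook_a, codebook_b: [n_axes, D] with unit-norm rows.
    """
    u, _, vt = np.linalg.svd(codebook_a.T @ codebook_b)
    return u @ vt

def rotation_distance(codebook_a, codebook_b):
    """Residual Frobenius norm after the best alignment."""
    r = align(codebook_a, codebook_b)
    return float(np.linalg.norm(codebook_a @ r - codebook_b))

# Sanity check: a rotated copy of a codebook aligns back exactly.
rng = np.random.default_rng(3)
a = rng.normal(size=(27, 64))
a /= np.linalg.norm(a, axis=1, keepdims=True)
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
assert rotation_distance(a, a @ q) < 1e-6
```

The rotation_distance scalar is the quantity the falsifiable prediction below thresholds at 0.1 (same-model) and 0.3 (cross-model).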

Falsifiable prediction. Procrustes alignment between codebooks of the same model on different calibration distributions yields a rotation distance below 0.1 (already verified at U5 — calibration deviations differ by ~0.003). Cross-model alignment between two sphere-solvers trained on the same data yields a rotation distance below 0.3 (predicted, not yet measured). If cross-model alignment turns out to be near-orthogonal random, codebook structure is data-driven, not architecture-driven, and the solver's "intrinsic" status is overstated.


5. Distillation

Two directions, distinct enough to enumerate separately.

5a. Distillation INTO sphere-solvers

The utility. Train a sphere-solver student to match a non-Omega teacher's representations. Student inherits regime-flatness automatically; teacher's representational quality flows into a deployable encoder that handles arbitrary resolution without extra machinery.

Why Omega. Standard distillation produces a student whose behavior interpolates the teacher's at training scale. A sphere-solver student, by virtue of its architecture, additionally inherits regime-flatness — the student behaves consistently at inference scales the teacher was never tested on. This is a distillation result that wouldn't follow from teacher quality alone.

Specific construction. Loss combines reconstruction (the sphere-solver's native objective) with representation matching against the teacher's pooled features at intermediate resolution. Student emerges with both teacher-like representations AND resolution-agnosticism. Teacher candidates: CLIP, DINOv2, Whisper (per the Bertenstein cross-modal alignment work).
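
The combined objective might be sketched as follows. Both the mixing weight lam and the cosine-based matching term are illustrative assumptions, not the repo's actual loss; any representation-matching term against the teacher's pooled features would fill the same slot.

```python
import numpy as np

def distill_loss(recon, target, student_feats, teacher_feats, lam=0.5):
    """Sketch of the 5a objective: native reconstruction plus teacher matching.

    recon, target:                 reconstruction vs. input image
    student_feats, teacher_feats:  pooled representations, [batch, dim]
    lam:                           assumed mixing weight
    """
    rec = np.mean((recon - target) ** 2)                    # native objective
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    match = 1.0 - np.mean(np.sum(s * t, axis=-1))           # cosine matching
    return rec + lam * match
```

The reconstruction term preserves the sphere-solver's native training signal (and with it, the regime-flatness claim); the matching term imports the teacher's representational quality.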

Falsifiable prediction. A sphere-solver student distilled from DINOv2 at 224×224 produces representations that, when evaluated on a standard linear-probe benchmark at 448×448, match or exceed direct DINOv2 at 448×448. If the student degrades at non-training scale the way the teacher does, distillation didn't transfer regime-flatness — it transferred only representational quality, and the architectural Omega property is more fragile than the training-from-scratch case suggests.

5b. Distillation FROM sphere-solvers (codebook freezing)

The utility. Extract a codebook artifact, freeze it, train cheap downstream models that consume codebook activations rather than re-running the encoder. The codebook becomes a portable feature vocabulary; downstream models are 1-2 orders of magnitude smaller.

Why Omega. U5's verdict (as_is_packaging) makes this trivially feasible — codebooks are stable artifacts, model-intrinsic and calibration-insensitive. The downstream model never sees the original encoder; it only sees activation patterns over a fixed vocabulary. Resolution-agnosticism is inherited because the codebook is the same at every scale.

Specific construction. Pipeline: (1) extract codebook once, save as safetensors+JSON. (2) Pre-compute activation patterns for training corpus. (3) Train any standard architecture (MLP, small transformer, CNN) with axis activations as input. Codebook stays frozen forever after step 1.

Falsifiable prediction. Already validated by U5 + the geolip-core pipeline. Failure mode would be: a downstream model trained on codebook activations underperforms an end-to-end model of similar parameter count. Predicted not to fail in the regime-flat use case (where end-to-end models lack regime-flatness anyway), but might fail in the standard fixed-resolution regime where end-to-end has free parameter advantage.


6. Tokenization for downstream LLMs / multimodal models

The utility. The codebook is a discrete vocabulary of size n_axes (typically 27–230). Images → axis activation sequences → discrete tokens fed to autoregressive language models. The geolip-svae becomes an image tokenizer for the existing multimodal-LLM ecosystem.

Why Omega. Three properties matter. Vocabulary size is small compared to standard learned image tokenizers (VQ-VAE typically ~8K-16K codes); axis count being ~30 means a 512-token-budget LLM can attend to ~17 patches, or with top-k=4 mixture per patch, the same budget covers ~128 patches. Resolution-agnosticism means the same tokenizer handles any input image without retraining. Calibration insensitivity means the tokenizer is a fixed component, not a learned-per-task module.

Specific construction. Wrap codebook quantization as a tokenizer class with encode(image) → token_sequence and decode(token_sequence, target_size) → image methods. Define special tokens for image-start, image-end, optionally row-start markers for spatial structure. Integrate via standard transformers/HuggingFace tokenizer interface.
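
A skeletal version of such a tokenizer class. This is a hypothetical sketch: decode would need the sphere-solver decoder, so only the token-side mapping is shown, and the special-token layout (indices above the axis range) is an assumed convention.

```python
import numpy as np

class AxisTokenizer:
    """Codebook quantization wrapped as an LLM-facing tokenizer (sketch)."""

    def __init__(self, codebook):
        self.codebook = codebook              # [n_axes, D], frozen, unit-norm
        self.n_axes = codebook.shape[0]
        self.img_start = self.n_axes          # special tokens above axis range
        self.img_end = self.n_axes + 1
        self.vocab_size = self.n_axes + 2

    def encode(self, patches):
        """patches: [n_patches, D] unit-norm -> flat token sequence.

        Nearest-axis quantization by absolute cosine (projective axes).
        """
        ids = np.argmax(np.abs(patches @ self.codebook.T), axis=1)
        return [self.img_start, *ids.tolist(), self.img_end]

rng = np.random.default_rng(5)
cb = rng.normal(size=(27, 64))
cb /= np.linalg.norm(cb, axis=1, keepdims=True)
tok = AxisTokenizer(cb)
patches = rng.normal(size=(16, 64))
patches /= np.linalg.norm(patches, axis=1, keepdims=True)
tokens = tok.encode(patches)
```

With 27 axes plus two specials, the whole vocabulary fits in 29 ids, which is the token-budget advantage claimed above.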

Falsifiable prediction. A small (~100M param) decoder-only LLM trained on text + axis-token sequences performs image captioning at the same quality as CLIP+LLM with comparable compute. If quality is significantly lower, axis tokenization is losing image content that continuous embeddings preserve, and the discreteness has a real cost. If quality matches, the small vocabulary is a free reduction in token budget for image content.


7. Anomaly / OOD detection

The utility. Self-validating inference. Compute the codebook of the input itself (not the model's reference codebook) and measure deviation from the reference. Inputs whose induced codebook substantially deviates from the model's training-derived codebook are out-of-distribution; the deviation magnitude is the OOD score.

Why Omega. A regime-flat model has a well-defined "in-distribution" surface in codebook space. The is_projective_clean check already captures this internally for codebook validation. Inverted, the same machinery becomes an inference-time validity flag: every prediction ships with a confidence signal derived from the input's geometric compatibility with the codebook.

Specific construction. At inference, extract a per-batch codebook from the input M tensor and compute Procrustes distance to the attached reference codebook. Add to InferenceEngine as engine.validity_score(images) → float and threshold-based engine.predict_with_confidence(images) → (recon, confidence). The throughput sweep already shows MSE ratio is a candidate validity signal — Procrustes distance on a per-batch codebook is the finer-grained version.

Falsifiable prediction. Inputs with codebook Procrustes distance > 0.5 from reference produce reconstructions with MSE > 5× native floor. If correlation between codebook deviation and reconstruction quality is weak (correlation < 0.5), the codebook deviation is measuring something independent of model competence, and it isn't a useful inference-time validity signal.


8. Cross-modal alignment

The utility. Multiple sphere-solvers trained on different modalities (image, audio, text-as-noise) project into compatible codebook spaces after Procrustes alignment. Cross-modal retrieval, joint generation, and modality translation operate in shared axis space rather than via a learned joint embedding.

Why Omega. The Bertenstein work demonstrated this with frozen expert encoders projecting through a shared text hub. Today's finding strengthens the claim: cross-modal alignment is between codebooks (deterministic artifacts) rather than between learned projections. Each modality's sphere-solver produces a codebook on its own ℝP^(D-1); alignment is a fixed rotation, not a trained mapping.

Specific construction. Train sphere-solvers per modality. Extract codebooks. Compute pairwise Procrustes alignments to a chosen reference modality. At inference, project inputs through their native sphere-solver, apply the cross-modal rotation, and operate in shared axis space. No joint training required after the per-modality stage.

Falsifiable prediction. Image-text retrieval via codebook alignment matches CLIP-style joint-embedding retrieval at comparable compute on standard benchmarks (MS-COCO, Flickr30K). If retrieval is significantly worse, cross-modal information lives in the relations between codebook activations rather than in the codebooks themselves, and the alignment-only approach is missing structure that joint training captures.


9. Self-supervised pretraining recipes

The utility. Bootstrap foundation models on structured noise alone. The h2-64 batteries already train on noise distributions and develop projective-clean codebooks; this generalizes to a recipe for training sphere-solver foundation models without curated real-world data.

Why Omega. The projective-axis codebook emerges deterministically from sphere-normalized SVD training, regardless of input distribution (per U5: gaussian and sixteen-noise calibrations produce essentially identical codebooks for the same model). The model's geometric substrate is largely independent of training corpus identity. This suggests a useful inverse: a foundation model can be pretrained on synthetic/structured noise and then fine-tuned to specific modalities via the cross-modal alignment recipe (Section 8).

Specific construction. Define a noise curriculum that exercises the geometric primitives — gaussian, fractal, structured-but-random, adversarial noise. Train sphere-solver to high reconstruction quality on this curriculum. Verify the codebook is projective-clean (built-in quality check). Release as foundation model.
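
The curriculum generator could start from something as simple as the following. The kind names and recipes are placeholders covering three of the four primitives (adversarial noise needs a model in the loop and is omitted); the actual h2-64 noise batteries may differ.

```python
import numpy as np

def noise_curriculum_batch(kind, shape, rng):
    """Generate one 2D sample from a toy noise curriculum (sketch)."""
    if kind == "gaussian":
        return rng.normal(size=shape)
    if kind == "structured":
        # Structured-but-random: low-rank structure plus a small noise floor.
        h, w = shape
        basis = rng.normal(size=(h, 4))
        coeffs = rng.normal(size=(4, w))
        return basis @ coeffs + 0.1 * rng.normal(size=shape)
    if kind == "fractal":
        # 1/f-style spectrum: filter white noise in the frequency domain.
        white = rng.normal(size=shape)
        f = (np.fft.fftfreq(shape[0])[:, None] ** 2
             + np.fft.fftfreq(shape[1])[None, :] ** 2)
        spectrum = np.fft.fft2(white) / np.sqrt(f + 1e-4)
        return np.real(np.fft.ifft2(spectrum))
    raise ValueError(kind)

rng = np.random.default_rng(7)
for kind in ("gaussian", "structured", "fractal"):
    assert noise_curriculum_batch(kind, (32, 32), rng).shape == (32, 32)
```

Each kind stresses a different geometric primitive: isotropy, low-rank structure, and long-range spatial correlation respectively.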

Falsifiable prediction. A sphere-solver foundation model pretrained on noise alone, fine-tuned on ImageNet via 1% of the parameters (a small adapter on top of the frozen encoder), matches or exceeds equivalent-compute models pretrained directly on ImageNet. If noise-pretraining produces worse downstream performance than ImageNet-pretraining at fixed compute, the geometric substrate isn't sufficient on its own — there's content in real-world distributions the model needs to see during pretraining to learn effectively.


10. Continual learning / model-merging

The utility. Codebooks from independently-trained models are comparable artifacts. Merging two models = aligning their codebooks via Procrustes, optionally extending the joint axis set to cover union-of-features. Continual learning becomes "extend the codebook when novel structure appears" rather than "retrain to incorporate new data."

Why Omega. Model identity in the geolip-svae family is largely captured by the codebook (calibration insensitivity confirms this). Two models trained on different distributions but the same architecture have different codebooks; aligning them via Procrustes gives a principled way to combine them without the parameter interference that plagues standard model-merging methods.

Specific construction. Operations on Codebook artifacts:

  • Codebook.merge(other) → Codebook: union of axes after Procrustes alignment, with antipodal-pair re-collapse to deduplicate
  • Codebook.diff(other) → axes: axes in self that don't have a near-equivalent in other after alignment — the novel structure
  • Codebook.extend(novel_axes) → Codebook: append new axes, re-validate projective-cleanness
  • Continual learning loop: train, extract codebook, diff against prior codebook, decide whether to keep new axes, re-emit updated codebook.
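
The merge operation sketched under these assumptions: Procrustes-align the second codebook into the first's frame, append its axes, and deduplicate by absolute cosine (which collapses antipodal pairs for free). The function name and the 0.95 dedup threshold are illustrative, not the Codebook API.

```python
import numpy as np

def merge_codebooks(cb_a, cb_b, dedup_cos=0.95):
    """Union of two codebooks after Procrustes alignment (sketch).

    cb_a, cb_b: [n_axes, D] unit-norm rows. An axis of cb_b is kept only
    if no existing axis matches it up to sign (|cos| >= dedup_cos).
    """
    u, _, vt = np.linalg.svd(cb_b.T @ cb_a)
    cb_b_aligned = cb_b @ (u @ vt)            # rotate b into a's frame
    merged = list(cb_a)
    for axis in cb_b_aligned:
        if np.max(np.abs(np.asarray(merged) @ axis)) < dedup_cos:
            merged.append(axis)
    return np.stack(merged)

rng = np.random.default_rng(8)
a = rng.normal(size=(27, 64))
a /= np.linalg.norm(a, axis=1, keepdims=True)
# Self-merge and antipodal-merge both deduplicate back to the original size.
assert merge_codebooks(a, a).shape == a.shape
assert merge_codebooks(a, -a).shape == a.shape
```

Codebook.diff is the complement of the same test: the axes of cb_b_aligned that fail the |cos| match are the novel structure.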

Falsifiable prediction. Two h2-64 batteries (different noise distributions) merge into a combined codebook with deviation in the 0.20–0.23 CV band. If the merge produces a codebook that fails projective-cleanness, the two codebooks live on incompatible projective subspaces and merging is not just a Procrustes alignment — there's content-level interference that requires retraining.


What this clause does NOT cover

Excluded by methodology — these are useful applications of geolip-svae but do not depend on the Omega substrate in a load-bearing way:

  • Standard feature extraction for downstream tasks where the input resolution and modality are fixed. Any encoder can do this; nothing Omega-dependent.
  • Adversarial robustness as a downstream goal. Possibly correlated with codebook quality but not enabled by it specifically.
  • Reinforcement learning state representations. The geometric substrate provides nothing the RL community can't get from a standard VAE.
  • Generative pretraining for autoregressive language modeling. Sphere-solvers are not autoregressive; the pathway from this substrate to LLM pretraining is speculative.

Build-order considerations

If utilities will be built in sequence rather than parallel, the priority ordering by information value per build is:

  1. §7 OOD detection — already mostly present in the codebook machinery, easiest to ship. Validates the validity-flag framing from this morning's pivot.
  2. §5b distillation FROM sphere-solvers — also mostly present, needs only API wrapping. Demonstrates the codebook as a portable artifact for the public release.
  3. §4 solving primitives — exposes the model's identity claim directly. The project / align / solve_basis triple is a clean API surface.
  4. §1 classification — first non-trivial test of regime-flatness beyond reconstruction. The falsifiable prediction is sharp.
  5. §6 tokenization — bridge to mainstream multimodal architectures. Higher build cost but high impact for adoption.
  6. §8 cross-modal alignment — extends Bertenstein under the new framing. Build cost is moderate; depends on having multiple modality-specific sphere-solvers trained.
  7. §5a distillation INTO sphere-solvers — significant training investment. Defer until after smaller utilities validate.
  8. §2 diffusion — substantial build, novel pathway, high uncertainty. Worth doing once the codebook artifact patterns are mature.
  9. §9 self-supervised pretraining — biggest investment, most speculative, but if it works it's the largest payoff.
  10. §3 processing — depends on §1 + §2 maturity for activation edits to be principled. Last in sequence.
  11. §10 model-merging — research utility rather than deployment utility. Useful when there are many trained sphere-solvers to consolidate.

The first three are all near-term and reuse existing machinery; together they constitute a release-ready feature set. The remainder are the multi-month research agenda.