Multi-Xi SPLM with Structured V_theta (SQ3 Mixture of Quadratic Wells)

A structured variant of the Multi-Xi SPLM conservative language model in which the MLP scalar potential $V_\theta$ is replaced by a mixture of K=8 diagonal quadratic wells (SQ3). This replacement yields:

  • Full analytical gradients: no torch.autograd.grad call is needed for the conservative force, which eliminates the second-order computation graph (~2x speedup on force evaluation)
  • Explicit attractor centres: the 8 semantic attractors $\mu_k(\xi)$ are readable directly from the model parameters, with no gradient-descent extraction required
  • Interpretable basin structure: attractor decoding reveals semantic specialisation, with story-opening basins, family-dialogue basins, punctuation-in-quotes basins, and function-word basins

The trade-off is a modest expressivity ceiling: 13.33 PPL vs the MLP baseline's 11.51 PPL (5.5% excess cross-entropy), reflecting the fundamental limitation of Gaussian/quadratic potential wells compared to an unrestricted MLP.

This model is from the Semantic Simulation framework.

When to Use This Model

Choose this structured variant over the MLP-based Multi-Xi SPLM when:

| Priority | Structured V_theta (this model) | MLP V_theta (baseline) |
| --- | --- | --- |
| Interpretability | 8 explicit attractor centres, zero-cost basin readout | Black-box; requires 1,500-step GD extraction per prompt |
| Inference speed | ~2x faster force computation (analytical gradient) | Standard (autograd) |
| Raw PPL | 13.33 | 11.51 |
| Memory | No second-order graph for V_theta | Full graph retained |

Rule of thumb: if you need to inspect what the model has learned (basin structure, attractor semantics, force-field geometry) or you need fast inference, use this model. If you need the best possible perplexity, use the MLP baseline.

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K_xi=4 channels]
       |
       +-- Structured V_theta:
       |     xi_flat = flatten(xi_1..xi_K)                      [K_xi * d = 1024]
       |     V = -tau * logsumexp_k(-E_k/tau + log pi_k)        [K_mix=8 wells]
       |     where E_k = 0.5 * a_k(xi)^T (h - mu_k(xi))^2
       |
       +-- Analytical force: f = -grad_h V                      [closed-form, no autograd]
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
       |
       +-- LayerNorm(h)
       |
   Logits = h @ E^T                                            [tied embeddings]

| Parameter | Value |
| --- | --- |
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| V_theta kind | SQ3 (mixture of K quadratic wells) |
| Mixture components (K_mix) | 8 |
| Temperature (tau) | 1.0 |
| Xi channels (K_xi) | 4 |
| V_theta input dim | (K_xi + 1) * d = 1280 |
| V_theta parameters | 4,207,625 |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping gamma | 0.30 (fixed) |
| lambda_V (output regularisation) | 0.01 |
| Total parameters | 17,335,568 |
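The integration loop in the diagram can be sketched in NumPy as follows. This is a minimal illustration, not the repository code: the EMA convention (`xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t`), the step size `dt`, the unit mass, the parameter-free LayerNorm, and the toy force that stands in for `-grad_h V` are all assumptions.

```python
import numpy as np

def causal_ema(h, alpha):
    """Causal exponential moving average along the time axis.

    h: (T, d) hidden states. Assumed convention (not confirmed by the
    model card): xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t, xi_0 = h_0.
    """
    xi = np.empty_like(h)
    xi[0] = h[0]
    for t in range(1, h.shape[0]):
        xi[t] = alpha * xi[t - 1] + (1.0 - alpha) * h[t]
    return xi

def layer_norm(h, eps=1e-5):
    """Parameter-free LayerNorm over the feature axis."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def damped_euler_step(h, v, force, m=1.0, gamma=0.30, dt=1.0):
    """One update from the diagram:
    v += dt*f/m; v /= (1 + dt*gamma); h += dt*v; h = LayerNorm(h)."""
    v = v + dt * force / m
    v = v / (1.0 + dt * gamma)
    h = h + dt * v
    return layer_norm(h), v

rng = np.random.default_rng(0)
T, d = 6, 8
h = rng.normal(size=(T, d))
v = np.zeros_like(h)
for _ in range(8):                      # L = 8 integration steps
    xi = causal_ema(h, alpha=0.5)       # one xi channel only, for brevity
    force = -(h - xi)                   # toy stand-in for -grad_h V
    h, v = damped_euler_step(h, v, force)
```

Dividing the velocity by `(1 + dt*gamma)` treats the damping term implicitly, which keeps the update stable for any non-negative `gamma`.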

Structured V_theta: How It Works

The standard SPLM uses an MLP to compute $V_\theta(\xi, h) \to \mathbb{R}$ and obtains the conservative force via torch.autograd.grad. This model replaces the MLP with a mixture of 8 diagonal quadratic energy wells:

$$E_k(\xi, h) = \frac{1}{2} a_k(\xi)^\top (h - \mu_k(\xi))^2$$

$$V_\theta(\xi, h) = -\tau \log \sum_{k=1}^{K} \pi_k(\xi) \exp(-E_k(\xi, h) / \tau)$$

where $\mu_k(\xi)$ are the attractor centres, $a_k(\xi) > 0$ are the diagonal precisions, $\pi_k(\xi)$ are mixing weights, and the square in $E_k$ is taken elementwise. All are linear projections of the flattened xi context.
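The two definitions above translate into a few lines of NumPy. The sketch below evaluates the well energies and the mixture potential for a single hidden vector, taking `mu`, `a` and `log_pi` as given arrays; how the model produces them from the xi context is not shown, and the function names are illustrative only.

```python
import numpy as np

def well_energies(h, mu, a):
    """E_k = 0.5 * sum_j a[k, j] * (h[j] - mu[k, j])**2 for each well k.

    h: (d,), mu: (K, d), a: (K, d) with a > 0. Returns shape (K,).
    """
    return 0.5 * np.sum(a * (h - mu) ** 2, axis=-1)

def mixture_potential(h, mu, a, log_pi, tau=1.0):
    """V = -tau * logsumexp_k(-E_k / tau + log pi_k), computed stably."""
    z = -well_energies(h, mu, a) / tau + log_pi
    z_max = np.max(z)
    return -tau * (z_max + np.log(np.sum(np.exp(z - z_max))))

# Toy example: K = 2 wells in d = 3 dimensions with equal mixing weights.
mu = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]])
a = np.ones((2, 3))
log_pi = np.log(np.array([0.5, 0.5]))
V_at_centre = mixture_potential(mu[0], mu, a, log_pi)        # close to log(2)
V_midway = mixture_potential(np.full(3, 2.0), mu, a, log_pi)  # equals 6
```

Subtracting `z_max` before exponentiating is the usual log-sum-exp trick; it avoids underflow when every well energy is large.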

The force is computed in closed form:

$$f = -\nabla_h V_\theta = -\sum_{k=1}^{K} q_k(\xi, h) \cdot a_k(\xi) \odot (h - \mu_k(\xi))$$

where $q_k$ are the softmax responsibilities. No autograd.grad call is needed; parameter gradients still flow through the normal loss.backward() chain.
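As a sanity check on the closed form, the following NumPy sketch compares the analytical force with a central finite-difference estimate of `-grad_h V` on random toy parameters. It is an independent illustration under assumed shapes (`mu`, `a` of shape `(K, d)`, one hidden vector `h`), not the model's own implementation.

```python
import numpy as np

def potential(h, mu, a, log_pi, tau=1.0):
    """V = -tau * logsumexp_k(-E_k / tau + log pi_k)."""
    energies = 0.5 * np.sum(a * (h - mu) ** 2, axis=-1)
    z = -energies / tau + log_pi
    z_max = np.max(z)
    return -tau * (z_max + np.log(np.sum(np.exp(z - z_max))))

def responsibilities(h, mu, a, log_pi, tau=1.0):
    """Softmax over wells of (-E_k / tau + log pi_k). Returns shape (K,)."""
    energies = 0.5 * np.sum(a * (h - mu) ** 2, axis=-1)
    z = -energies / tau + log_pi
    q = np.exp(z - np.max(z))           # shift for numerical stability
    return q / np.sum(q)

def analytic_force(h, mu, a, log_pi, tau=1.0):
    """f = -sum_k q_k * a_k * (h - mu_k), elementwise over features."""
    q = responsibilities(h, mu, a, log_pi, tau)
    return -np.sum(q[:, None] * a * (h - mu), axis=0)

def finite_difference_force(h, mu, a, log_pi, tau=1.0, eps=1e-5):
    """Central-difference estimate of -grad_h V, used only as a check."""
    grad = np.zeros_like(h)
    for j in range(h.size):
        step = np.zeros_like(h)
        step[j] = eps
        grad[j] = (potential(h + step, mu, a, log_pi, tau)
                   - potential(h - step, mu, a, log_pi, tau)) / (2 * eps)
    return -grad

rng = np.random.default_rng(0)
K, d = 8, 16
mu = rng.normal(size=(K, d))
a = np.exp(rng.normal(size=(K, d)))     # positive diagonal precisions
log_pi = np.log(np.full(K, 1.0 / K))
h = rng.normal(size=d)

f_closed = analytic_force(h, mu, a, log_pi)
f_numeric = finite_difference_force(h, mu, a, log_pi)
max_abs_error = np.max(np.abs(f_closed - f_numeric))
```

The temperature enters the force only through the responsibilities: the factors of `tau` from the outer `-tau * log` and the inner `exp(-E_k / tau)` cancel.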

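At $\tau = 1$, this potential is the negative log marginal likelihood of a K-component Gaussian mixture with diagonal covariances (up to per-component normalisation factors that can be absorbed into the mixing weights), connecting the SPLM framework directly to the Gaussian well motivation of the paper (Section 4).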
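To make the correspondence explicit, the short derivation below writes each Boltzmann factor as a scaled Gaussian density. It assumes the diagonal covariance $\Sigma_k = \mathrm{diag}(a_k)^{-1}$; the constants $c_k$ and rescaled weights $w_k$ are notation introduced here for illustration and do not appear in the model.

```latex
\begin{aligned}
\exp\!\big(-E_k(\xi,h)\big)
  &= \exp\!\Big(-\tfrac{1}{2}\,(h-\mu_k)^\top \mathrm{diag}(a_k)\,(h-\mu_k)\Big)
   = c_k\,\mathcal{N}\!\big(h;\ \mu_k,\ \mathrm{diag}(a_k)^{-1}\big),
\qquad
c_k = \frac{(2\pi)^{d/2}}{\prod_{j=1}^{d} a_{k,j}^{1/2}},\\[4pt]
V_\theta(\xi,h)\big|_{\tau=1}
  &= -\log \sum_{k=1}^{K} \pi_k\,c_k\,\mathcal{N}\!\big(h;\ \mu_k,\ \mathrm{diag}(a_k)^{-1}\big)
   = -\log \sum_{k=1}^{K} w_k\,\mathcal{N}\!\big(h;\ \mu_k,\ \mathrm{diag}(a_k)^{-1}\big),
\qquad w_k = \pi_k\,c_k .
\end{aligned}
```

The weights $w_k$ do not sum to one in general. Dividing them by $\sum_j w_j$ shifts $V_\theta$ by $-\log \sum_j w_j$, which depends on $\xi$ but not on $h$, so the force $-\nabla_h V_\theta$ is unchanged.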

For full derivations, all four structured variants (SQ1--SQ4), cost analysis, attractor basin analysis, and hyperparameter selection strategies, see the companion note: Structured_VTheta_Design_and_Theory.md.

How to Get Started

import torch, sys
sys.path.insert(0, "multixi")
sys.path.insert(0, "parf")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from multixi.model_multixi import (
    ScalarPotentialLMSARFMassLNMultiXi,
    SPLMSARFMassLNMultiXiConfig,
)
from parf.model_structured_vtheta import MixtureQuadraticVTheta
from parf.model_structured_vtheta_multixi import StructuredVThetaMultiXiAdapter

# -- Build base model --
config = SPLMSARFMassLNMultiXiConfig(
    vocab_size=50257, d=256, n_layers=8,
    v_hidden=1024, v_depth=3,
    max_len=1024, block_size=512,
    gamma=0.30, xi_channels=4,
    xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
    xi_learnable=True, mass_mode="logfreq",
    logfreq_path="logfreq_surprisal_tinystories.npy",
)
model = ScalarPotentialLMSARFMassLNMultiXi(config)

# -- Swap in structured V_theta --
K_xi, d = 4, 256
inner = MixtureQuadraticVTheta(d=K_xi * d, K=8, tau=1.0)
model.V_theta = StructuredVThetaMultiXiAdapter(inner, K=K_xi, d=d)

# -- Load checkpoint --
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-splm-multixi-structured-vtheta",
    filename="checkpoint/ckpt_best.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# -- Read attractor centres directly --
# (no gradient descent needed!)
x = torch.randint(0, 50257, (1, 64))
with torch.no_grad():
    h = model.embedding(x) + model.pos_enc[:, :64]
    xis = model._compute_xis(h)                        # (1, 64, 4, 256)
    centres = model.V_theta.attractor_centres(xis)      # (1, 64, 8, 1024)
    print(f"Attractor centres shape: {centres.shape}")   # 8 basins per token

Available Artifacts

| File | Description |
| --- | --- |
| checkpoint/ckpt_best.pt | Best checkpoint (A2 arm, 13.33 PPL) |
| training_log.jsonl | Per-step training metrics |
| training_curve_A2.png | Training/validation loss curves |
| v_theta_hist_A2.png | V_theta output distribution histogram |
| landscape_stats_A2.json | V_theta landscape statistics (mean, std, range) |
| attractors_A2.json | Decoded attractor centres for 5 prompt types |
| summary_A2.md | Hyperparameters and final metrics |
| model_structured_vtheta.py | Structured V_theta classes (SQ1--SQ4) |
| model_structured_vtheta_multixi.py | Multi-Xi adapter |
| config.json | Model configuration |

Training Details

Training Data

TinyStories: a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.

Training Procedure

The base model architecture is identical to the Multi-Xi SPLM. The only modification is the $V_\theta$ replacement: the 3-layer MLP is swapped for the SQ3 mixture at model construction time, before training begins from scratch.

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 |
| lambda_V (V_theta regularisation) | 0.01 |
| Hardware | NVIDIA A100 40GB (Google Colab) |
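For readers who want to reproduce the learning-rate curve, here is a minimal sketch of a warmup-plus-cosine schedule using the values in the table. The linear warmup shape and the final rate `min_lr=0.0` are assumptions; the table only gives the peak rate, the warmup length, and the total step count.

```python
import math

def lr_at_step(step, base_lr=5e-4, warmup_steps=400, total_steps=16_000, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay.

    Assumed: linear warmup and decay down to `min_lr` at `total_steps`.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# One value per optimizer step, e.g. for plotting next to the training curve.
schedule = [lr_at_step(s) for s in range(16_000)]
```

In PyTorch the same function can be passed to `torch.optim.lr_scheduler.LambdaLR` after dividing by `base_lr`, since that scheduler expects a multiplicative factor.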

Sweep Arms

Six arms were trained. This checkpoint is from A2, the best-performing arm:

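| Arm | K_mix | tau | K_xi | Best PPL | V_theta params |
| --- | --- | --- | --- | --- | --- |
| A1 | 4 | 1.0 | 4 | 14.10 | 2.1M |
| A2 | 8 | 1.0 | 4 | 13.33 | 4.2M |
| A3 | 4 | 0.5 | 4 | 14.10 | 2.1M |
| A4 | 4 | 1.0 | 8 | 14.16 | 4.2M |
| A5 | 8 | 1.0 | 8 | 14.25 | 8.4M |
| B1 (MLP ref.) | - | - | 4 | 11.51 | - |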

Training Script

notebooks/conservative_arch/scaleup/colab_splm_multixi_structured_vtheta.ipynb: Colab notebook with all 6 arms, Google Drive output, checkpointing, and live progress display.

Evaluation Results

TinyStories Validation Perplexity

| Model | PPL | Params | Analytical grad | Explicit attractors |
| --- | --- | --- | --- | --- |
| Matched Attention (baseline) | 7.81 | 19.5M | - | - |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | No | No |
| Multi-Xi SPLM (MLP) | 11.51 | 16.5M | No | No |
| Multi-Xi PARFLM | 12.06 | 17.6M | No | No |
| Multi-Xi SPLM (SQ3, this model) | 13.33 | 17.3M | Yes | Yes (8 basins) |
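The PPL column is the exponential of the mean per-token cross-entropy. The helper below is a minimal sketch of that relationship using standard definitions, with natural-log units assumed; it is not the evaluation script used for the table.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood in nats per token)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def cross_entropy_from_ppl(ppl):
    """Inverse map: mean cross-entropy in nats for a given perplexity."""
    return math.log(ppl)

# A model that assigns probability 1/4 to every target token has PPL 4.
uniform_nlls = [math.log(4.0)] * 10
ppl = perplexity(uniform_nlls)
```

Because the map is exponential, a fixed gap in cross-entropy corresponds to a fixed ratio of perplexities rather than a fixed difference.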

V_theta Landscape Statistics

| Metric | Value |
| --- | --- |
| Mean V_theta | 99.8 |
| Std V_theta | 26.3 |
| Range | 644.9 |
| Min | 9.0 |
| Max | 653.9 |

Attractor Basin Analysis

The 8 mixture components develop semantic specialisation during training. Projecting each attractor centre $\mu_k(\xi)$ through the LM head on five TinyStories prompt types reveals:

  • Story-opening basin (narrative, basin 8): "lived" at 53.7% probability, followed by "wanted", "asked", "said"
  • Family-dialogue basin (dialogue, basin 1): "dad" at 19.3%, "mom" at 15.0%, "it" at 6.6%
  • Punctuation-in-quotes basin (dialogue, basin 7): ',"' at 19.9%, '!"' at 3.6%, '."' at 3.2%
  • Function-word basins: several basins decode to "of", "to", "and" at 6–9% each
  • Unused basins (~3 of 8): decode to near-uniform distributions (all top-5 probabilities less than 5e-5), indicating the model self-selects $K_{\text{eff}} \approx 5$

This basin structure is readable directly from the model parameters; no gradient-descent extraction is needed.
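The decoding step described above can be sketched on toy data as follows. Several things here are assumptions rather than the repository's procedure: the centres are taken to live in the same d-dimensional space as the token embeddings, the LM head is the tied embedding matrix with no extra normalisation, and the `5e-5` threshold for flagging unused basins is borrowed from the figure quoted in the list. `decode_centres` and `active_basins` are hypothetical helper names.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def decode_centres(centres, embedding, top_k=5):
    """Project attractor centres through a tied LM head.

    centres: (K, d); embedding: (vocab, d). Returns (token_ids, probs),
    each of shape (K, top_k), sorted by decreasing probability.
    """
    probs = softmax(centres @ embedding.T)                  # (K, vocab)
    token_ids = np.argsort(-probs, axis=-1)[:, :top_k]
    return token_ids, np.take_along_axis(probs, token_ids, axis=-1)

def active_basins(top_probs, threshold=5e-5):
    """Indices of basins whose most likely token exceeds `threshold`."""
    return np.flatnonzero(top_probs[:, 0] > threshold)

# Toy data: unit-norm random embeddings over a GPT-2-sized vocabulary.
rng = np.random.default_rng(0)
vocab, d = 50_257, 32
embedding = rng.normal(size=(vocab, d))
embedding /= np.linalg.norm(embedding, axis=-1, keepdims=True)

# Centres 0 and 1 point along tokens 7 and 42; centre 2 sits at the origin,
# so all of its logits are zero and it decodes to the uniform distribution.
centres = np.stack([20.0 * embedding[7], 20.0 * embedding[42], np.zeros(d)])
token_ids, top_probs = decode_centres(centres, embedding)
used = active_basins(top_probs)
```

With a 50,257-token vocabulary the uniform probability is about 2e-5, so a centre that decodes to a near-uniform distribution falls below the `5e-5` threshold and is reported as unused.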

Expressivity Ceiling

The 1.82 PPL gap between this model (13.33) and the MLP baseline (11.51) reflects a fundamental expressivity ceiling of the quadratic-well family:

  1. Gaussian equivalence. Each quadratic energy well corresponds, via the Boltzmann factor, to a Gaussian distribution. The mixture of K wells is therefore equivalent to a K-component GMM, and the number of components needed can grow exponentially with the intrinsic complexity of the distribution being modelled.

  2. Force-field linearity. Within each basin, the restoring force of a single well, $-\nabla_h E_k = -a_k \odot (h - \mu_k)$, is linear (affine) in h, whereas an MLP $V_\theta$ can generate arbitrarily nonlinear force fields.

  3. No inter-basin gradients. Outside all basins (transition regions), the SQ3 force is a responsibility-weighted average of linear forces, so it is still fundamentally linear in h within each local region. The MLP captures sharp transitions that the soft mixture cannot.
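Points 2 and 3 can be checked numerically on a toy two-well mixture. The sketch below measures how far the force deviates from an affine map with a midpoint test: for an affine field, the force at the midpoint of two points equals the average of the forces at the endpoints. The well positions and test points are arbitrary choices for illustration.

```python
import numpy as np

def mixture_force(h, mu, a, log_pi, tau=1.0):
    """Closed-form force f = -sum_k q_k * a_k * (h - mu_k)."""
    energies = 0.5 * np.sum(a * (h - mu) ** 2, axis=-1)
    z = -energies / tau + log_pi
    q = np.exp(z - z.max())
    q /= q.sum()
    return -np.sum(q[:, None] * a * (h - mu), axis=0)

def midpoint_defect(h1, h2, mu, a, log_pi):
    """|| f((h1+h2)/2) - (f(h1)+f(h2))/2 ||; zero for an affine force field."""
    mid = mixture_force(0.5 * (h1 + h2), mu, a, log_pi)
    avg = 0.5 * (mixture_force(h1, mu, a, log_pi) + mixture_force(h2, mu, a, log_pi))
    return np.linalg.norm(mid - avg)

d = 4
mu = np.stack([np.zeros(d), np.full(d, 6.0)])   # two well centres
a = np.ones((2, d))                             # unit diagonal precisions
log_pi = np.log(np.array([0.5, 0.5]))

# Both points deep inside basin 0: the force is affine to machine precision.
inside = midpoint_defect(np.full(d, -0.2), np.full(d, 0.1), mu, a, log_pi)
# Points in different basins: the responsibilities switch between them.
across = midpoint_defect(np.zeros(d), np.full(d, 5.0), mu, a, log_pi)
# A single well is affine everywhere.
single = midpoint_defect(np.zeros(d), np.full(d, 5.0), mu[:1], a[:1], np.zeros(1))
```

In this toy setup `inside` and `single` are zero up to rounding, while `across` is of order one: the only nonlinearity available to the mixture is the smooth change in responsibilities between basins.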

The gap is small enough (5.5% excess cross-entropy) that structured $V_\theta$ remains attractive for applications that value interpretability or computational efficiency over the last few PPL points.

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

| Model | Design | PPL | HuggingFace |
| --- | --- | --- | --- |
| Multi-Xi SPLM (MLP) | Pure scalar potential | 11.51 | semsimula-splm-multixi |
| Multi-Xi SPLM (SQ3) | Structured scalar potential | 13.33 | this model |
| Multi-Xi PARFLM | Scalar + pairwise forces | 12.06 | semsimula-parflm-multixi |
| Fock-PARFLM v2.1 | PARFLM + Fock registers | 9.30 | semsimula-fock-parflm |
| Hybrid SPLM+Attn | Attention + SPLM refinement | 8.50 | semsimula-hybrid-splm |

Collection: Semantic Simulation SPLM Model Family

Bias, Risks, and Limitations

  • Research checkpoint only. This model is a proof-of-concept for structured scalar potentials, not a production system.
  • TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
  • English only. No multilingual capability.
  • Small scale. 17.3M parameters, 256-dim hidden states.
  • No safety training. No RLHF, DPO, or safety filtering has been applied.
  • Expressivity ceiling. The structured $V_\theta$ has a demonstrated 5.5% excess cross-entropy vs the MLP baseline. Applications requiring the lowest possible perplexity should use the MLP variant.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

  • Hardware: NVIDIA A100 40GB (Google Colab)
  • Training time: ~1.5 hours (16,000 steps, A2 arm)
  • Carbon footprint: Estimated less than 1 kg CO2