Multi-Xi PARFLM with Structured V_theta (SQ3 Mixture of Quadratic Wells)

A structured variant of the Multi-Xi PARFLM conservative language model in which the MLP scalar potential \(V_\theta\) is replaced by a mixture of K=8 diagonal quadratic wells (SQ3), while retaining the full MLP-based pairwise potential \(V_\phi\). This replacement yields:

  • Near-zero PPL cost: 12.27 PPL vs the MLP baseline's 12.10 PPL (a gap of only 0.17 PPL, or 0.6% excess cross-entropy). The pairwise \(V_\phi\) absorbs nearly all of the nonlinear structure that the quadratic wells cannot express.
  • Full analytical gradients for \(V_\theta\): no torch.autograd.grad call is needed for the scalar-potential force, which eliminates the second-order computation graph for the \(V_\theta\) component.
  • Explicit attractor centres: the 8 semantic attractors \(\mu_k(\xi)\) can be read directly from the model parameters, with no gradient-descent extraction required.
  • Compressed \(V_\theta\) landscape: with \(V_\phi\) carrying the bulk of the force budget, \(V_\theta\) collapses to a near-flat bias field (mean 0.02; range 31.4, vs 644.9 for SPLM).

This model is from the Semantic Simulation framework.

When to Use This Model

Choose this structured variant over the MLP-based Multi-Xi PARFLM when:

| Priority | Structured V_theta (this model) | MLP V_theta (baseline) |
|---|---|---|
| Interpretability | 8 explicit attractor centres, zero-cost basin readout | Black-box; requires 1,500-step GD extraction per prompt |
| Inference speed | ~2x faster V_theta force computation (analytical gradient) | Standard (autograd for both V_theta and V_phi) |
| Raw PPL | 12.27 | 12.10 |
| PPL gap | 0.17 PPL (0.6%), essentially free | |
| Memory | No second-order graph for V_theta (V_phi graph retained) | Full graph for both |

Bottom line: for PARFLM, structured \(V_\theta\) is the recommended default. The 0.17 PPL gap is negligible, while the interpretability and speed gains are substantial.
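
The 0.6% figure can be reproduced from the two perplexities, assuming it denotes the relative increase in per-token cross-entropy (the natural logarithm of perplexity). The sketch below is only that arithmetic; the function name is illustrative and not part of the repository.

```python
import math


def excess_cross_entropy(ppl_model: float, ppl_baseline: float) -> float:
    """Relative increase in cross-entropy (nats/token) implied by two perplexities."""
    ce_model = math.log(ppl_model)        # cross-entropy of the structured model
    ce_baseline = math.log(ppl_baseline)  # cross-entropy of the MLP baseline
    return ce_model / ce_baseline - 1.0


gap_ppl = 12.27 - 12.10
gap_ce = excess_cross_entropy(12.27, 12.10)
print(f"PPL gap: {gap_ppl:.2f}, excess cross-entropy: {100 * gap_ce:.1f}%")
```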

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K_xi=4 channels]
       |
       +-- Structured V_theta (SQ3):
       |     xi_flat = flatten(xi_1..xi_K)                      [K_xi * d = 1024]
       |     V = -tau * logsumexp_k(-E_k/tau + log pi_k)        [K_mix=8 wells]
       |     f_theta = -analytical_grad_h V                     [closed-form]
       |
       +-- Pairwise V_phi (competitive structural MLP):
       |     scores = score_net(h_t, h_s)                       [for all s <= t]
       |     top-k selection via Gumbel-softmax                 [k=8 neighbours]
       |     f_phi = -grad_h V_phi(h_t, h_s)                   [autograd, sparse]
       |
       +-- Total force: f = f_theta + f_phi
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
       |
       +-- LayerNorm(h)
       |
   Logits = h @ E^T                                            [tied embeddings]

| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| V_theta kind | SQ3 (mixture of K quadratic wells) |
| Mixture components (K_mix) | 8 |
| Temperature (tau) | 1.0 |
| Xi channels (K_xi) | 4 |
| V_phi kind | structural_competitive |
| V_phi hidden | 128 |
| Top-k (sparse routing) | 8 |
| Gumbel tau | 1.0 (init), 0.3 (min) |
| Gathered V_phi | Yes |
| Per-layer V_phi scale | Yes |
| LN before distance | Yes |
| Layer checkpoint | Yes |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping gamma | 0.30 (fixed) |
| lambda_V (V_theta regularisation) | 0.01 |
| Total parameters | 17,335,568 |
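
The damped Euler update in the diagram can be sketched in isolation. This is a minimal illustration in plain Python, not the model's code: the step size dt = 1.0, the unit mass, and the single toy quadratic well are assumptions made here (the card fixes only gamma = 0.30 and L = 8), and the LayerNorm that follows each step is omitted.

```python
def damped_euler_step(h, v, force, mass, dt=1.0, gamma=0.30):
    """One update: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v (element-wise)."""
    v_new = [(v_i + dt * f_i / mass) / (1.0 + dt * gamma) for v_i, f_i in zip(v, force)]
    h_new = [h_i + dt * v_i for h_i, v_i in zip(h, v_new)]
    return h_new, v_new


def simulate(steps, a=0.5, mu=2.0, mass=1.0):
    """Integrate a toy well V(h) = 0.5 * a * (h - mu)**2 starting from rest at h = 0."""
    h, v = [0.0], [0.0]
    for _ in range(steps):
        force = [-a * (h_i - mu) for h_i in h]  # f = -dV/dh for the toy well
        h, v = damped_euler_step(h, v, force, mass)
    return h, v


h_8, _ = simulate(8)      # the model's depth: the state is still oscillating around mu
h_200, _ = simulate(200)  # many more steps: damping has dissipated the motion
print(f"after 8 steps: h = {h_8[0]:.3f}; after 200 steps: h = {h_200[0]:.3f}")
```

With these toy settings the state overshoots the well centre and spirals in; the division by (1 + dt*gamma) is what makes the trajectory settle rather than keep oscillating.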

Structured V_theta with Pairwise V_phi: Why the Gap Nearly Vanishes

In the standalone Multi-Xi SPLM, replacing the MLP \(V_\theta\) with SQ3 incurs a 1.82 PPL gap (13.33 vs 11.51). In PARFLM, the same replacement incurs only a 0.17 PPL gap (12.27 vs 12.10), an order of magnitude smaller.

The reason is landscape compression: with \(V_\phi\) carrying the bulk of the force budget, \(V_\theta\) collapses to a near-flat bias field:

| Landscape metric | SPLM structured V_theta | PARFLM structured V_theta |
|---|---|---|
| Mean V_theta | 99.8 | 0.02 |
| Std V_theta | 26.3 | 0.51 |
| Range | 644.9 | 31.4 |
| PPL gap vs MLP | 1.82 (5.5%) | 0.17 (0.6%) |

The MLP's expressivity advantage is irrelevant when the dynamic range of \(V_\theta\) is negligible. The pairwise \(V_\phi\) absorbs the nonlinear structure that quadratic wells cannot express, making structured \(V_\theta\) essentially free in PARFLM.

The force from the structured \(V_\theta\) is computed in closed form:

\[
f_\theta = -\nabla_h V_\theta = -\sum_{k=1}^{K} q_k(\xi, h) \, a_k(\xi) \odot (h - \mu_k(\xi))
\]

where \(q_k\) are the softmax responsibilities over the K=8 quadratic wells, \(\mu_k(\xi)\) are the attractor centres, and \(a_k(\xi)\) are the per-dimension (diagonal) curvatures of well k. The \(V_\phi\) force still uses autograd (sparse, over the top-k=8 neighbours only).
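
The closed-form force can be checked numerically on a toy problem. The sketch below is a dependency-free illustration in plain Python, not the repository's MixtureQuadraticVTheta: it assumes each well energy is E_k(h) = 0.5 * sum_i a_k[i] * (h[i] - mu_k[i])^2 with no offset, treats the centres, curvatures and mixture weights as fixed numbers (in the model they depend on xi), and compares the analytical force against central finite differences of V.

```python
import math


def well_logits(h, centres, curvatures, log_pi, tau):
    """Per-well logits -E_k / tau + log pi_k, with E_k a diagonal quadratic in h."""
    logits = []
    for mu, a, lp in zip(centres, curvatures, log_pi):
        energy = 0.5 * sum(a_i * (h_i - mu_i) ** 2 for h_i, mu_i, a_i in zip(h, mu, a))
        logits.append(-energy / tau + lp)
    return logits


def potential(h, centres, curvatures, log_pi, tau=1.0):
    """V(h) = -tau * logsumexp_k(-E_k / tau + log pi_k), computed stably."""
    logits = well_logits(h, centres, curvatures, log_pi, tau)
    top = max(logits)
    return -tau * (top + math.log(sum(math.exp(z - top) for z in logits)))


def analytical_force(h, centres, curvatures, log_pi, tau=1.0):
    """f = -sum_k q_k * a_k * (h - mu_k), with q the softmax of the well logits."""
    logits = well_logits(h, centres, curvatures, log_pi, tau)
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    q = [w / total for w in weights]
    return [
        -sum(q_k * a[i] * (h[i] - mu[i]) for q_k, mu, a in zip(q, centres, curvatures))
        for i in range(len(h))
    ]


def numerical_force(h, centres, curvatures, log_pi, tau=1.0, eps=1e-5):
    """Central-difference estimate of -grad_h V, used only as a cross-check."""
    force = []
    for i in range(len(h)):
        up, down = list(h), list(h)
        up[i] += eps
        down[i] -= eps
        dv = potential(up, centres, curvatures, log_pi, tau) - potential(
            down, centres, curvatures, log_pi, tau
        )
        force.append(-dv / (2.0 * eps))
    return force


# Toy problem (values chosen arbitrarily): two wells in two dimensions, equal weights.
centres = [[1.0, 0.0], [-1.0, 0.5]]
curvatures = [[1.0, 2.0], [0.5, 1.0]]
log_pi = [math.log(0.5), math.log(0.5)]
h = [0.3, -0.2]

f_exact = analytical_force(h, centres, curvatures, log_pi)
f_check = numerical_force(h, centres, curvatures, log_pi)
max_error = max(abs(x - y) for x, y in zip(f_exact, f_check))
print(f"analytical force: {f_exact}; max deviation from finite differences: {max_error:.2e}")
```

Because the responsibilities come from the same logits that define V, the force needs only one pass over the wells and no differentiation of a computation graph.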

For full derivations, all four structured variants (SQ1--SQ4), landscape compression analysis, attractor basin decoding, and hyperparameter selection strategies, see the companion note: Structured_VTheta_Design_and_Theory.md.

How to Get Started

```python
import sys

import torch
from huggingface_hub import hf_hub_download

# Make the repository's model packages importable.
sys.path.insert(0, "multixi")
sys.path.insert(0, "parf")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from parf.model_parf_multixi import MultiXiPARFLM, MultiXiPARFConfig
from parf.model_structured_vtheta import MixtureQuadraticVTheta
from parf.model_structured_vtheta_multixi import StructuredVThetaMultiXiAdapter

# -- Build base model --
config = MultiXiPARFConfig(
    vocab_size=50257, d=256, n_layers=8,
    v_hidden=1024, v_depth=3,
    max_len=1024, block_size=512,
    gamma=0.30, xi_channels=4,
    xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
    xi_learnable=True, mass_mode="logfreq",
    logfreq_path="logfreq_surprisal_tinystories.npy",
    v_phi_kind="structural_competitive",
    v_phi_phi_hidden=128, v_phi_theta_hidden=128,
    top_k=8, score_head_hidden=32,
    gumbel_tau_init=1.0, gumbel_tau_min=0.3,
    gumbel_noise=True,
    use_gathered_v_phi=True,
    use_layer_checkpoint=True,
    ln_before_distance=True,
    per_layer_v_phi_scale=True,
)
model = MultiXiPARFLM(config)

# -- Swap in structured V_theta (must happen before loading the checkpoint) --
K_xi, d = 4, 256
inner = MixtureQuadraticVTheta(d=K_xi * d, K=8, tau=1.0)
model.V_theta = StructuredVThetaMultiXiAdapter(inner, K=K_xi, d=d)

# -- Load checkpoint --
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-parflm-multixi-structured-vtheta",
    filename="checkpoint/ckpt_best.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# -- Read attractor centres directly --
x = torch.randint(0, 50257, (1, 64))  # random token ids, for a shape check only
with torch.no_grad():
    h = model._embed(x)
    xis = model._compute_xis(h)                        # (1, 64, 4, 256)
    centres = model.V_theta.attractor_centres(xis)      # (1, 64, 8, 1024)
    print(f"Attractor centres shape: {centres.shape}")   # 8 basins per token
```

Available Artifacts

| File | Description |
|---|---|
| checkpoint/ckpt_best.pt | Best checkpoint (A2 arm, 12.27 PPL at step 14,400) |
| training_log.jsonl | Per-step training metrics (40 eval points) |
| training_curve_A2.png | Training/validation loss curves |
| v_theta_hist_A2.png | V_theta output distribution histogram |
| landscape_stats_A2.json | V_theta landscape statistics (mean, std, range) |
| model_structured_vtheta.py | Structured V_theta classes (SQ1--SQ4) |
| model_structured_vtheta_multixi.py | Multi-Xi adapter |
| config.json | Model configuration |

Training Details

Training Data

TinyStories: a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.

Training Procedure

The base model architecture is identical to the Multi-Xi PARFLM. The only modification is the \(V_\theta\) replacement: the 3-layer MLP is swapped for the SQ3 mixture at model construction time, before training begins from scratch. The pairwise \(V_\phi\) (competitive structural MLP, hidden=128, top-k=8) is unchanged.

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 |
| lambda_V (V_theta regularisation) | 0.01 |
| Hardware | NVIDIA A100 40GB (Google Colab) |
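
The learning-rate settings above can be turned into a concrete schedule. The sketch below is one plausible reading, not the training notebook's code: it assumes a linear warmup from zero over the first 400 steps and a cosine decay that reaches a floor of 0.0 at step 16,000 (the card states neither the warmup shape nor the final learning rate). It also computes the tokens consumed per optimizer step, assuming every batch holds 16 full 512-token blocks.

```python
import math


def learning_rate(step, peak_lr=5e-4, warmup_steps=400, total_steps=16_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


tokens_per_step = 16 * 512  # batch size x block size (assumes full blocks)

for step in (0, 200, 400, 8_200, 16_000):
    print(f"step {step:>6}: lr = {learning_rate(step):.2e}")
print(f"tokens per step: {tokens_per_step:,}")
```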

Training Script

notebooks/conservative_arch/scaleup/colab_parf_multixi_structured_vtheta.ipynb: a Colab notebook with structured V_theta arms, GDrive output, checkpointing, and live progress display.

Evaluation Results

TinyStories Validation Perplexity

| Model | PPL | Params | Analytical V_theta grad | PPL gap vs MLP V_theta |
|---|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | n/a | n/a |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | No | n/a |
| Multi-Xi SPLM (MLP) | 11.51 | 16.5M | No | n/a |
| Multi-Xi PARFLM (MLP) | 12.06 | 17.6M | No | n/a |
| Multi-Xi PARFLM (SQ3, this model) | 12.27 | 17.3M | Yes | 0.17 PPL |
| Multi-Xi SPLM (SQ3) | 13.33 | 17.3M | Yes | 1.82 PPL |

V_theta Landscape Statistics

| Metric | This model (PARFLM) | SPLM structured V_theta |
|---|---|---|
| Mean V_theta | 0.02 | 99.8 |
| Std V_theta | 0.51 | 26.3 |
| Range | 31.4 | 644.9 |

Learned Xi-Channel Decay Rates

The final learned decay rates \([\alpha_1, \ldots, \alpha_4] = [0.11, 0.55, 0.81, 0.97]\) are nearly identical to those of both the SPLM structured V_theta model and the MLP PARFLM baseline, which indicates that the causal EMA context structure is robust to the change in \(V_\theta\) parameterisation.
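
To give the decay rates a scale, the sketch below converts each alpha into an effective context length. It assumes the recurrence xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t with a zero initial state, which the card does not spell out; under the opposite convention (alpha weighting the new input) the ordering of the channels would flip. Function names are illustrative.

```python
def causal_ema(seq, alpha):
    """xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t, starting from xi = 0 (assumed form)."""
    out, state = [], 0.0
    for h_t in seq:
        state = alpha * state + (1.0 - alpha) * h_t
        out.append(state)  # depends only on inputs up to and including position t
    return out


def effective_horizon(alpha):
    """Time constant 1 / (1 - alpha), in tokens, of the geometric weighting."""
    return 1.0 / (1.0 - alpha)


learned_alphas = [0.11, 0.55, 0.81, 0.97]
horizons = [effective_horizon(a) for a in learned_alphas]
for a, n in zip(learned_alphas, horizons):
    print(f"alpha = {a:.2f}: horizon of about {n:.1f} tokens")

# Response to a unit impulse: the weight on a token decays geometrically with its age.
impulse_response = causal_ema([1.0, 0.0, 0.0], alpha=0.5)
```

Under this reading the four channels span roughly 1, 2, 5 and 33 tokens, that is, from the current token alone to a sentence-scale window.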

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

| Model | Design | PPL | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM (MLP) | Pure scalar potential | 11.51 | semsimula-splm-multixi |
| Multi-Xi SPLM (SQ3) | Structured scalar potential | 13.33 | semsimula-splm-multixi-structured-vtheta |
| Multi-Xi PARFLM (MLP) | Scalar + pairwise forces | 12.06 | semsimula-parflm-multixi |
| Multi-Xi PARFLM (SQ3) | Structured scalar + pairwise | 12.27 | this model |
| Fock-PARFLM v2.1 | PARFLM + Fock registers | 9.30 | semsimula-fock-parflm |
| Hybrid SPLM+Attn | Attention + SPLM refinement | 8.50 | semsimula-hybrid-splm |

Collection: Semantic Simulation SPLM Model Family

Bias, Risks, and Limitations

  • Research checkpoint only. This model is a proof-of-concept for structured scalar potentials in pairwise-force architectures, not a production system.
  • TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
  • English only. No multilingual capability.
  • Small scale. 17.3M parameters, 256-dim hidden states.
  • No safety training. No RLHF, DPO, or safety filtering has been applied.
  • V_phi still uses autograd. Only the \(V_\theta\) gradient is analytical; the \(V_\phi\) pairwise force still requires torch.autograd.grad. The overall training speedup is therefore partial (\(V_\phi\) dominates the cost).

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

  • Hardware: NVIDIA A100 40GB (Google Colab)
  • Training time: ~2.5 hours (16,000 steps, A2 arm)
  • Carbon footprint: Estimated less than 1.5 kg CO2