Multi-Xi PARFLM with Structured V_theta (SQ3 Mixture of Quadratic Wells)
A structured variant of the Multi-Xi PARFLM conservative language model in which the MLP scalar potential \(V_\theta\) is replaced by a mixture of K=8 diagonal quadratic wells (SQ3), while the full MLP-based pairwise potential \(V_\phi\) is retained. This replacement yields:
- Near-zero PPL cost: 12.27 PPL vs the MLP baseline's 12.10 PPL (a gap of only 0.17 PPL, 0.6% excess cross-entropy). The pairwise \(V_\phi\) absorbs nearly all of the nonlinear structure that the quadratic wells cannot express.
- Full analytical gradients for \(V_\theta\): no `torch.autograd.grad` is needed for the scalar potential force, which eliminates the second-order computation graph for the \(V_\theta\) component.
- Explicit attractor centres: the 8 semantic attractors are readable directly from the model parameters, with no gradient-descent extraction required.
- Compressed landscape: with \(V_\phi\) carrying the bulk of the force budget, \(V_\theta\) collapses to a near-flat bias field (mean 0.02, range 31.4 vs 644.9 for SPLM).
This model is from the Semantic Simulation framework.
Table of Contents
- When to Use This Model
- Architecture
- Structured V_theta with Pairwise V_phi: Why the Gap Nearly Vanishes
- How to Get Started
- Training Details
- Evaluation Results
- SPLM Family Overview
- Bias, Risks, and Limitations
- Citation
When to Use This Model
Choose this structured variant over the MLP-based Multi-Xi PARFLM when:
| Priority | Structured V_theta (this model) | MLP V_theta (baseline) |
|---|---|---|
| Interpretability | 8 explicit attractor centres, zero-cost basin readout | Black-box; requires 1,500-step GD extraction per prompt |
| Inference speed | ~2x faster V_theta force computation (analytical gradient) | Standard (autograd for both V_theta and V_phi) |
| Raw PPL | 12.27 | 12.10 |
| PPL gap | 0.17 PPL (0.6%) — essentially free | — |
| Memory | No second-order graph for V_theta (V_phi graph retained) | Full graph for both |
Bottom line: for PARFLM, structured \(V_\theta\) is the recommended default. The 0.17 PPL gap is negligible, while the interpretability and speed gains are substantial.
Architecture
```
Input tokens x_1, ..., x_T
    |
Embedding E[x] + positional encoding
    |
For each of L=8 integration steps:
    |
    +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)    [K_xi=4 channels]
    |
    +-- Structured V_theta (SQ3):
    |     xi_flat = flatten(xi_1..xi_K)                       [K_xi * d = 1024]
    |     V = -tau * logsumexp_k(-E_k/tau + log pi_k)         [K_mix=8 wells]
    |     f_theta = -analytical_grad_h V                      [closed-form]
    |
    +-- Pairwise V_phi (competitive structural MLP):
    |     scores = score_net(h_t, h_s)                        [for all s <= t]
    |     top-k selection via Gumbel-softmax                  [k=8 neighbours]
    |     f_phi = -grad_h V_phi(h_t, h_s)                     [autograd, sparse]
    |
    +-- Total force: f = f_theta + f_phi
    |
    +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
    |
    +-- LayerNorm(h)
    |
Logits = h @ E^T    [tied embeddings]
```
| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| V_theta kind | SQ3 (mixture of K quadratic wells) |
| Mixture components (K_mix) | 8 |
| Temperature (tau) | 1.0 |
| Xi channels (K_xi) | 4 |
| V_phi kind | structural_competitive |
| V_phi hidden | 128 |
| Top-k (sparse routing) | 8 |
| Gumbel tau | 1.0 (init), 0.3 (min) |
| Gathered V_phi | Yes |
| Per-layer V_phi scale | Yes |
| LN before distance | Yes |
| Layer checkpoint | Yes |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping gamma | 0.30 (fixed) |
| lambda_V (V_theta regularisation) | 0.01 |
| Total parameters | 17,335,568 |
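The damped Euler update listed in the architecture diagram can be sketched on plain Python lists. This is only an illustration of the three assignments `v += dt*f/m; v /= (1 + dt*gamma); h += dt*v`: the step size `dt=1.0`, the scalar mass, and the toy restoring force `-h` are assumptions made for the example, not values taken from the released model.

```python
def damped_euler_step(h, v, force, mass, dt=1.0, gamma=0.30):
    """One damped Euler step, applied element-wise.

    Mirrors the diagram: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v.
    dt=1.0 is a hypothetical default; gamma=0.30 is the card's fixed damping.
    """
    v_new = [(vi + dt * fi / mass) / (1.0 + dt * gamma) for vi, fi in zip(v, force)]
    h_new = [hi + dt * vi for hi, vi in zip(h, v_new)]
    return h_new, v_new


# Toy usage: a single quadratic well centred at the origin, so f = -h.
h, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(8):  # L=8 integration steps, as in the architecture table
    force = [-hi for hi in h]
    h, v = damped_euler_step(h, v, force, mass=1.0)
print("state after 8 steps:", h)
```

Dividing the velocity by `1 + dt*gamma` after the force kick is an implicit treatment of the friction term, which keeps the update stable for the toy force used here.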
Structured V_theta with Pairwise V_phi: Why the Gap Nearly Vanishes
In the standalone Multi-Xi SPLM, replacing the MLP \(V_\theta\) with SQ3 incurs a 1.82 PPL gap (13.33 vs 11.51). In PARFLM, the same replacement incurs only a 0.17 PPL gap (12.27 vs 12.10), an order of magnitude smaller.
The reason is landscape compression: with \(V_\phi\) carrying the bulk of the force budget, \(V_\theta\) collapses to a near-flat bias field:
| Landscape metric | SPLM structured V_theta | PARFLM structured V_theta |
|---|---|---|
| Mean V_theta | 99.8 | 0.02 |
| Std V_theta | 26.3 | 0.51 |
| Range | 644.9 | 31.4 |
| PPL gap vs MLP | 1.82 (5.5%) | 0.17 (0.6%) |
The MLP's expressivity advantage is irrelevant when \(V_\theta\)'s dynamic range is negligible. The pairwise \(V_\phi\) absorbs the nonlinear structure that quadratic wells cannot express, making structured \(V_\theta\) essentially free in PARFLM.
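The PPL gaps quoted in this section can be reproduced from the reported perplexities. The sketch below assumes that "excess cross-entropy" means the relative increase in mean per-token cross-entropy, that is ln(PPL_SQ3) / ln(PPL_MLP) - 1. The card does not state the definition, so treat that reading as an assumption.

```python
import math

# Validation perplexities as quoted in this section.
PPL = {
    "splm_mlp": 11.51,
    "splm_sq3": 13.33,
    "parflm_mlp": 12.10,
    "parflm_sq3": 12.27,
}


def ppl_gap(structured, baseline):
    """Absolute perplexity gap between structured and MLP V_theta."""
    return structured - baseline


def excess_cross_entropy(structured, baseline):
    """Relative increase in mean cross-entropy (assumed definition)."""
    return math.log(structured) / math.log(baseline) - 1.0


splm_gap = ppl_gap(PPL["splm_sq3"], PPL["splm_mlp"])
parflm_gap = ppl_gap(PPL["parflm_sq3"], PPL["parflm_mlp"])
parflm_excess = excess_cross_entropy(PPL["parflm_sq3"], PPL["parflm_mlp"])

print(f"SPLM gap:   {splm_gap:.2f} PPL")
print(f"PARFLM gap: {parflm_gap:.2f} PPL ({100 * parflm_excess:.1f}% excess cross-entropy)")
print(f"Gap ratio:  {splm_gap / parflm_gap:.1f}x")
```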
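The force from the structured \(V_\theta\) is computed in closed form:
\[
f_\theta = -\nabla_h V_\theta = -\sum_{k=1}^{8} r_k \, \nabla_h E_k,
\qquad
r_k = \operatorname{softmax}_k\!\left(-E_k/\tau + \log \pi_k\right),
\]
where \(r_k\) are the softmax responsibilities over the 8 quadratic wells. The \(V_\phi\) force still uses autograd (sparse, over top-k=8 neighbours only).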
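The closed-form force of a log-sum-exp mixture of wells can be checked numerically. The sketch below is a pure-Python illustration, not the repository's `MixtureQuadraticVTheta`: it assumes each well has the diagonal form E_k(x) = 0.5 * sum_j a_kj (x_j - mu_kj)^2 with fixed centres, and it differentiates with respect to the well input directly, ignoring the chain rule through the EMA channels. The names `centres`, `curvatures`, and `log_pi` are hypothetical.

```python
import math


def sq3_potential_and_force(x, centres, curvatures, log_pi, tau=1.0):
    """V = -tau * logsumexp_k(-E_k/tau + log pi_k) and its analytical force.

    Returns (V, force, responsibilities), where force = -dV/dx.
    Assumes diagonal quadratic wells; see the lead-in for caveats.
    """
    energies = [
        0.5 * sum(a * (xj - mu) ** 2 for xj, mu, a in zip(x, mu_k, a_k))
        for mu_k, a_k in zip(centres, curvatures)
    ]
    logits = [-e / tau + lp for e, lp in zip(energies, log_pi)]
    m = max(logits)  # subtract the max for a numerically stable logsumexp
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    resp = [math.exp(l - lse) for l in logits]  # softmax responsibilities
    # dV/dx_j = sum_k r_k * a_kj * (x_j - mu_kj); the force is its negative.
    force = [
        -sum(r * a_k[j] * (x[j] - mu_k[j])
             for r, mu_k, a_k in zip(resp, centres, curvatures))
        for j in range(len(x))
    ]
    return -tau * lse, force, resp


# Toy usage with two wells in two dimensions.
x = [0.3, -0.7]
centres = [[0.0, 0.0], [1.0, -1.0]]
curvatures = [[1.0, 2.0], [0.5, 1.5]]
log_pi = [math.log(0.5), math.log(0.5)]
V, force, resp = sq3_potential_and_force(x, centres, curvatures, log_pi)
print("V =", V, "force =", force, "responsibilities =", resp)
```

The responsibilities also give a direct basin readout: the index of the largest one identifies the well that dominates at `x`.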
For full derivations, all four structured variants (SQ1--SQ4), landscape compression analysis, attractor basin decoding, and hyperparameter selection strategies, see the companion note: Structured_VTheta_Design_and_Theory.md.
How to Get Started
```python
import torch, sys
sys.path.insert(0, "multixi")
sys.path.insert(0, "parf")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from parf.model_parf_multixi import MultiXiPARFLM, MultiXiPARFConfig
from parf.model_structured_vtheta import MixtureQuadraticVTheta
from parf.model_structured_vtheta_multixi import StructuredVThetaMultiXiAdapter

# -- Build base model --
config = MultiXiPARFConfig(
    vocab_size=50257, d=256, n_layers=8,
    v_hidden=1024, v_depth=3,
    max_len=1024, block_size=512,
    gamma=0.30, xi_channels=4,
    xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
    xi_learnable=True, mass_mode="logfreq",
    logfreq_path="logfreq_surprisal_tinystories.npy",
    v_phi_kind="structural_competitive",
    v_phi_phi_hidden=128, v_phi_theta_hidden=128,
    top_k=8, score_head_hidden=32,
    gumbel_tau_init=1.0, gumbel_tau_min=0.3,
    gumbel_noise=True,
    use_gathered_v_phi=True,
    use_layer_checkpoint=True,
    ln_before_distance=True,
    per_layer_v_phi_scale=True,
)
model = MultiXiPARFLM(config)

# -- Swap in structured V_theta --
K_xi, d = 4, 256
inner = MixtureQuadraticVTheta(d=K_xi * d, K=8, tau=1.0)
model.V_theta = StructuredVThetaMultiXiAdapter(inner, K=K_xi, d=d)

# -- Load checkpoint --
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-parflm-multixi-structured-vtheta",
    filename="checkpoint/ckpt_best.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# -- Read attractor centres directly --
x = torch.randint(0, 50257, (1, 64))
with torch.no_grad():
    h = model._embed(x)
    xis = model._compute_xis(h)                     # (1, 64, 4, 256)
    centres = model.V_theta.attractor_centres(xis)  # (1, 64, 8, 1024)
print(f"Attractor centres shape: {centres.shape}")  # 8 basins per token
```
Available Artifacts
| File | Description |
|---|---|
| `checkpoint/ckpt_best.pt` | Best checkpoint (A2 arm, 12.27 PPL at step 14,400) |
| `training_log.jsonl` | Per-step training metrics (40 eval points) |
| `training_curve_A2.png` | Training/validation loss curves |
| `v_theta_hist_A2.png` | V_theta output distribution histogram |
| `landscape_stats_A2.json` | V_theta landscape statistics (mean, std, range) |
| `model_structured_vtheta.py` | Structured V_theta classes (SQ1--SQ4) |
| `model_structured_vtheta_multixi.py` | Multi-Xi adapter |
| `config.json` | Model configuration |
Training Details
Training Data
TinyStories --- a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.
Training Procedure
The base model architecture is identical to the Multi-Xi PARFLM. The only modification is the \(V_\theta\) replacement: the 3-layer MLP is swapped for the SQ3 mixture at model construction time, before training begins from scratch. The pairwise \(V_\phi\) (competitive structural MLP, hidden=128, top-k=8) is unchanged.
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 |
| lambda_V (V_theta regularisation) | 0.01 |
| Hardware | NVIDIA A100 40GB (Google Colab) |
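The learning-rate row above (5e-4 with cosine decay, 400 warmup steps, 16,000 training steps) can be sketched as a schedule function. The linear warmup shape and the decay floor of zero are assumptions; the card gives only the peak rate, the warmup length, and the decay type, and the notebook may implement the schedule differently.

```python
import math


def lr_at(step, peak_lr=5e-4, warmup_steps=400, total_steps=16_000, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Linear warmup to peak_lr, then cosine decay to min_lr.
    The warmup shape and min_lr=0.0 are assumptions for illustration.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)  # clamp past the end of training
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 200, 400, 8_200, 14_400, 16_000):
    print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```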
Training Script
notebooks/conservative_arch/scaleup/colab_parf_multixi_structured_vtheta.ipynb --- Colab notebook with structured V_theta arms, GDrive output, checkpointing, and live progress display.
Evaluation Results
TinyStories Validation Perplexity
| Model | PPL | Params | Analytical V_theta grad | V_theta--MLP gap |
|---|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | --- | --- |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | No | --- |
| Multi-Xi SPLM (MLP) | 11.51 | 16.5M | No | --- |
| Multi-Xi PARFLM (MLP) | 12.06 | 17.6M | No | --- |
| Multi-Xi PARFLM (SQ3, this model) | 12.27 | 17.3M | Yes | 0.17 PPL |
| Multi-Xi SPLM (SQ3) | 13.33 | 17.3M | Yes | 1.82 PPL |
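For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood in nats. The sketch below shows that standard definition; it assumes the figures in the table were computed this way, which the card does not state explicitly.

```python
import math


def perplexity(token_nlls):
    """exp(mean NLL), with each NLL given in nats (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A mean loss of ln(12.27), about 2.507 nats, corresponds to the reported 12.27 PPL.
example_nlls = [math.log(12.27)] * 10
print(f"PPL = {perplexity(example_nlls):.2f}")
```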
V_theta Landscape Statistics
| Metric | This model (PARFLM) | SPLM structured V_theta |
|---|---|---|
| Mean V_theta | 0.02 | 99.8 |
| Std V_theta | 0.51 | 26.3 |
| Range | 31.4 | 644.9 |
Learned Xi-Channel Decay Rates
The final learned alpha values are nearly identical to both the SPLM structured V_theta and the MLP PARFLM baseline, confirming that the causal EMA context structure is robust to the parameterisation change.
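For readers unfamiliar with the causal EMA context channels, a minimal scalar sketch follows. The recurrence convention (xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t with a zero initial state, so that larger alpha means longer memory) is an assumption, and the repository's `causal_ema` may use a different one. The alpha values are the initial values from the configuration, not the learned ones.

```python
def causal_ema(h, alpha):
    """Causal exponential moving average over a sequence of floats.

    Assumed recurrence: xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t, xi_{-1} = 0.
    Position t only sees h_0..h_t, so the channel never looks ahead.
    """
    xi, out = 0.0, []
    for h_t in h:
        xi = alpha * xi + (1.0 - alpha) * h_t
        out.append(xi)
    return out


# One channel per decay rate (K_xi = 4), using the configured initial alphas.
alphas = [0.25, 0.5, 0.75, 0.95]
signal = [1.0, 0.0, 0.0, 0.0]  # unit impulse at t = 0
channels = [causal_ema(signal, a) for a in alphas]
for a, ch in zip(alphas, channels):
    print(f"alpha={a}: {ch}")
```

Under this convention the impulse response decays as alpha^t, so the four channels summarise the context at four different time scales.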
SPLM Family Overview
This model is part of the Semantic Simulation SPLM family:
| Model | Design | PPL | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM (MLP) | Pure scalar potential | 11.51 | semsimula-splm-multixi |
| Multi-Xi SPLM (SQ3) | Structured scalar potential | 13.33 | semsimula-splm-multixi-structured-vtheta |
| Multi-Xi PARFLM (MLP) | Scalar + pairwise forces | 12.06 | semsimula-parflm-multixi |
| Multi-Xi PARFLM (SQ3) | Structured scalar + pairwise | 12.27 | this model |
| Fock-PARFLM v2.1 | PARFLM + Fock registers | 9.30 | semsimula-fock-parflm |
| Hybrid SPLM+Attn | Attention + SPLM refinement | 8.50 | semsimula-hybrid-splm |
Collection: Semantic Simulation SPLM Model Family
Bias, Risks, and Limitations
- Research checkpoint only. This model is a proof-of-concept for structured scalar potentials in pairwise-force architectures, not a production system.
- TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
- English only. No multilingual capability.
- Small scale. 17.3M parameters, 256-dim hidden states.
- No safety training. No RLHF, DPO, or safety filtering has been applied.
- V_phi still uses autograd. Only the \(V_\theta\) gradient is analytical; the pairwise force still requires `torch.autograd.grad`. The overall training speedup is therefore partial (\(V_\phi\) dominates the cost).
Citation
@misc{Gueorguiev2026SemSim,
author = {Gueorguiev, Dimitar P.},
title = {Semantic Simulation: A Prescriptive Lagrangian Framework
for Efficient Semantic Inference --- A Conservative-by-
Construction Language Model and the Shared-Potential
Separator, with a Correspondence to Joint Embedding
Predictive Architectures},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19712427},
url = {https://doi.org/10.5281/zenodo.19712427},
note = {Version v15 (Jun 7, 2026).
Companion code repository (DOI 10.5281/zenodo.20579561):
\url{https://github.com/dimitarpg13/semsimula-paper}}
}
Environmental Impact
- Hardware: NVIDIA A100 40GB (Google Colab)
- Training time: ~2.5 hours (16,000 steps, A2 arm)
- Carbon footprint: Estimated less than 1.5 kg CO2