Fock-PARFLM v2.1 with Structured V_theta (SQ3 Mixture of Quadratic Wells)

A structured variant of the Fock-PARFLM v2.1 conservative language model in which the MLP scalar potential \(V_\theta\) is replaced by a mixture of K=8 diagonal quadratic wells (SQ3), while retaining the full MLP-based pairwise potential \(V_\phi\) and the Fock register mechanism (16 registers, LIFO stack discipline, reverse channel). This replacement yields:

  • Full analytical gradients for \(V_\theta\): no torch.autograd.grad is needed for the scalar potential force, which eliminates the second-order computation graph for the \(V_\theta\) component
  • Explicit attractor centres: the 8 semantic attractors \(\mu_k(\xi)\) are readable directly from the model parameters, with no gradient-descent extraction required
  • Most compressed \(V_\theta\) landscape: with both \(V_\phi\) and Fock registers carrying the force budget, \(V_\theta\) collapses to the flattest bias field observed across all three architectures (mean 0.008, range 19.1)

The trade-off is a 1.06 PPL gap: 10.36 PPL vs the MLP baseline's 9.30 PPL (an 11.4% relative increase in perplexity). This is larger than the PARFLM gap (0.17 PPL) despite even greater landscape compression: the "Fock paradox" discussed below.
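As a quick check of the headline figures, the sketch below recomputes the gap from the two reported perplexities. It assumes perplexity is the exponential of the mean per-token cross-entropy in nats (the usual convention, though not stated explicitly in this card); the variable names are illustrative.

```python
import math

ppl_sq3 = 10.36   # structured V_theta (this model), as reported in the card
ppl_mlp = 9.30    # MLP V_theta baseline, as reported in the card

# Absolute and relative perplexity gap.
ppl_gap = ppl_sq3 - ppl_mlp
rel_ppl_gap = ppl_gap / ppl_mlp

# Assumption: PPL = exp(mean cross-entropy in nats), so CE = ln(PPL).
ce_gap_nats = math.log(ppl_sq3) - math.log(ppl_mlp)

print(f"PPL gap:          {ppl_gap:.2f}")
print(f"Relative PPL gap: {100 * rel_ppl_gap:.1f}%")
print(f"CE gap:           {ce_gap_nats:.3f} nats/token")
```

Under that assumption, the 11.4% figure is the PPL gap divided by the baseline PPL, and the same gap corresponds to roughly 0.108 nats per token of additional cross-entropy.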

This model is from the Semantic Simulation framework.

When to Use This Model

Choose this structured variant over the MLP-based Fock-PARFLM v2.1 when:

| Priority | Structured V_theta (this model) | MLP V_theta (baseline) |
|---|---|---|
| Interpretability | 8 explicit attractor centres, zero-cost basin readout | Black-box; requires 1,500-step GD extraction per prompt |
| Inference speed | ~2x faster V_theta force computation (analytical gradient) | Standard (autograd for both V_theta and V_phi) |
| Raw PPL | 10.36 | 9.30 |
| PPL gap | 1.06 PPL (11.4%) | --- |
| Memory | No second-order graph for V_theta (V_phi graph retained) | Full graph for both |

Bottom line: for Fock-PARFLM, structured \(V_\theta\) is a viable option when interpretability or analytical-gradient inference is valued; the 1.06 PPL cost is meaningful, but the attractor readout and speed gains may justify it. For the lowest PPL, use the MLP baseline.

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K_xi=4 channels]
       |
       +-- Structured V_theta (SQ3):
       |     xi_flat = flatten(xi_1..xi_K)                      [K_xi * d = 1024]
       |     V = -tau * logsumexp_k(-E_k/tau + log pi_k)        [K_mix=8 wells]
       |     f_theta = -analytical_grad_h V                     [closed-form]
       |
       +-- Pairwise V_phi (competitive structural MLP):
       |     scores = score_net(h_t, h_s)                       [for all s <= t]
       |     top-k selection via Gumbel-softmax                 [k=8 neighbours]
       |     f_phi = -grad_h V_phi(h_t, h_s)                   [autograd, sparse]
       |
       +-- Fock register pool (v2.1):
       |     M=16 virtual registers with Q/K/V creation gates
       |     LIFO stack discipline, salience decay
       |     Per-register tau and key subspaces
       |     Reverse channel (non-conservative exchange)
       |     f_fock = creation + destruction + exchange forces
       |
       +-- Total force: f = f_theta + f_phi + f_fock
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
       |
       +-- LayerNorm(h)
       |
   Logits = h @ E^T                                            [tied embeddings]

| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| V_theta kind | SQ3 (mixture of K quadratic wells) |
| Mixture components (K_mix) | 8 |
| Temperature (tau) | 1.0 |
| Xi channels (K_xi) | 4 |
| V_phi kind | structural_competitive |
| V_phi hidden | 128 |
| Top-k (sparse routing) | 8 |
| Gumbel tau | 1.0 (init), 0.3 (min) |
| Fock version | v2.1 |
| Registers (M) | 16 |
| Register d_k | 64 |
| Stack discipline | LIFO |
| Reverse channel | Yes |
| Per-register tau/keys | Yes |
| Gathered V_phi | Yes |
| Per-layer V_phi scale | Yes |
| LN before distance | Yes |
| Layer checkpoint | Yes |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping gamma | 0.30 (fixed) |
| lambda_V (V_theta regularisation) | 0.01 |
| Total parameters | 17,407,980 |
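
The damped Euler update in the diagram can be sketched in a few lines. This is a minimal illustration in plain Python rather than the model's PyTorch code: the function name `damped_euler_step`, the per-dimension lists, and the step size `dt=1.0` are assumptions (the card does not state `dt`), while `gamma=0.30` comes from the table above. The LayerNorm that follows each step is omitted.

```python
def damped_euler_step(h, v, f, m, dt=1.0, gamma=0.30):
    """One damped Euler update on per-dimension lists h, v, f with scalar mass m.

    Mirrors the diagram: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v.
    dt=1.0 is an assumed value; gamma=0.30 is the fixed damping from the card.
    """
    # Accelerate: the force is divided by the token mass.
    v = [v_i + dt * f_i / m for v_i, f_i in zip(v, f)]
    # Damp: dividing by (1 + dt*gamma) shrinks the velocity every step.
    v = [v_i / (1.0 + dt * gamma) for v_i in v]
    # Move: advance the hidden state with the damped velocity.
    h = [h_i + dt * v_i for h_i, v_i in zip(h, v)]
    return h, v


# With zero force, the velocity simply decays by a factor 1 / (1 + dt*gamma).
h_new, v_new = damped_euler_step(h=[0.0, 1.0], v=[1.3, -2.6], f=[0.0, 0.0], m=1.0)
print(h_new, v_new)
```

Because the damping is applied as a division after the force update, the velocity stays bounded for any positive `dt * gamma`, which is one reason this form is often preferred over subtracting `gamma * v` explicitly.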

The Fock Paradox: Maximal Compression, Moderate Gap

Across the three SPLM-family architectures tested with structured \(V_\theta\), the Fock-PARFLM exhibits a striking paradox: the flattest \(V_\theta\) landscape yet the second-largest expressivity gap.

| Architecture | Structured V_theta PPL | MLP baseline PPL | Gap | Gap (%) | Mean V_theta | Range |
|---|---|---|---|---|---|---|
| Multi-Xi SPLM (SQ3) | 13.33 | 11.51 | 1.82 | 5.5% | 99.8 | 644.9 |
| Fock-PARFLM v2.1 (SQ3, this model) | 10.36 | 9.30 | 1.06 | 11.4% | 0.008 | 19.1 |
| Multi-Xi PARFLM (SQ3) | 12.27 | 12.10 | 0.17 | 0.6% | 0.02 | 31.4 |

Landscape compression increases monotonically from SPLM to PARFLM to Fock-PARFLM (both the mean and the range of \(V_\theta\) shrink at each step), but the expressivity gap is non-monotonic: PARFLM achieves the smallest gap despite a less compressed landscape than Fock-PARFLM.

Why? One plausible explanation is proximity to the entropy floor. At 9.30 PPL, the Fock model operates closer to the dataset's entropy floor, where the marginal value of each nat of \(V_\theta\) precision is higher. The Fock register mechanism (creation/destruction operators, stack discipline, reverse channel) creates a more structured dynamical regime where even a near-flat \(V_\theta\) must provide fine-grained corrections that the diagonal quadratic parameterisation cannot match. The relationship between landscape compression and expressivity gap is therefore architecture-dependent, modulated by proximity to the entropy floor.

The force from the structured \(V_\theta\) is computed in closed form:

\[ f_\theta = -\nabla_h V_\theta = -\sum_{k=1}^{K} q_k(\xi, h) \cdot a_k(\xi) \odot (h - \mu_k(\xi)) \]

where \(q_k\) are the softmax responsibilities over the 8 quadratic wells. The \(V_\phi\) force still uses autograd (sparse, over top-k=8 neighbours only), and the Fock forces use their own differentiable computation.
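
The closed-form force can be checked numerically on a toy problem. The sketch below is a dependency-free illustration, not the repository's `MixtureQuadraticVTheta` class: it assumes per-well energies of the form E_k = 0.5 * sum_i a_k[i] * (h[i] - mu_k[i])^2, treats the centres `mus`, curvatures `a` and mixture log-weights `log_pi` as fixed lists rather than functions of xi, and uses a hypothetical function name `sq3_potential_and_force`. It evaluates V = -tau * logsumexp_k(-E_k/tau + log pi_k) and compares the analytical force with a central finite difference.

```python
import math


def sq3_potential_and_force(h, mus, a, log_pi, tau=1.0):
    """Mixture of diagonal quadratic wells: returns (V, force, responsibilities)."""
    # Assumed per-well energy: E_k = 0.5 * sum_i a_k[i] * (h[i] - mu_k[i])**2
    energies = [
        0.5 * sum(a_ki * (h_i - mu_ki) ** 2 for a_ki, h_i, mu_ki in zip(a_k, h, mu_k))
        for a_k, mu_k in zip(a, mus)
    ]
    logits = [-e_k / tau + lp_k for e_k, lp_k in zip(energies, log_pi)]
    # Numerically stable logsumexp.
    top = max(logits)
    lse = top + math.log(sum(math.exp(l - top) for l in logits))
    potential = -tau * lse
    # Softmax responsibilities q_k over the wells.
    q = [math.exp(l - lse) for l in logits]
    # Closed-form force: f = -sum_k q_k * a_k * (h - mu_k), elementwise in h.
    force = [
        -sum(q_k * a_k[i] * (h[i] - mu_k[i]) for q_k, a_k, mu_k in zip(q, a, mus))
        for i in range(len(h))
    ]
    return potential, force, q


# Toy setting: K=2 wells in d=2 (the model uses K=8 wells).
h = [0.3, -0.2]
mus = [[1.0, 0.0], [-1.0, 0.5]]
a = [[1.0, 2.0], [0.5, 1.5]]
log_pi = [math.log(0.5), math.log(0.5)]
V, force, q = sq3_potential_and_force(h, mus, a, log_pi)

# Central finite-difference estimate of -dV/dh for comparison.
eps = 1e-6
fd_force = []
for i in range(len(h)):
    h_plus, h_minus = list(h), list(h)
    h_plus[i] += eps
    h_minus[i] -= eps
    v_plus = sq3_potential_and_force(h_plus, mus, a, log_pi)[0]
    v_minus = sq3_potential_and_force(h_minus, mus, a, log_pi)[0]
    fd_force.append(-(v_plus - v_minus) / (2 * eps))

print("analytical:", force)
print("finite diff:", fd_force)
```

The two estimates should agree closely, which illustrates why no autograd pass is needed for this component: the gradient of the logsumexp reduces to a responsibility-weighted sum of linear restoring forces.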

For full derivations, all four structured variants (SQ1--SQ4), landscape compression analysis, attractor basin decoding, and hyperparameter selection strategies, see the companion note: Structured_VTheta_Design_and_Theory.md.

How to Get Started

import torch, sys
sys.path.insert(0, "multixi")
sys.path.insert(0, "parf")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from parf.model_fock_parf_multixi import FockMultiXiPARFLM, FockMultiXiPARFConfig
from parf.model_structured_vtheta import MixtureQuadraticVTheta
from parf.model_structured_vtheta_multixi import StructuredVThetaMultiXiAdapter

# -- Build base model --
config = FockMultiXiPARFConfig(
    vocab_size=50257, d=256, n_layers=8,
    v_hidden=1024, v_depth=3,
    max_len=1024, block_size=512,
    gamma=0.30, xi_channels=4,
    xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
    xi_learnable=True, mass_mode="logfreq",
    logfreq_path="logfreq_surprisal_tinystories.npy",
    v_phi_kind="structural_competitive",
    v_phi_phi_hidden=128, v_phi_theta_hidden=128,
    top_k=8, score_head_hidden=32,
    gumbel_tau_init=1.0, gumbel_tau_min=0.3,
    gumbel_noise=True,
    use_gathered_v_phi=True,
    use_layer_checkpoint=True,
    ln_before_distance=True,
    per_layer_v_phi_scale=True,
    n_registers=16, register_d_k=64,
    register_tau_create_init=8.0,
    register_salience_decay=0.5,
    register_salience_threshold=0.01,
    register_stack_discipline="lifo",
    use_reverse_channel=True,
)
model = FockMultiXiPARFLM(config)

# -- Swap in structured V_theta --
K_xi, d = 4, 256
inner = MixtureQuadraticVTheta(d=K_xi * d, K=8, tau=1.0)
model.V_theta = StructuredVThetaMultiXiAdapter(inner, K=K_xi, d=d)

# -- Load checkpoint --
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-fock-parflm-structured-vtheta",
    filename="checkpoint/ckpt_best.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# -- Read attractor centres directly --
x = torch.randint(0, 50257, (1, 64))
with torch.no_grad():
    h = model._embed(x)
    xis = model._compute_xis(h)                        # (1, 64, 4, 256)
    centres = model.V_theta.attractor_centres(xis)      # (1, 64, 8, 1024)
    print(f"Attractor centres shape: {centres.shape}")   # 8 basins per token

Available Artifacts

| File | Description |
|---|---|
| checkpoint/ckpt_best.pt | Best checkpoint (A2 arm, 10.36 PPL at step 14,400) |
| training_log.jsonl | Per-step training metrics (40 eval points) |
| training_curve_A2.png | Training/validation loss curves |
| v_theta_hist_A2.png | V_theta output distribution histogram |
| landscape_stats_A2.json | V_theta landscape statistics (mean, std, range) |
| model_structured_vtheta.py | Structured V_theta classes (SQ1--SQ4) |
| model_structured_vtheta_multixi.py | Multi-Xi adapter |
| config.json | Model configuration |

Training Details

Training Data

TinyStories: a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.

Training Procedure

The base model architecture is identical to the Fock-PARFLM v2.1. The only modification is the \(V_\theta\) replacement: the 3-layer MLP is swapped for the SQ3 mixture at model construction time, before training begins from scratch. The pairwise \(V_\phi\) (competitive structural MLP, hidden=128, top-k=8) and Fock register pool (16 registers, LIFO, reverse channel) are unchanged.

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 |
| lambda_V (V_theta regularisation) | 0.01 |
| Hardware | NVIDIA A100 40GB (Google Colab) |
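
For readers reproducing the schedule, the learning-rate settings above can be sketched as a single function. The card specifies only the peak rate (5e-4), the warmup length (400 steps), cosine decay and the step count (16,000); the linear shape of the warmup, the decay to `min_lr=0.0`, and the name `lr_at` are assumptions made for illustration and may differ from the training notebook.

```python
import math


def lr_at(step, peak_lr=5e-4, warmup_steps=400, total_steps=16_000, min_lr=0.0):
    """Learning rate at a given step: assumed linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    # Half-cosine from peak_lr down to min_lr (min_lr=0.0 is an assumption).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 400, 8_200, 14_400, 16_000):
    print(step, f"{lr_at(step):.2e}")
```

Step 14,400 is included because it is where the best checkpoint was recorded; under these assumptions the learning rate is already well into the tail of the cosine at that point.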

Training Script

notebooks/conservative_arch/scaleup/colab_fock_multixi_structured_vtheta.ipynb: Colab notebook with structured V_theta arms, GDrive output, checkpointing, and live progress display.

Evaluation Results

TinyStories Validation Perplexity

| Model | PPL | Params | Analytical V_theta grad | V_theta--MLP gap |
|---|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | --- | --- |
| Fock-PARFLM v2.1 (MLP) | 9.30 | 17.4M | No | --- |
| Fock-PARFLM v2.1 (SQ3, this model) | 10.36 | 17.4M | Yes | 1.06 PPL |
| Multi-Xi SPLM (MLP) | 11.51 | 16.5M | No | --- |
| Multi-Xi PARFLM (MLP) | 12.06 | 17.6M | No | --- |
| Multi-Xi PARFLM (SQ3) | 12.27 | 17.3M | Yes | 0.17 PPL |
| Multi-Xi SPLM (SQ3) | 13.33 | 17.3M | Yes | 1.82 PPL |

V_theta Landscape Statistics

| Metric | This model (Fock-PARFLM) | PARFLM structured V_theta | SPLM structured V_theta |
|---|---|---|---|
| Mean V_theta | 0.008 | 0.02 | 99.8 |
| Std V_theta | 0.26 | 0.51 | 26.3 |
| Range | 19.1 | 31.4 | 644.9 |

Learned Xi-Channel Decay Rates

The final learned alpha values \([\alpha_1, \ldots, \alpha_4] = [0.14, 0.56, 0.79, 0.95]\) are stable and consistent with the SPLM \([0.12, 0.59, 0.84, 0.97]\) and PARFLM \([0.11, 0.55, 0.81, 0.97]\) values, suggesting that the causal EMA context structure is largely insensitive to the \(V_\theta\) parameterisation, the presence of \(V_\phi\), and the Fock dynamics.
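
To give a feel for what these values mean, the sketch below implements one common form of a causal EMA and converts each alpha into an effective context length. The recursion xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t, with alpha weighting the past, is an assumption: the card names `causal_ema(h, alpha_k)` but does not spell out its definition, and the repository may parameterise it differently. The sketch also uses scalars per position, whereas the model operates on 256-dimensional vectors.

```python
def causal_ema(h, alpha):
    """Assumed recursion: xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t, xi_0 = h_0.

    h is a list of per-position scalars; position t only sees positions <= t.
    """
    xi, out = None, []
    for h_t in h:
        xi = h_t if xi is None else alpha * xi + (1.0 - alpha) * h_t
        out.append(xi)
    return out


# Learned decay rates reported for this model.
alphas = [0.14, 0.56, 0.79, 0.95]

# Under the assumed recursion, 1 / (1 - alpha) is the usual effective window.
horizons = [1.0 / (1.0 - alpha) for alpha in alphas]
for alpha, horizon in zip(alphas, horizons):
    print(f"alpha={alpha:.2f} -> effective window of about {horizon:.1f} tokens")

# An impulse at position 0 decays geometrically at later positions.
impulse_response = causal_ema([1.0, 0.0, 0.0], alpha=0.5)
print(impulse_response)
```

Read this way, the four channels would span roughly one token to about twenty tokens of context, which is consistent with the card's description of a multi-scale context structure.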

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

| Model | Design | PPL | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM (MLP) | Pure scalar potential | 11.51 | semsimula-splm-multixi |
| Multi-Xi SPLM (SQ3) | Structured scalar potential | 13.33 | semsimula-splm-multixi-structured-vtheta |
| Multi-Xi PARFLM (MLP) | Scalar + pairwise forces | 12.06 | semsimula-parflm-multixi |
| Multi-Xi PARFLM (SQ3) | Structured scalar + pairwise | 12.27 | semsimula-parflm-multixi-structured-vtheta |
| Fock-PARFLM v2.1 (MLP) | PARFLM + Fock registers | 9.30 | semsimula-fock-parflm |
| Fock-PARFLM v2.1 (SQ3) | Structured + pairwise + Fock | 10.36 | this model |
| Hybrid SPLM+Attn | Attention + SPLM refinement | 8.50 | semsimula-hybrid-splm |

Collection: Semantic Simulation SPLM Model Family

Bias, Risks, and Limitations

  • Research checkpoint only. This model is a proof-of-concept for structured scalar potentials in Fock-augmented architectures, not a production system.
  • TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
  • English only. No multilingual capability.
  • Small scale. 17.4M parameters, 256-dim hidden states.
  • No safety training. No RLHF, DPO, or safety filtering has been applied.
  • V_phi and Fock forces still use autograd. Only the \(V_\theta\) gradient is analytical; the \(V_\phi\) pairwise force and Fock register forces still require torch.autograd.grad. The overall training speedup is therefore partial (\(V_\phi\) and Fock forces dominate the cost).
  • Moderate expressivity gap. The 1.06 PPL gap (an 11.4% relative increase in perplexity) is larger than the PARFLM structured V_theta gap (0.17 PPL), reflecting the higher precision demands of the Fock architecture near the dataset's entropy floor.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

  • Hardware: NVIDIA A100 40GB (Google Colab)
  • Training time: ~3 hours (16,000 steps, A2 arm)
  • Carbon footprint: Estimated less than 2 kg CO2