Fock-PARFLM v2.1 with Structured V_theta (SQ3 Mixture of Quadratic Wells)

A structured variant of the Fock-PARFLM v2.1 conservative language model in which the MLP scalar potential $V_\theta$ is replaced by a mixture of K=8 diagonal quadratic wells (SQ3), while retaining the full MLP-based pairwise potential $V_\phi$ and the Fock register mechanism (16 registers, LIFO stack discipline, reverse channel). This replacement yields:

Full analytical gradients for $V_\theta$ — no torch.autograd.grad needed for the scalar potential force, eliminating the second-order computation graph for the $V_\theta$ component
Explicit attractor centres — the 8 semantic attractors $\mu_k(\xi)$ are readable directly from the model parameters, with no gradient-descent extraction required
Most compressed $V_\theta$ landscape — with both $V_\phi$ and Fock registers carrying the force budget, $V_\theta$ collapses to the flattest bias field observed across all three architectures (mean 0.008, range 19.1)

The trade-off is a 1.06 PPL gap: 10.36 PPL versus 9.30 PPL for the MLP baseline, an 11.4% relative increase in perplexity. This is larger than the PARFLM gap (0.17 PPL) despite even greater landscape compression, the "Fock paradox" discussed below.
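Because perplexity is the exponential of the mean per-token cross-entropy, the same gap can also be expressed in nats. The short calculation below uses only the two perplexities quoted above and assumes they are per-token values computed with the natural logarithm; the variable names are illustrative.

```python
import math

ppl_structured = 10.36  # SQ3 structured V_theta (this model)
ppl_mlp = 9.30          # MLP V_theta baseline

# Absolute and relative perplexity gap.
ppl_gap = ppl_structured - ppl_mlp
ppl_gap_rel = ppl_gap / ppl_mlp

# Perplexity = exp(cross-entropy), so cross-entropy = ln(perplexity).
ce_structured = math.log(ppl_structured)
ce_mlp = math.log(ppl_mlp)
ce_gap = ce_structured - ce_mlp

print(f"PPL gap: {ppl_gap:.2f} ({ppl_gap_rel:.1%} of baseline PPL)")
print(f"CE gap:  {ce_gap:.3f} nats/token ({ce_gap / ce_mlp:.1%} of baseline CE)")
```

Under that assumption, the 11.4% figure is the relative perplexity increase, while the cross-entropy increase is smaller because the logarithm compresses the ratio.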

This model is from the Semantic Simulation framework.

When to Use This Model
Architecture
The Fock Paradox: Maximal Compression, Moderate Gap
How to Get Started
Training Details
Evaluation Results
SPLM Family Overview
Bias, Risks, and Limitations
Citation

When to Use This Model

The table below compares this structured variant with the MLP-based Fock-PARFLM v2.1 baseline on the criteria that usually decide between them:

Priority	Structured V_theta (this model)	MLP V_theta (baseline)
Interpretability	8 explicit attractor centres, zero-cost basin readout	Black-box; requires 1,500-step GD extraction per prompt
Inference speed	~2x faster V_theta force computation (analytical gradient)	Standard (autograd for both V_theta and V_phi)
Raw PPL	10.36	9.30
PPL gap	1.06 PPL (11.4%)	—
Memory	No second-order graph for V_theta (V_phi graph retained)	Full graph for both

Bottom line: for Fock-PARFLM, structured $V_\theta$ is a viable option when interpretability or analytical-gradient inference is valued. The 1.06 PPL cost is meaningful, but the attractor readout and speed gains may justify it. For the lowest perplexity, use the MLP baseline.

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K_xi=4 channels]
       |
       +-- Structured V_theta (SQ3):
       |     xi_flat = flatten(xi_1..xi_K)                      [K_xi * d = 1024]
       |     V = -tau * logsumexp_k(-E_k/tau + log pi_k)        [K_mix=8 wells]
       |     f_theta = -analytical_grad_h V                     [closed-form]
       |
       +-- Pairwise V_phi (competitive structural MLP):
       |     scores = score_net(h_t, h_s)                       [for all s <= t]
       |     top-k selection via Gumbel-softmax                 [k=8 neighbours]
       |     f_phi = -grad_h V_phi(h_t, h_s)                   [autograd, sparse]
       |
       +-- Fock register pool (v2.1):
       |     M=16 virtual registers with Q/K/V creation gates
       |     LIFO stack discipline, salience decay
       |     Per-register tau and key subspaces
       |     Reverse channel (non-conservative exchange)
       |     f_fock = creation + destruction + exchange forces
       |
       +-- Total force: f = f_theta + f_phi + f_fock
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
       |
       +-- LayerNorm(h)
       |
   Logits = h @ E^T                                            [tied embeddings]
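The integration step in the diagram can be written out in a few lines of plain Python. This is a minimal sketch rather than the repository's code: the function name `damped_euler_step`, the step size `dt = 0.1`, the unit mass, and the toy force `f = -(h - mu)` are assumptions chosen for illustration, and the LayerNorm that follows each step in the diagram is omitted. Only the update order and `gamma = 0.30` come from this card.

```python
def damped_euler_step(h, v, f, m, dt, gamma):
    """One damped Euler step, applied elementwise to lists of floats.

    Mirrors the update order in the diagram:
        v += dt * f / m;  v /= (1 + dt * gamma);  h += dt * v
    """
    v_new = [(vi + dt * fi / m) / (1.0 + dt * gamma) for vi, fi in zip(v, f)]
    h_new = [hi + dt * vi for hi, vi in zip(h, v_new)]
    return h_new, v_new


# Toy setting (hypothetical): a single quadratic well centred at mu.
mu = [1.0, -2.0]
h, v = [0.0, 0.0], [0.0, 0.0]
for _ in range(2000):
    # f = -grad of 0.5 * |h - mu|^2
    force = [-(hi - mi) for hi, mi in zip(h, mu)]
    h, v = damped_euler_step(h, v, force, m=1.0, dt=0.1, gamma=0.30)

print(h)  # the state settles near mu
```

Dividing the velocity by `1 + dt * gamma` treats the damping term implicitly, which keeps the step stable for any non-negative `gamma`.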

Parameter	Value
Hidden dim (d)	256
Layers (L)	8
V_theta kind	SQ3 (mixture of K quadratic wells)
Mixture components (K_mix)	8
Temperature (tau)	1.0
Xi channels (K_xi)	4
V_phi kind	structural_competitive
V_phi hidden	128
Top-k (sparse routing)	8
Gumbel tau	1.0 (init), 0.3 (min)
Fock version	v2.1
Registers (M)	16
Register d_k	64
Stack discipline	LIFO
Reverse channel	Yes
Per-register tau/keys	Yes
Gathered V_phi	Yes
Per-layer V_phi scale	Yes
LN before distance	Yes
Layer checkpoint	Yes
Mass model	logfreq (frozen surprisal lookup)
Damping gamma	0.30 (fixed)
lambda_V (V_theta regularisation)	0.01
Total parameters	17,407,980
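The sparse-routing rows in the table above (top-k = 8, Gumbel tau) combine with the causal constraint `s <= t` from the architecture diagram. The sketch below shows the generic Gumbel top-k trick under that constraint. It is an illustration, not the repository's routing code: the function name `gumbel_topk_causal`, the random scores, and the choice to apply `tau` to the perturbed scores before the softmax are all assumptions.

```python
import math
import random


def gumbel_topk_causal(scores, t, k, tau, rng):
    """Pick up to k source positions s <= t using Gumbel-perturbed scores.

    scores: raw scores for target position t, one per source position.
    Returns (indices, weights): the chosen positions and their softmax
    weights at temperature tau.
    """
    perturbed = {}
    for s in range(t + 1):  # causal mask: only s <= t is eligible
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
        gumbel = -math.log(-math.log(u))                # Gumbel(0, 1) sample
        perturbed[s] = (scores[s] + gumbel) / tau
    chosen = sorted(perturbed, key=perturbed.get, reverse=True)[:k]
    # Softmax over the selected positions, shifted by the max for stability.
    top = max(perturbed[s] for s in chosen)
    exps = [math.exp(perturbed[s] - top) for s in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(32)]  # hypothetical scores
idx, w = gumbel_topk_causal(scores, t=20, k=8, tau=1.0, rng=rng)
print(idx, [round(x, 3) for x in w])
```

Positions near the start of a sequence have fewer than k eligible neighbours, so the sketch returns all of them in that case.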

The Fock Paradox: Maximal Compression, Moderate Gap

Across the three SPLM-family architectures tested with structured $V_\theta$, the Fock-PARFLM exhibits a paradox: it has the flattest $V_\theta$ landscape yet the second-largest expressivity gap.

Architecture	Structured V_theta PPL	MLP baseline PPL	Gap	Gap (%)	Mean V_theta	Range
Multi-Xi SPLM (SQ3)	13.33	11.51	1.82	5.5%	99.8	644.9
Fock-PARFLM v2.1 (SQ3, this model)	10.36	9.30	1.06	11.4%	0.008	19.1
Multi-Xi PARFLM (SQ3)	12.27	12.10	0.17	0.6%	0.02	31.4

The landscape compresses monotonically (the $V_\theta$ range falls from 644.9 for SPLM to 31.4 for PARFLM to 19.1 for Fock-PARFLM), but the expressivity gap is non-monotonic: PARFLM achieves the smallest gap despite a less compressed landscape than Fock-PARFLM.

Why? At 9.30 PPL, the Fock model operates closer to the dataset's entropy floor, where the marginal value of each nat of $V_\theta$ precision is higher. The Fock register mechanism (creation/destruction operators, stack discipline, reverse channel) creates a more structured dynamical regime where even a near-flat $V_\theta$ must provide fine-grained corrections that the diagonal quadratic parameterisation cannot match. The relationship between landscape compression and expressivity gap is therefore architecture-dependent, modulated by proximity to the entropy floor.

The force from the structured $V_\theta$ is computed in closed form:

$f_\theta = -\nabla_h V_\theta = -\sum_{k=1}^{K} q_k(\xi, h) \cdot a_k(\xi) \odot (h - \mu_k(\xi))$

where $q_{k}$ are the softmax responsibilities over the 8 quadratic wells. The $V_\phi$ force still uses autograd (sparse, over top-k=8 neighbours only), and the Fock forces use their own differentiable computation.
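The closed-form force can be checked numerically against the potential it is derived from. The sketch below assumes each well has the diagonal quadratic form $E_k = \tfrac{1}{2}\sum_i a_{k,i}(h_i - \mu_{k,i})^2$ and treats $\mu_k$, $a_k$, and $\pi_k$ as fixed constants, whereas in the model they depend on $\xi$. All function names and numerical values are hypothetical; the check compares the analytical force with a central finite difference of $V$.

```python
import math


def well_energies(h, mus, a_s):
    # E_k = 0.5 * sum_i a_k[i] * (h[i] - mu_k[i])**2  (assumed well form)
    return [0.5 * sum(a * (x - m) ** 2 for a, x, m in zip(a_k, h, mu_k))
            for mu_k, a_k in zip(mus, a_s)]


def well_logits(h, mus, a_s, log_pis, tau):
    return [-e / tau + lp for e, lp in zip(well_energies(h, mus, a_s), log_pis)]


def responsibilities(h, mus, a_s, log_pis, tau):
    # q_k = softmax_k(-E_k / tau + log pi_k), computed with a max shift.
    logits = well_logits(h, mus, a_s, log_pis, tau)
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def potential(h, mus, a_s, log_pis, tau):
    # V = -tau * logsumexp_k(-E_k / tau + log pi_k)
    logits = well_logits(h, mus, a_s, log_pis, tau)
    top = max(logits)
    return -tau * (top + math.log(sum(math.exp(z - top) for z in logits)))


def force(h, mus, a_s, log_pis, tau):
    # f = -sum_k q_k * a_k * (h - mu_k), elementwise in h
    q = responsibilities(h, mus, a_s, log_pis, tau)
    return [-sum(q_k * a_k[i] * (h[i] - mu_k[i])
                 for q_k, mu_k, a_k in zip(q, mus, a_s))
            for i in range(len(h))]


# Hypothetical two-well, three-dimensional example.
mus = [[1.0, 0.0, -1.0], [-2.0, 0.5, 0.3]]
a_s = [[1.0, 2.0, 0.5], [0.7, 0.2, 1.5]]
log_pis = [math.log(0.4), math.log(0.6)]
h, tau, eps = [0.2, -0.4, 0.9], 1.0, 1e-6

f_closed = force(h, mus, a_s, log_pis, tau)
f_numeric = []
for i in range(len(h)):
    hp = h[:i] + [h[i] + eps] + h[i + 1:]
    hm = h[:i] + [h[i] - eps] + h[i + 1:]
    dV = potential(hp, mus, a_s, log_pis, tau) - potential(hm, mus, a_s, log_pis, tau)
    f_numeric.append(-dV / (2 * eps))

max_err = max(abs(c - n) for c, n in zip(f_closed, f_numeric))
print(f"max |closed-form - finite-difference| = {max_err:.2e}")
```

The agreement follows from the identity that the gradient of a log-sum-exp is a softmax-weighted sum of the gradients of its arguments, which is why no autograd pass is needed for this term.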

For full derivations, all four structured variants (SQ1--SQ4), landscape compression analysis, attractor basin decoding, and hyperparameter selection strategies, see the companion note: Structured_VTheta_Design_and_Theory.md.

How to Get Started

import sys

import torch

# These relative paths must resolve from the current working directory.
sys.path.insert(0, "multixi")
sys.path.insert(0, "parf")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from parf.model_fock_parf_multixi import FockMultiXiPARFLM, FockMultiXiPARFConfig
from parf.model_structured_vtheta import MixtureQuadraticVTheta
from parf.model_structured_vtheta_multixi import StructuredVThetaMultiXiAdapter

# -- Build base model --
config = FockMultiXiPARFConfig(
    vocab_size=50257, d=256, n_layers=8,
    v_hidden=1024, v_depth=3,
    max_len=1024, block_size=512,
    gamma=0.30, xi_channels=4,
    xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
    xi_learnable=True, mass_mode="logfreq",
    logfreq_path="logfreq_surprisal_tinystories.npy",
    v_phi_kind="structural_competitive",
    v_phi_phi_hidden=128, v_phi_theta_hidden=128,
    top_k=8, score_head_hidden=32,
    gumbel_tau_init=1.0, gumbel_tau_min=0.3,
    gumbel_noise=True,
    use_gathered_v_phi=True,
    use_layer_checkpoint=True,
    ln_before_distance=True,
    per_layer_v_phi_scale=True,
    n_registers=16, register_d_k=64,
    register_tau_create_init=8.0,
    register_salience_decay=0.5,
    register_salience_threshold=0.01,
    register_stack_discipline="lifo",
    use_reverse_channel=True,
)
model = FockMultiXiPARFLM(config)

# -- Swap in structured V_theta --
K_xi, d = 4, 256
inner = MixtureQuadraticVTheta(d=K_xi * d, K=8, tau=1.0)
model.V_theta = StructuredVThetaMultiXiAdapter(inner, K=K_xi, d=d)

# -- Load checkpoint --
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-fock-parflm-structured-vtheta",
    filename="checkpoint/ckpt_best.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# -- Read attractor centres directly --
x = torch.randint(0, 50257, (1, 64))
with torch.no_grad():
    h = model._embed(x)
    xis = model._compute_xis(h)                        # (1, 64, 4, 256)
    centres = model.V_theta.attractor_centres(xis)      # (1, 64, 8, 1024)
    print(f"Attractor centres shape: {centres.shape}")   # 8 basins per token

Available Artifacts

File	Description
`checkpoint/ckpt_best.pt`	Best checkpoint (A2 arm, 10.36 PPL at step 14,400)
`training_log.jsonl`	Per-step training metrics (40 eval points)
`training_curve_A2.png`	Training/validation loss curves
`v_theta_hist_A2.png`	V_theta output distribution histogram
`landscape_stats_A2.json`	V_theta landscape statistics (mean, std, range)
`model_structured_vtheta.py`	Structured V_theta classes (SQ1--SQ4)
`model_structured_vtheta_multixi.py`	Multi-Xi adapter
`config.json`	Model configuration

Training Details

Training Data

TinyStories: a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocabulary size 50,257). Training cap: 5M tokens; validation: ~140k tokens.

Training Procedure

The base model architecture is identical to the Fock-PARFLM v2.1. The only modification is the $V_\theta$ replacement: the 3-layer MLP is swapped for the SQ3 mixture at model construction time, before training begins from scratch. The pairwise $V_\phi$ (competitive structural MLP, hidden=128, top-k=8) and Fock register pool (16 registers, LIFO, reverse channel) are unchanged.

Hyperparameter	Value
Optimizer	AdamW
Learning rate	5e-4 (cosine decay)
Warmup steps	400
Weight decay	0.01
Gradient clipping	1.0
Batch size	16
Block size	512
Training steps	16,000
lambda_V (V_theta regularisation)	0.01
Hardware	NVIDIA A100 40GB (Google Colab)
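The schedule and token budget implied by the table can be sketched as follows. The card gives the peak learning rate, the warmup length, and "cosine decay", but not the warmup shape or the final learning rate, so the linear warmup and the decay floor of zero are assumptions, as is the function name `lr_at`. The token count further assumes no gradient accumulation and that every sample fills the full block.

```python
import math

PEAK_LR = 5e-4
WARMUP_STEPS = 400
TOTAL_STEPS = 16_000


def lr_at(step, min_lr=0.0):
    """Linear warmup followed by cosine decay (min_lr is an assumed floor)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 399, 400, 8_200, 15_999):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")

# Token budget implied by the table: batch size 16, block size 512.
tokens_per_step = 16 * 512
tokens_seen = tokens_per_step * TOTAL_STEPS
print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens seen:     {tokens_seen:,}")
print(f"passes over a 5M-token cap: {tokens_seen / 5_000_000:.1f}")
```

Under these assumptions the run processes about 131M tokens, which corresponds to roughly 26 passes over the 5M-token training cap.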

Training Script

notebooks/conservative_arch/scaleup/colab_fock_multixi_structured_vtheta.ipynb: Colab notebook with structured V_theta arms, GDrive output, checkpointing, and live progress display.

Evaluation Results

TinyStories Validation Perplexity

Model	PPL	Params	Analytical V_theta grad	V_theta--MLP gap
Matched Attention (baseline)	7.81	19.5M	---	---
Fock-PARFLM v2.1 (MLP)	9.30	17.4M	No	---
Fock-PARFLM v2.1 (SQ3, this model)	10.36	17.4M	Yes	1.06 PPL
Multi-Xi SPLM (MLP)	11.51	16.5M	No	---
Multi-Xi PARFLM (MLP)	12.06	17.6M	No	---
Multi-Xi PARFLM (SQ3)	12.27	17.3M	Yes	0.17 PPL
Multi-Xi SPLM (SQ3)	13.33	17.3M	Yes	1.82 PPL

V_theta Landscape Statistics

Metric	This model (Fock-PARFLM)	PARFLM structured V_theta	SPLM structured V_theta
Mean V_theta	0.008	0.02	99.8
Std V_theta	0.26	0.51	26.3
Range	19.1	31.4	644.9

Learned Xi-Channel Decay Rates

The final learned decay rates $[\alpha_1, \ldots, \alpha_4] = [0.14, 0.56, 0.79, 0.95]$ are close to the SPLM values $[0.12, 0.59, 0.84, 0.97]$ and the PARFLM values $[0.11, 0.55, 0.81, 0.97]$, indicating that the causal EMA context structure is largely insensitive to the $V_\theta$ parameterisation, the presence of $V_\phi$, and the Fock dynamics.
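To give the decay rates a concrete scale, the sketch below converts each reported alpha into a half-life and a mean lag, measured in token positions. The card names the operation `causal_ema(h, alpha_k)` but does not spell out the recurrence, so the convention used here (alpha as the retention factor, `xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t`, initialised at the first input) is an assumption; under the opposite convention the roles of the channels would reverse.

```python
import math


def causal_ema(seq, alpha):
    """Causal EMA over a list of floats: xi_t = alpha*xi_{t-1} + (1-alpha)*h_t.

    The recurrence convention (alpha as the retention factor, xi initialised
    to the first input) is an assumption for illustration.
    """
    out, xi = [], seq[0]
    for h_t in seq:
        xi = alpha * xi + (1.0 - alpha) * h_t
        out.append(xi)
    return out


learned_alphas = [0.14, 0.56, 0.79, 0.95]  # values reported for this model
for alpha in learned_alphas:
    half_life = math.log(0.5) / math.log(alpha)  # positions until weight halves
    mean_lag = alpha / (1.0 - alpha)             # average age of the context
    print(f"alpha={alpha:.2f}  half-life={half_life:5.2f}  mean lag={mean_lag:5.2f}")

# Impulse response: a single spike at t=0 decays geometrically afterwards.
print([round(x, 4) for x in causal_ema([1.0] + [0.0] * 5, 0.56)])
```

Read this way, the four channels span context windows from about one position to a few tens of positions, which is the multi-scale structure the K-EMA design is meant to provide.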

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

Model	Design	PPL	HuggingFace
Multi-Xi SPLM (MLP)	Pure scalar potential	11.51	semsimula-splm-multixi
Multi-Xi SPLM (SQ3)	Structured scalar potential	13.33	semsimula-splm-multixi-structured-vtheta
Multi-Xi PARFLM (MLP)	Scalar + pairwise forces	12.06	semsimula-parflm-multixi
Multi-Xi PARFLM (SQ3)	Structured scalar + pairwise	12.27	semsimula-parflm-multixi-structured-vtheta
Fock-PARFLM v2.1 (MLP)	PARFLM + Fock registers	9.30	semsimula-fock-parflm
Fock-PARFLM v2.1 (SQ3)	Structured + pairwise + Fock	10.36	this model
Hybrid SPLM+Attn	Attention + SPLM refinement	8.50	semsimula-hybrid-splm

Collection: Semantic Simulation SPLM Model Family

Bias, Risks, and Limitations

Research checkpoint only. This model is a proof-of-concept for structured scalar potentials in Fock-augmented architectures, not a production system.
TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
English only. No multilingual capability.
Small scale. 17.4M parameters, 256-dim hidden states.
No safety training. No RLHF, DPO, or safety filtering has been applied.
V_phi and Fock forces still use autograd. Only the $V_\theta$ gradient is analytical; the $V_\phi$ pairwise force and Fock register forces still require torch.autograd.grad. The overall training speedup is therefore partial ($V_\phi$ and Fock forces dominate the cost).
Moderate expressivity gap. The 1.06 PPL gap (an 11.4% relative increase in perplexity) is larger than the PARFLM structured V_theta gap (0.17 PPL), reflecting the higher precision demands of the Fock architecture near the dataset's entropy floor.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

Hardware: NVIDIA A100 40GB (Google Colab)
Training time: ~3 hours (16,000 steps, A2 arm)
Carbon footprint: Estimated less than 2 kg CO2
