Fock-PARFLM v2.1 with Structured V_theta (SQ3 Mixture of Quadratic Wells)
A structured variant of the Fock-PARFLM v2.1 conservative language model in which the MLP scalar potential V_theta is replaced by a mixture of K=8 diagonal quadratic wells (SQ3), while retaining the full MLP-based pairwise potential V_phi and the Fock register mechanism (16 registers, LIFO stack discipline, reverse channel). This replacement yields:
- Full analytical gradients for V_theta: no `torch.autograd.grad` is needed for the scalar potential force, eliminating the second-order computation graph for the V_theta component
- Explicit attractor centres: the 8 semantic attractors are readable directly from the model parameters, with no gradient-descent extraction required
- Most compressed landscape: with both V_phi and the Fock registers carrying the force budget, V_theta collapses to the flattest bias field observed across all three architectures (mean 0.008, range 19.1)
The trade-off is a 1.06 PPL gap: 10.36 PPL vs the MLP baseline's 9.30 PPL (an 11.4% relative increase in PPL). This is larger than the PARFLM gap (0.17 PPL) despite even greater landscape compression, a tension discussed below as the "Fock paradox".
This model is from the Semantic Simulation framework.
Table of Contents
- When to Use This Model
- Architecture
- The Fock Paradox: Maximal Compression, Moderate Gap
- How to Get Started
- Training Details
- Evaluation Results
- SPLM Family Overview
- Bias, Risks, and Limitations
- Citation
When to Use This Model
Choose this structured variant over the MLP-based Fock-PARFLM v2.1 when:
| Priority | Structured V_theta (this model) | MLP V_theta (baseline) |
|---|---|---|
| Interpretability | 8 explicit attractor centres, zero-cost basin readout | Black-box; requires 1,500-step GD extraction per prompt |
| Inference speed | ~2x faster V_theta force computation (analytical gradient) | Standard (autograd for both V_theta and V_phi) |
| Raw PPL | 10.36 | 9.30 |
| PPL gap | 1.06 PPL (11.4%) | — |
| Memory | No second-order graph for V_theta (V_phi graph retained) | Full graph for both |
Bottom line: for Fock-PARFLM, structured V_theta is a viable option when interpretability or analytical-gradient inference is valued. The 1.06 PPL cost is meaningful, but the attractor readout and speed gains may justify it. For the lowest PPL, use the MLP baseline.
Architecture
Input tokens x_1, ..., x_T
|
Embedding E[x] + positional encoding
|
For each of L=8 integration steps:
|
+-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k) [K_xi=4 channels]
|
+-- Structured V_theta (SQ3):
| xi_flat = flatten(xi_1..xi_K) [K_xi * d = 1024]
| V = -tau * logsumexp_k(-E_k/tau + log pi_k) [K_mix=8 wells]
| f_theta = -analytical_grad_h V [closed-form]
|
+-- Pairwise V_phi (competitive structural MLP):
| scores = score_net(h_t, h_s) [for all s <= t]
| top-k selection via Gumbel-softmax [k=8 neighbours]
| f_phi = -grad_h V_phi(h_t, h_s) [autograd, sparse]
|
+-- Fock register pool (v2.1):
| M=16 virtual registers with Q/K/V creation gates
| LIFO stack discipline, salience decay
| Per-register tau and key subspaces
| Reverse channel (non-conservative exchange)
| f_fock = creation + destruction + exchange forces
|
+-- Total force: f = f_theta + f_phi + f_fock
|
+-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
|
+-- LayerNorm(h)
|
Logits = h @ E^T [tied embeddings]
| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| V_theta kind | SQ3 (mixture of K quadratic wells) |
| Mixture components (K_mix) | 8 |
| Temperature (tau) | 1.0 |
| Xi channels (K_xi) | 4 |
| V_phi kind | structural_competitive |
| V_phi hidden | 128 |
| Top-k (sparse routing) | 8 |
| Gumbel tau | 1.0 (init), 0.3 (min) |
| Fock version | v2.1 |
| Registers (M) | 16 |
| Register d_k | 64 |
| Stack discipline | LIFO |
| Reverse channel | Yes |
| Per-register tau/keys | Yes |
| Gathered V_phi | Yes |
| Per-layer V_phi scale | Yes |
| LN before distance | Yes |
| Layer checkpoint | Yes |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping gamma | 0.30 (fixed) |
| lambda_V (V_theta regularisation) | 0.01 |
| Total parameters | 17,407,980 |
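The damped Euler update in the diagram above can be sketched in a few lines. This is a minimal illustration, not the repository's integrator: the function name `damped_euler_step`, the step size `dt=1.0`, and the per-token mass tensor `m` are assumptions made here, and the LayerNorm that follows each step in the diagram is omitted.

```python
import torch

def damped_euler_step(h, v, f, m, dt=1.0, gamma=0.30):
    """One update following the diagram's three assignments (dt is an assumed value)."""
    v = v + dt * f / m            # accelerate by force over mass
    v = v / (1.0 + dt * gamma)    # implicit damping shrinks the velocity
    h = h + dt * v                # advance the hidden state with the new velocity
    return h, v

# Toy check: with zero force, the velocity decays by 1 / (1 + dt * gamma) per step.
h = torch.zeros(2, 4, 8)          # (batch, tokens, d); toy sizes, not d=256
v = torch.ones_like(h)
m = torch.ones(2, 4, 1)           # hypothetical per-token mass, broadcast over d
f = torch.zeros_like(h)
for _ in range(8):                # L=8 integration steps, as in the table
    h, v = damped_euler_step(h, v, f, m)
```

Dividing by `1 + dt * gamma` rather than multiplying by `1 - dt * gamma` keeps the velocity update contractive for any positive step size, which is one plausible reason for writing the damping this way.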
The Fock Paradox: Maximal Compression, Moderate Gap
Across the three SPLM-family architectures tested with structured V_theta, the Fock-PARFLM exhibits a striking paradox: the flattest landscape yet the second-largest expressivity gap.
| Architecture | Structured V_theta PPL | MLP baseline PPL | Gap | Gap (%) | Mean V_theta | Range |
|---|---|---|---|---|---|---|
| Multi-Xi SPLM (SQ3) | 13.33 | 11.51 | 1.82 | 5.5% | 99.8 | 644.9 |
| Fock-PARFLM v2.1 (SQ3, this model) | 10.36 | 9.30 | 1.06 | 11.4% | 0.008 | 19.1 |
| Multi-Xi PARFLM (SQ3) | 12.27 | 12.10 | 0.17 | 0.6% | 0.02 | 31.4 |
The landscape compression is monotonic (SPLM > PARFLM > Fock-PARFLM), but the expressivity gap is non-monotonic: PARFLM achieves the smallest gap despite a less compressed landscape than Fock-PARFLM.
Why? At 9.30 PPL, the Fock model operates closer to the dataset's entropy floor, where the marginal value of each nat of precision is higher. The Fock register mechanism (creation/destruction operators, stack discipline, reverse channel) creates a more structured dynamical regime where even a near-flat V_theta must provide fine-grained corrections that the diagonal quadratic parameterisation cannot match. The relationship between landscape compression and expressivity gap is therefore architecture-dependent, modulated by proximity to the entropy floor.
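To make the "nat of precision" argument concrete, the PPL figures from the comparison table can be converted to per-token cross-entropy. The sketch below assumes the usual definition PPL = exp(mean cross-entropy in nats), which the model card does not state explicitly; the helper name `gap_summary` is made up for this example.

```python
import math

# (structured PPL, MLP baseline PPL) copied from the comparison table above.
runs = {
    "Multi-Xi SPLM": (13.33, 11.51),
    "Fock-PARFLM v2.1": (10.36, 9.30),
    "Multi-Xi PARFLM": (12.27, 12.10),
}

def gap_summary(structured_ppl, baseline_ppl):
    """Return the gap in PPL, in nats per token, and relative to the baseline PPL."""
    ppl_gap = structured_ppl - baseline_ppl
    nat_gap = math.log(structured_ppl) - math.log(baseline_ppl)
    return ppl_gap, nat_gap, ppl_gap / baseline_ppl

summary = {name: gap_summary(*ppl) for name, ppl in runs.items()}
for name, (ppl_gap, nat_gap, rel) in summary.items():
    print(f"{name}: {ppl_gap:.2f} PPL, {nat_gap:.3f} nats/token, {rel:.1%} of baseline PPL")
```

Under that assumption, this model's 1.06 PPL gap corresponds to roughly 0.108 nats per token.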
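The force from the structured V_theta is computed in closed form. Differentiating the log-sum-exp mixture given in the architecture diagram yields

\[
f_\theta = -\nabla_h V_\theta = -\sum_{k=1}^{8} r_k \, \nabla_h E_k,
\qquad
r_k = \mathrm{softmax}_k\!\left(-E_k/\tau + \log \pi_k\right),
\]

where \(r_k\) are the softmax responsibilities over the 8 quadratic wells. The V_phi force still uses autograd (sparse, over top-k=8 neighbours only), and the Fock forces use their own differentiable computation.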
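The closed-form force of a log-sum-exp mixture can be checked against autograd on a toy problem. The sketch below is not the repository's `MixtureQuadraticVTheta`; it assumes a specific diagonal well form, E_k(x) = 0.5 * sum_i a[k, i] * (x[i] - mu[k, i])^2, and the names `mixture_wells`, `mu`, `a`, and `log_pi` are hypothetical.

```python
import torch

def mixture_wells(x, mu, a, log_pi, tau=1.0):
    """V(x) = -tau * logsumexp_k(-E_k / tau + log pi_k) with ASSUMED diagonal wells
    E_k(x) = 0.5 * sum_i a[k, i] * (x[i] - mu[k, i])**2."""
    diff = x.unsqueeze(-2) - mu                  # (..., K, D)
    E = 0.5 * (a * diff.pow(2)).sum(-1)          # (..., K)
    logits = -E / tau + log_pi                   # (..., K)
    V = -tau * torch.logsumexp(logits, dim=-1)   # (...)
    r = torch.softmax(logits, dim=-1)            # responsibilities r_k
    # Closed-form force: -grad V = -sum_k r_k * a_k * (x - mu_k)
    force = -(r.unsqueeze(-1) * a * diff).sum(-2)
    return V, force

torch.manual_seed(0)
K, D = 8, 16                                     # 8 wells; toy dimension, not 1024
mu = torch.randn(K, D)
a = torch.rand(K, D) + 0.1                       # positive diagonal curvatures
log_pi = torch.log_softmax(torch.randn(K), dim=0)
x = torch.randn(5, D, requires_grad=True)

V, force = mixture_wells(x, mu, a, log_pi)
(grad_V,) = torch.autograd.grad(V.sum(), x)      # reference gradient via autograd
max_err = (force + grad_V).abs().max().item()    # force should equal -grad_V
```

The autograd call appears here only as a reference for the check; in the closed-form path the force comes from the responsibilities and well parameters alone, so no backward graph through the potential is needed.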
For full derivations, all four structured variants (SQ1--SQ4), landscape compression analysis, attractor basin decoding, and hyperparameter selection strategies, see the companion note: Structured_VTheta_Design_and_Theory.md.
How to Get Started
import torch, sys
sys.path.insert(0, "multixi")
sys.path.insert(0, "parf")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")
from parf.model_fock_parf_multixi import FockMultiXiPARFLM, FockMultiXiPARFConfig
from parf.model_structured_vtheta import MixtureQuadraticVTheta
from parf.model_structured_vtheta_multixi import StructuredVThetaMultiXiAdapter
# -- Build base model --
config = FockMultiXiPARFConfig(
    vocab_size=50257, d=256, n_layers=8,
    v_hidden=1024, v_depth=3,
    max_len=1024, block_size=512,
    gamma=0.30, xi_channels=4,
    xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
    xi_learnable=True, mass_mode="logfreq",
    logfreq_path="logfreq_surprisal_tinystories.npy",
    v_phi_kind="structural_competitive",
    v_phi_phi_hidden=128, v_phi_theta_hidden=128,
    top_k=8, score_head_hidden=32,
    gumbel_tau_init=1.0, gumbel_tau_min=0.3,
    gumbel_noise=True,
    use_gathered_v_phi=True,
    use_layer_checkpoint=True,
    ln_before_distance=True,
    per_layer_v_phi_scale=True,
    n_registers=16, register_d_k=64,
    register_tau_create_init=8.0,
    register_salience_decay=0.5,
    register_salience_threshold=0.01,
    register_stack_discipline="lifo",
    use_reverse_channel=True,
)
model = FockMultiXiPARFLM(config)
# -- Swap in structured V_theta --
K_xi, d = 4, 256
inner = MixtureQuadraticVTheta(d=K_xi * d, K=8, tau=1.0)
model.V_theta = StructuredVThetaMultiXiAdapter(inner, K=K_xi, d=d)
# -- Load checkpoint --
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-fock-parflm-structured-vtheta",
    filename="checkpoint/ckpt_best.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# -- Read attractor centres directly --
x = torch.randint(0, 50257, (1, 64))
with torch.no_grad():
    h = model._embed(x)
    xis = model._compute_xis(h)  # (1, 64, 4, 256)
    centres = model.V_theta.attractor_centres(xis)  # (1, 64, 8, 1024)
print(f"Attractor centres shape: {centres.shape}")  # 8 basins per token
Available Artifacts
| File | Description |
|---|---|
| `checkpoint/ckpt_best.pt` | Best checkpoint (A2 arm, 10.36 PPL at step 14,400) |
| `training_log.jsonl` | Per-step training metrics (40 eval points) |
| `training_curve_A2.png` | Training/validation loss curves |
| `v_theta_hist_A2.png` | V_theta output distribution histogram |
| `landscape_stats_A2.json` | V_theta landscape statistics (mean, std, range) |
| `model_structured_vtheta.py` | Structured V_theta classes (SQ1--SQ4) |
| `model_structured_vtheta_multixi.py` | Multi-Xi adapter |
| `config.json` | Model configuration |
Training Details
Training Data
TinyStories: a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.
Training Procedure
The base model architecture is identical to the Fock-PARFLM v2.1. The only modification is the V_theta replacement: the 3-layer MLP is swapped for the SQ3 mixture at model construction time, before training begins from scratch. The pairwise V_phi (competitive structural MLP, hidden=128, top-k=8) and Fock register pool (16 registers, LIFO, reverse channel) are unchanged.
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 |
| lambda_V (V_theta regularisation) | 0.01 |
| Hardware | NVIDIA A100 40GB (Google Colab) |
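The learning-rate row in the table above can be read as a warmup-plus-cosine schedule. The sketch below is one plausible reading, not the notebook's code: linear warmup and a decay floor of zero are assumptions (the table only gives the peak rate, "cosine decay", and 400 warmup steps), and `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step, base_lr=5e-4, warmup=400, total=16_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr (assumed to be 0)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate for every training step listed in the table.
schedule = [lr_at(s) for s in range(16_000)]
```

In PyTorch, a schedule like this could be attached to AdamW through `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step) / base_lr` as the multiplicative factor.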
Training Script
notebooks/conservative_arch/scaleup/colab_fock_multixi_structured_vtheta.ipynb: Colab notebook with structured V_theta arms, GDrive output, checkpointing, and live progress display.
Evaluation Results
TinyStories Validation Perplexity
| Model | PPL | Params | Analytical V_theta grad | V_theta--MLP gap |
|---|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | --- | --- |
| Fock-PARFLM v2.1 (MLP) | 9.30 | 17.4M | No | --- |
| Fock-PARFLM v2.1 (SQ3, this model) | 10.36 | 17.4M | Yes | 1.06 PPL |
| Multi-Xi SPLM (MLP) | 11.51 | 16.5M | No | --- |
| Multi-Xi PARFLM (MLP) | 12.06 | 17.6M | No | --- |
| Multi-Xi PARFLM (SQ3) | 12.27 | 17.3M | Yes | 0.17 PPL |
| Multi-Xi SPLM (SQ3) | 13.33 | 17.3M | Yes | 1.82 PPL |
V_theta Landscape Statistics
| Metric | This model (Fock-PARFLM) | PARFLM structured V_theta | SPLM structured V_theta |
|---|---|---|---|
| Mean V_theta | 0.008 | 0.02 | 99.8 |
| Std V_theta | 0.26 | 0.51 | 26.3 |
| Range | 19.1 | 31.4 | 644.9 |
Learned Xi-Channel Decay Rates
The final learned alpha values are stable and consistent with the SPLM and PARFLM values, confirming that the causal EMA context structure is invariant to the V_theta parameterisation, the presence of V_phi, and the Fock dynamics.
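For readers unfamiliar with the xi channels, a causal EMA over the token axis can be sketched as follows. The recurrence convention is an assumption (the model card does not say whether alpha weights the past or the present), the function is a stand-in for the repository's `causal_ema`, and the alpha values are the initial ones from the config, not the learned ones.

```python
import torch

def causal_ema(h, alpha):
    """xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t along the token axis, xi_0 = h_0.
    Which side alpha weights is an ASSUMPTION; the source does not fix it."""
    xi = [h[:, 0]]
    for t in range(1, h.shape[1]):
        xi.append(alpha * xi[-1] + (1.0 - alpha) * h[:, t])
    return torch.stack(xi, dim=1)                 # (batch, T, d)

alphas = [0.25, 0.5, 0.75, 0.95]                  # xi_alpha_inits from the config
h = torch.randn(1, 64, 256)
xis = torch.stack([causal_ema(h, a) for a in alphas], dim=2)   # (1, 64, 4, 256)

# Causality check: overwriting later tokens must leave earlier xi values unchanged.
h2 = h.clone()
h2[:, 40:] = 0.0
xis2 = torch.stack([causal_ema(h2, a) for a in alphas], dim=2)
```

Each channel summarises the prefix at a different time scale, which is why four decay rates produce a (batch, T, 4, d) context tensor.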
SPLM Family Overview
This model is part of the Semantic Simulation SPLM family:
| Model | Design | PPL | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM (MLP) | Pure scalar potential | 11.51 | semsimula-splm-multixi |
| Multi-Xi SPLM (SQ3) | Structured scalar potential | 13.33 | semsimula-splm-multixi-structured-vtheta |
| Multi-Xi PARFLM (MLP) | Scalar + pairwise forces | 12.06 | semsimula-parflm-multixi |
| Multi-Xi PARFLM (SQ3) | Structured scalar + pairwise | 12.27 | semsimula-parflm-multixi-structured-vtheta |
| Fock-PARFLM v2.1 (MLP) | PARFLM + Fock registers | 9.30 | semsimula-fock-parflm |
| Fock-PARFLM v2.1 (SQ3) | Structured + pairwise + Fock | 10.36 | this model |
| Hybrid SPLM+Attn | Attention + SPLM refinement | 8.50 | semsimula-hybrid-splm |
Collection: Semantic Simulation SPLM Model Family
Bias, Risks, and Limitations
- Research checkpoint only. This model is a proof-of-concept for structured scalar potentials in Fock-augmented architectures, not a production system.
- TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
- English only. No multilingual capability.
- Small scale. 17.4M parameters, 256-dim hidden states.
- No safety training. No RLHF, DPO, or safety filtering has been applied.
- V_phi and Fock forces still use autograd. Only the V_theta gradient is analytical; the pairwise V_phi force and Fock register forces still require `torch.autograd.grad`. The overall training speedup is therefore partial (V_phi and Fock forces dominate the cost).
- Moderate expressivity gap. The 1.06 PPL gap (11.4% above the baseline PPL) is larger than the PARFLM structured V_theta gap (0.17 PPL), reflecting the higher precision demands of the Fock architecture near the dataset's entropy floor.
Citation
@misc{Gueorguiev2026SemSim,
author = {Gueorguiev, Dimitar P.},
title = {Semantic Simulation: A Prescriptive Lagrangian Framework
for Efficient Semantic Inference --- A Conservative-by-
Construction Language Model and the Shared-Potential
Separator, with a Correspondence to Joint Embedding
Predictive Architectures},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19712427},
url = {https://doi.org/10.5281/zenodo.19712427},
note = {Version v15 (Jun 7, 2026).
Companion code repository (DOI 10.5281/zenodo.20579561):
\url{https://github.com/dimitarpg13/semsimula-paper}}
}
Environmental Impact
- Hardware: NVIDIA A100 40GB (Google Colab)
- Training time: ~3 hours (16,000 steps, A2 arm)
- Carbon footprint: Estimated less than 2 kg CO2