Fock Attention PARFLM (Direct Token-to-Token Exchange Force Language Model)
The Fock Attention PARFLM implements the Section 5.1 Feynman diagram as a literal non-conservative force: each token emits a virtual photon carrying a key and payload, and each token absorbs with a query. The exchange coupling is \(\alpha_{ij} = \mathrm{softmax}_j(q_i \cdot k_j / \sqrt{d_k})\) and the force is \(F_i = \sum_j \alpha_{ij} v_j\). This is the instantaneous-exchange limit of the Fock mechanism: no registers, no persistence, no creation/destruction gates.
This "Route 2" across the Conservative Obstruction is the counterpart to the Fock register pool's "Route 1". It achieves 9.42 PPL on TinyStories -- slightly behind the register-based Fock-PARFLM v2.1 (9.30), confirming that register persistence provides a small but consistent advantage once routing is fixed.
Part of the Semantic Simulation framework.
Table of Contents
- Model Details
- Architecture
- Why Not a Transformer?
- Geometric Capabilities
- How to Get Started
- Training Details
- Evaluation Results
- SPLM Family Overview
- Bias, Risks, and Limitations
- Citation
- Environmental Impact
Model Details
Model Description
The Fock Attention PARFLM extends the Multi-Xi PARFLM with a direct token-to-token exchange force inspired by Feynman's virtual particle exchange diagram. Unlike the register-based Fock-PARFLM, which uses persistent auxiliary state, this variant implements instantaneous exchange:
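- Token \(j\) emits: key \(k_j = W_K h_j\), payload \(v_j = W_V h_j\)
- Token \(i\) absorbs: query \(q_i = W_Q h_i\)
- Coupling: \(\alpha_{ij} = \mathrm{softmax}_j(q_i \cdot k_j / \sqrt{d_k})\), under a causal mask
- Exchange force on token i: \(F_i = \sum_j \alpha_{ij} v_j\)
The force is injected post-Verlet step as a non-conservative addition: \(h_i \leftarrow h_i + (\Delta t^2 / m)\, \tanh(s_{\text{ex}})\, F_i\).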
The learned exchange scale converges to approximately -0.32 (repulsive), meaning the exchange force pushes tokens apart in hidden-state space rather than attracting them.
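As a rough illustration of the exchange step, the sketch below computes the causal softmax coupling \(\alpha_{ij} = \mathrm{softmax}_j(q_i \cdot k_j / \sqrt{d_k})\), the force \(F_i = \sum_j \alpha_{ij} v_j\), and the injection \(h_i \leftarrow h_i + (\Delta t^2 / m) \tanh(s_{\text{ex}}) F_i\) for a toy sequence in dependency-free Python. It is not the code in model_fock_attention.py: the single head, the identity projections, a mask that includes the diagonal, and unit dt and m are all assumptions made for readability, and s_ex = -0.32 is used only to show the sign of the effect.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def exchange_force(h, W_q, W_k, W_v):
    """Single-head causal exchange: returns (alpha, F) for hidden states h."""
    q = [matvec(W_q, h_i) for h_i in h]
    k = [matvec(W_k, h_j) for h_j in h]
    v = [matvec(W_v, h_j) for h_j in h]
    d_k = len(q[0])
    alpha, force = [], []
    for i in range(len(h)):
        # Causal mask (ASSUMED to include j == i): token i absorbs only from j <= i.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        row = [w / z for w in weights]
        alpha.append(row)
        force.append([sum(row[j] * v[j][c] for j in range(i + 1))
                      for c in range(len(v[0]))])
    return alpha, force

def inject(h, force, s_ex, dt=1.0, m=1.0):
    """Non-conservative update h_i + (dt^2 / m) * tanh(s_ex) * F_i (returns a new list)."""
    scale = (dt ** 2 / m) * math.tanh(s_ex)
    return [[x + scale * f for x, f in zip(h_i, f_i)]
            for h_i, f_i in zip(h, force)]

# Toy example: 3 tokens, d = 2, identity projections (illustrative values only).
I2 = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, F = exchange_force(h, I2, I2, I2)
h_new = inject(h, F, s_ex=-0.32)
```

With a negative scale, each hidden state moves against the payload average it absorbed, which is the repulsive behaviour described above. The multi-head case would split the projections into 4 heads of width 32; how the heads are merged back to width 256 is not specified in this card.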
- Developed by: Dimitar P. Gueorguiev (Independent Researcher)
- Model type: Conservative autoregressive LM with direct exchange force
- Language: English
- License: CC-BY-4.0
Model Sources
- Paper: Semantic Simulation: A Prescriptive Lagrangian Framework for Efficient Semantic Inference (Section 17c)
- Repository: github.com/dimitarpg13/semsimula-paper
- Model source code: notebooks/conservative_arch/parf/model_fock_attention.py
Architecture
Input tokens x_1, ..., x_T
|
Embedding E[x] + positional encoding
|
For each of L=8 integration steps:
|
+-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k) [K=4 channels]
|
+-- Single-body: V_theta([xi_1..xi_K, h]) -> R [3-layer MLP]
|
+-- Sparse PARF: V_phi(h_t, h_s) [top-k=8 routing]
|
+-- Conservative force: f = -grad_h(V_theta + V_phi) [autograd]
|
+-- Damped Euler step: v += dt*f/m; v /= (1+dt*gamma); h += dt*v
|
+-- Direct exchange force (Feynman diagram):
| q_i = W_Q * h_i [4 heads, d_k=32]
| k_j = W_K * h_j
| v_j = W_V * h_j
| alpha_ij = softmax_j(q_i . k_j / sqrt(d_k)) [causal mask]
| F_i = sum_j alpha_ij * v_j
| h += (dt^2/m) * tanh(s_ex) * F_i [non-conservative]
|
+-- LayerNorm(h)
|
Logits = h @ E^T [tied embeddings]
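The K-EMA context channels in the listing above can be pictured with a small sketch. The exact form of causal_ema is not given in this card, so the recurrence below (xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t with a zero initial state) and the four decay rates are assumptions chosen for illustration; only the causal, one-channel-per-rate structure is taken from the listing.

```python
def causal_ema(h, alpha):
    """Causal EMA over a sequence of vectors (ASSUMED recurrence, zero initial state)."""
    xi, prev = [], [0.0] * len(h[0])
    for h_t in h:
        prev = [alpha * p + (1.0 - alpha) * x for p, x in zip(prev, h_t)]
        xi.append(prev)
    return xi

def multi_xi(h, alphas):
    """One EMA channel per decay rate; returns K sequences of summaries."""
    return [causal_ema(h, a) for a in alphas]

# Hypothetical decay rates for K = 4 channels (the trained values are not given here).
alphas = [0.5, 0.75, 0.9, 0.97]
h = [[1.0], [0.0], [0.0], [2.0]]
channels = multi_xi(h, alphas)

# Causality check: editing the last token leaves all earlier summaries unchanged.
h_edit = h[:-1] + [[-5.0]]
channels_edit = multi_xi(h_edit, alphas)
```

Each channel summarises the past at a different time scale, and because position t only reads positions up to t, the summaries can be updated one token at a time.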
| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| \(V_\theta\) hidden / depth | 1024 / 3 |
| Xi channels (K) | 4 |
| \(V_\phi\) kind | structural_competitive |
| \(V_\phi\) hidden (H) | 128 |
| Sparse routing top_k | 8 |
| Exchange heads | 4 |
| Exchange d_k | 32 |
| Learned exchange scale | ~ -0.32 (repulsive) |
| Mass model | logfreq |
| Damping | 0.30 (fixed) |
| Total parameters | 16,714,708 |
Key Design Properties
- Two routes across the obstruction: This model takes "Route 2" (direct exchange) while the Fock-PARFLM takes "Route 1" (register-mediated). Both achieve comparable PPL.
- Repulsive exchange: The learned scale is negative, meaning the exchange force diversifies token representations rather than collapsing them.
- Minimal overhead: Only ~131K parameters for the exchange mechanism (Q/K/V projections + scale scalar).
- Otherwise conservative: The core dynamics remain fully conservative; the exchange force is the only non-conservative component.
Why Not a Transformer?
The Fock Attention PARFLM is not based on the Transformer architecture. There are no Transformer-style FFN towers. The conservative core is driven by two small scalar-potential MLPs: \(V_\theta\) (3.4M params) and \(V_\phi\) (19K params). The direct exchange force adds only ~131K parameters (Q/K/V projections).
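The parameter figures quoted above can be roughly reproduced with back-of-the-envelope arithmetic. The layer layouts in the sketch below are guesses rather than a reading of model_fock_attention.py: the single-body MLP is assumed to have three hidden layers of width 1024 with biases, and the exchange block is assumed to have bias-free Q/K/V projections plus an output projection back to width 256. They were picked because they land close to the stated 3.4M and ~131K; the source file remains authoritative.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of one dense layer."""
    return n_in * n_out + (n_out if bias else 0)

d, K = 256, 4                  # hidden dim, xi channels
v_hidden, v_depth = 1024, 3    # single-body MLP width / depth
n_heads, d_k = 4, 32           # exchange heads and per-head width
vocab_size = 50257

# Single-body potential: input [xi_1..xi_K, h], ASSUMED three hidden layers + scalar output.
widths = [(K + 1) * d] + [v_hidden] * v_depth + [1]
v_theta = sum(linear_params(a, b) for a, b in zip(widths, widths[1:]))

# Exchange: bias-free Q/K/V, an ASSUMED output projection, and the scalar scale.
inner = n_heads * d_k
exchange = (3 * linear_params(d, inner, bias=False)
            + linear_params(inner, d, bias=False) + 1)

# Token embedding, shared with the output layer (tied embeddings).
embedding = vocab_size * d

print(f"V_theta:   {v_theta:,}")
print(f"exchange:  {exchange:,}")
print(f"embedding: {embedding:,}")
```

Under these assumptions the token embedding accounts for most of the 16.7M total, which is why the dynamics themselves fit in a few million parameters.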
Key structural differences from Transformers:
| Property | Transformer (GPT-2 small) | Fock Attention (this model) |
|---|---|---|
| Architecture | Self-attention + FFN blocks | Scalar-potential gradient flow + direct exchange |
| Core computation | 50.3M (MLP) + 28.3M (attention) | 3.4M (\(V_\theta\)) + 19K (\(V_\phi\)) + 131K (exchange) |
| Runtime state per token | KV-cache grows linearly | Exchange force has attention-like cost |
| Total parameters | 124M | 16.7M |
| Pairwise token interaction | dense attention | direct exchange + sparse PARF |
Note: Because this model uses a direct token-to-token exchange force (the "Route 2" across the Conservative Obstruction), its runtime cost scales like attention's. For the register-mediated alternative with comparable PPL, see the Fock-PARFLM v2.1 ("Route 1"), which achieves 9.30 PPL vs this model's 9.42 PPL.
Geometric Capabilities
Note: This model uses a direct exchange force that is non-conservative, breaking the full Riemannian guarantee. The full Riemannian geometry — Jacobi metric, computable geodesics, curvature-based hallucination detection, native chain-of-thought — is available only in the purely conservative variants: Multi-Xi SPLM, Multi-Xi PARFLM, and Fock-PARFLM v2.1. See Section 18d and Section 23 of the paper for details.
Damped Riemannian Geometry (June 2026 update)
A Riemannian Geometry Diagnostic Battery run on all three SPLM-family checkpoints confirmed that while the force field is conservative, the full dynamics are dominated by the damping term in the integrator. Key findings:
- Metric validity (Arm 1): The layer-dependent conformal factor is positive at 100% of positions, so the Riemannian metric is well-defined everywhere.
- Geodesic compliance (Arm 2): Undamped geodesics predict the wrong direction; the damped geodesic equation (with a friction term) is required.
- Energy dissipation (Arm 4): Energy decays monotonically across layers (SPLM: 172% drift), consistent with the designed damping rather than a numerical artefact.
- Asymmetry (Arm 5): The layer map is strongly asymmetric (\(R^2_{\text{sym}} \ll 0\)), expected for damped dynamics + LayerNorm.
The theoretical framework has been updated from undamped Maupertuis-Jacobi to damped Riemannian geometry with a contact Hamiltonian interpretation. See the companion note for full details.
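To make the role of the damping term concrete, the sketch below applies the integrator update from the Architecture listing (v += dt*f/m; v /= (1+dt*gamma); h += dt*v) to a toy one-dimensional quadratic potential. The potential, step size, mass, and initial state are illustrative assumptions, not values from the trained model; only gamma = 0.30 is taken from the parameter table.

```python
def damped_step(h, v, force, dt, m, gamma):
    """One integrator step in the order given in the Architecture listing."""
    v = v + dt * force(h) / m      # kick from the conservative force
    v = v / (1.0 + dt * gamma)     # implicit damping of the velocity
    h = h + dt * v                 # drift with the updated velocity
    return h, v

def energy(h, v, m, k):
    """Kinetic plus potential energy for V(h) = 0.5 * k * h^2."""
    return 0.5 * m * v * v + 0.5 * k * h * h

# Toy setup (illustrative, not the learned potential).
m, k, gamma, dt = 1.0, 1.0, 0.30, 0.1

def force(h):
    return -k * h

h, v = 1.0, 0.0
energies = [energy(h, v, m, k)]
for _ in range(50):
    h, v = damped_step(h, v, force, dt, m, gamma)
    energies.append(energy(h, v, m, k))

print(f"E_0 = {energies[0]:.3f}, E_50 = {energies[-1]:.3f}")
```

Even with a purely conservative force, the velocity rescaling removes energy at every step, so an undamped geodesic would not be expected to describe the resulting trajectory.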
How to Get Started
# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch
import torch
import sys
sys.path.insert(0, "parf")
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")
from parf.model_fock_attention import FockAttentionPARFLM, FockAttentionConfig
config = FockAttentionConfig(
    vocab_size=50257,
    d=256,
    n_layers=8,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma=0.30,
    xi_channels=4,
    v_phi_kind="structural_competitive",
    v_phi_hidden=128,
    top_k=8,
    exchange_n_heads=4,
    exchange_d_k=32,
)
model = FockAttentionPARFLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)
Available Checkpoint
A trained checkpoint (PPL 9.42, 16k steps) is included in this repository:
| File | Description |
|---|---|
| checkpoint/model.pt | Full model state dict (64 MB) |
| training_log.jsonl | Per-step training metrics |
| loss_curve.png | Training/validation loss plot |
| training_summary.md | Hyperparameters and final metrics |
To load the checkpoint:
from huggingface_hub import hf_hub_download
import torch
# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-fock-attention",
    filename="checkpoint/model.pt",
)
# Load into model (after creating model as above)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()
Training Details
Training Data
TinyStories -- GPT-2 BPE tokenization. Training cap: 5M tokens.
Training Procedure
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 |
| Memory optimisation | Level-2 grad checkpoint |
| Hardware | A100/H100 (Google Colab) |
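The learning-rate schedule in the table can be sketched as follows. The table only states the peak rate, the warmup length, and that the decay is cosine; the linear warmup shape and the decay to exactly zero at step 16,000 are assumptions, and lr_at is a hypothetical helper rather than a function from the training script.

```python
import math

PEAK_LR, WARMUP, TOTAL = 5e-4, 400, 16_000

def lr_at(step):
    """ASSUMED linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP
    progress = min((step - WARMUP) / (TOTAL - WARMUP), 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

# Token throughput implied by the table: batch size x block size per step.
tokens_per_step = 16 * 512
tokens_seen = tokens_per_step * TOTAL

for step in (0, 400, 8_200, 16_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
print(f"tokens per step: {tokens_per_step:,}; over the run: {tokens_seen:,}")
```

If the 5M-token cap is the whole training pool and every step draws a full batch from it, the run revisits that pool on the order of 26 times.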
Training Script
notebooks/conservative_arch/scaleup/train_fock_attention_scaleup.py
Colab Notebook
notebooks/conservative_arch/scaleup/colab_fock_attention_h128.ipynb — 4-arm sweep over head count and schedule with live progress display, saves results to Google Drive.
Training Results
notebooks/conservative_arch/scaleup/results/semsimula_fock_attention_h128/ — training logs, loss curves, and experiment report.
Head / d_k Sweep
| Arm | Heads | d_k | Steps | PPL |
|---|---|---|---|---|
| direct_K4_h1_8k | 1 | 64 | 8k | 11.48 |
| direct_K4_h4_8k | 4 | 32 | 8k | 10.93 |
| direct_K8_h4_8k | 8 | 32 | 8k | 11.65 |
| direct_K4_h4_16k | 4 | 32 | 16k | 9.42 |
Evaluation Results
TinyStories Validation Perplexity
| Model | PPL | Params | Gap vs Attention |
|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | -- |
| Hybrid SPLM+Attn | 8.50 | ~19.0M | +0.69 |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | +1.49 |
| Fock Attention (this model) | 9.42 | 16.7M | +1.61 |
| Multi-Xi PARFLM | 12.06 | 17.6M | +4.25 |
| Multi-Xi SPLM | 11.51 | 16.5M | +3.70 |
The Fock Attention model is 0.12 PPL behind the register-based Fock-PARFLM v2.1, confirming that register persistence provides a small advantage. However, Fock Attention has the smaller parameter count of the two Fock models (16.7M vs 17.4M; only Multi-Xi SPLM, at 16.5M, is smaller in the family) and the simplest Fock mechanism (no creation/destruction gates, no stack discipline).
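For readers comparing these numbers with training logs, the short sketch below recomputes the gap column from the PPL values in the table and converts each perplexity to a mean per-token loss. It assumes the usual definition, perplexity = exp(mean cross-entropy in nats); the card does not state how the validation PPL was aggregated, so the loss values are indicative only.

```python
import math

ATTENTION_PPL = 7.81  # Matched Attention baseline from the table

ppl = {
    "Hybrid SPLM+Attn": 8.50,
    "Fock-PARFLM v2.1": 9.30,
    "Fock Attention": 9.42,
    "Multi-Xi SPLM": 11.51,
    "Multi-Xi PARFLM": 12.06,
}

def loss_from_ppl(p):
    """Mean cross-entropy in nats, assuming PPL = exp(loss)."""
    return math.log(p)

gaps = {name: round(p - ATTENTION_PPL, 2) for name, p in ppl.items()}

for name, p in ppl.items():
    print(f"{name:18s} PPL {p:5.2f}  gap {gaps[name]:+.2f}  "
          f"loss {loss_from_ppl(p):.3f} nats")
```

In loss terms the 0.12 PPL difference between the two Fock variants is small, a little over 0.01 nats per token.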
SPLM Family Overview
This model is part of the Semantic Simulation SPLM family:
| Model | Design | Inference | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM | Pure scalar potential, K-EMA context | | semsimula-splm-multixi |
| Hybrid SPLM+Attn | Attention front-end + SPLM refinement | | semsimula-hybrid-splm |
| Multi-Xi PARFLM | Scalar potential + sparse pairwise forces | | semsimula-parflm-multixi |
| Fock-PARFLM v2.1 | PARFLM + Fock register pool (mediated exchange) | | semsimula-fock-parflm |
| Fock Attention | PARFLM + direct token-to-token exchange | | this model |
Bias, Risks, and Limitations
- Research checkpoint only. Proof-of-concept for direct exchange force as a non-conservative complement to conservative dynamics.
- TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
- English only. No multilingual capability.
- Small scale. 16.7M parameters, 256-dim hidden states.
- No safety training. No RLHF, DPO, or safety filtering has been applied.
Citation
@misc{Gueorguiev2026SemSim,
author = {Gueorguiev, Dimitar P.},
title = {Semantic Simulation: A Prescriptive Lagrangian Framework
for Efficient Semantic Inference --- A Conservative-by-
Construction Language Model and the Shared-Potential
Separator, with a Correspondence to Joint Embedding
Predictive Architectures},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19712427},
url = {https://doi.org/10.5281/zenodo.19712427},
note = {Version v15 (Jun 7, 2026).
Companion code repository (DOI 10.5281/zenodo.20579561):
\url{https://github.com/dimitarpg13/semsimula-paper}}
}
Environmental Impact
- Hardware: NVIDIA A100/H100 (Google Colab)
- Training time: ~8 hours (16,000 steps with gradient checkpointing)
- Carbon footprint: Estimated < 3 kg CO2
