Fock Attention PARFLM (Direct Token-to-Token Exchange Force Language Model)

The Fock Attention PARFLM implements the Section 5.1 Feynman diagram as a literal non-conservative force: each token emits a virtual photon carrying a key and payload, and each token absorbs with a query. The exchange coupling is $\alpha_{ij} = \text{softmax}_j(q_i \cdot k_j / \sqrt{d_k})$ and the resulting force is $F_i = \sum_j \alpha_{ij} \cdot v_j$. This is the $\lambda = 0$ (instantaneous exchange) limit of the Fock mechanism: no registers, no persistence, no creation/destruction gates.

This "Route 2" across the Conservative Obstruction is the $O(T^{2})$ counterpart to the Fock register pool's $O(1)$ "Route 1". It achieves 9.42 PPL on TinyStories, 0.12 PPL behind the register-based Fock-PARFLM v2.1 (9.30), confirming that register persistence provides a small but consistent advantage once routing is fixed.

Part of the Semantic Simulation framework.

Model Details
Architecture
Why Not a Transformer?
Geometric Capabilities
How to Get Started
Training Details
Evaluation Results
SPLM Family Overview
Bias, Risks, and Limitations
Citation
Environmental Impact

Model Details

Model Description

The Fock Attention PARFLM extends the Multi-Xi PARFLM with a direct token-to-token exchange force inspired by Feynman's virtual particle exchange diagram. Unlike the register-based Fock-PARFLM which uses persistent auxiliary state, this variant implements instantaneous exchange:

Token $j$ emits: key $k_{j} = W_{K} h_{j}$ , payload $v_{j} = W_{V} h_{j}$
Token $i$ absorbs: query $q_{i} = W_{Q} h_{i}$
Coupling: $\alpha_{ij} = \text{softmax}_j(q_i \cdot k_j / \sqrt{d_k})$
Exchange force on token i: $F_i = \sum_j \alpha_{ij} \cdot v_j$

The force is injected post-Verlet step as a non-conservative addition:

$h \mathrel{+}= \frac{\Delta t^2}{m} \cdot \tanh(s_{\text{ex}}) \cdot F_{\text{exchange}}$

The learned exchange scale $\tanh(s_{\text{ex}})$ converges to approximately -0.32 (repulsive), meaning the exchange force pushes tokens apart in hidden-state space rather than attracting them.
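
The four definitions and the update rule above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the repository's PyTorch implementation: the function names `exchange_force` and `inject_exchange`, the random weights, the choice of a `W_v` that maps back to the full hidden width, a causal mask that includes the token itself, and the placeholder values `dt = 1.0` and `m = 1.0` are all assumptions made for the example.

```python
import numpy as np

def exchange_force(h, W_q, W_k, W_v):
    """Causal single-head exchange force F_i = sum_j alpha_ij * v_j.

    h: (T, d) hidden states; W_q, W_k: (d, d_k); W_v: (d, d).
    Returns (F, alpha) with F: (T, d) and alpha: (T, T).
    """
    T, d_k = h.shape[0], W_q.shape[1]
    q = h @ W_q                                    # queries of absorbing tokens
    k = h @ W_k                                    # keys of emitting tokens
    v = h @ W_v                                    # payloads of emitting tokens
    scores = (q @ k.T) / np.sqrt(d_k)              # (T, T) raw couplings
    causal = np.tril(np.ones((T, T), dtype=bool))  # token i sees only j <= i
    scores = np.where(causal, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)      # softmax over j
    return alpha @ v, alpha

def inject_exchange(h, F, s_ex, dt=1.0, m=1.0):
    """Non-conservative update h += (dt^2 / m) * tanh(s_ex) * F."""
    return h + (dt ** 2 / m) * np.tanh(s_ex) * F

rng = np.random.default_rng(0)
T, d, d_k = 6, 16, 4                               # toy sizes, not the model's
h = rng.standard_normal((T, d))
W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

F, alpha = exchange_force(h, W_q, W_k, W_v)
s_ex = np.arctanh(-0.32)                           # gives tanh(s_ex) = -0.32
h_new = inject_exchange(h, F, s_ex)
```

With a negative scale, `h_new - h` points opposite to `F`, which is the repulsive behaviour described above.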

Developed by: Dimitar P. Gueorguiev (Independent Researcher)
Model type: Conservative autoregressive LM with direct exchange force
Language: English
License: CC-BY-4.0

Model Sources

Paper: Semantic Simulation: A Prescriptive Lagrangian Framework for Efficient Semantic Inference (Section 17c)
Repository: github.com/dimitarpg13/semsimula-paper
Model source code: notebooks/conservative_arch/parf/model_fock_attention.py

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K=4 channels]
       |
       +-- Single-body: V_theta([xi_1..xi_K, h]) -> R           [3-layer MLP]
       |
       +-- Sparse PARF: V_phi(h_t, h_s)                         [top-k=8 routing]
       |
       +-- Conservative force: f = -grad_h(V_theta + V_phi)     [autograd]
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1+dt*gamma); h += dt*v
       |
       +-- Direct exchange force (Feynman diagram):
       |     q_i = W_Q * h_t     [4 heads, d_k=32]
       |     k_j = W_K * h_s
       |     v_j = W_V * h_s
       |     alpha_ij = softmax_j(q_i . k_j / sqrt(d_k))        [causal mask]
       |     F_i = sum_j alpha_ij * v_j
       |     h += (dt^2/m) * tanh(s_ex) * F_i                   [non-conservative]
       |
       +-- LayerNorm(h)
       |
   Logits = h @ E^T                                            [tied embeddings]
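
To make the damped update in the diagram concrete, the sketch below applies it to a toy quadratic potential $V(h) = \tfrac{1}{2}\|h\|^2$, whose force is simply $-h$. The names `damped_step`, `toy_force` and `energy`, the toy potential, and the values `dt = 0.1` and `m = 1.0` are illustrative assumptions; only the update order and $\gamma = 0.30$ are taken from this card.

```python
import numpy as np

def damped_step(h, v, force, dt=0.1, m=1.0, gamma=0.30):
    """One step of: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v."""
    v = v + dt * force(h) / m
    v = v / (1.0 + dt * gamma)
    h = h + dt * v
    return h, v

def toy_force(h):
    """Force of the toy potential V(h) = 0.5*|h|^2 (an assumption, not V_theta)."""
    return -h

def energy(h, v, m=1.0):
    """Kinetic plus potential energy for the toy potential."""
    return 0.5 * m * np.dot(v, v) + 0.5 * np.dot(h, h)

force = toy_force
h, v = np.array([1.0, -2.0, 0.5]), np.zeros(3)
energies = [energy(h, v)]
for _ in range(8):                  # eight steps, mirroring L=8
    h, v = damped_step(h, v, force)
    energies.append(energy(h, v))
```

In this toy run the total energy after eight steps is lower than at the start, which is the qualitative behaviour the damping term is designed to produce.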

Parameter	Value
Hidden dim (d)	256
Layers (L)	8
$V_\theta$ hidden / depth	1024 / 3
Xi channels (K)	4
$V_\phi$ kind	structural_competitive
$V_\phi$ hidden (H)	128
Sparse routing top_k	8
Exchange heads	4
Exchange d_k	32
Learned exchange scale	$\tanh(s_{\text{ex}})$ ~ -0.32 (repulsive)
Mass model	logfreq
Damping $\gamma$	0.30 (fixed)
Total parameters	16,714,708

Key Design Properties

Two routes across the obstruction: This model takes "Route 2" (direct exchange, $O(T^{2})$) while the Fock-PARFLM takes "Route 1" (register-mediated, $O(1)$). The two reach comparable PPL (9.42 vs 9.30).
Repulsive exchange: The learned scale is negative, meaning the exchange force diversifies token representations rather than collapsing them.
Minimal overhead: Only ~131K parameters for the exchange mechanism (Q/K/V projections + scale scalar).
Otherwise conservative: The core dynamics $V_\theta + V_\phi$ remain fully conservative; the exchange force is the only non-conservative component.
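
The ~131K figure can be checked with a little arithmetic. The sketch below shows one decomposition that reproduces it, assuming bias-free projections in which $W_Q$ and $W_K$ map the 256-dim hidden state to 4 heads of 32 dims each and $W_V$ maps to the full 256 dims. That layout is an assumption made for illustration; the actual shapes in `model_fock_attention.py` may differ.

```python
d = 256            # hidden dim (from the parameter table)
n_heads = 4        # exchange heads
d_k = 32           # per-head key/query dim

# Assumed layout: bias-free linear maps.
w_q = d * n_heads * d_k    # 256 x 128
w_k = d * n_heads * d_k    # 256 x 128
w_v = d * d                # 256 x 256 (assumption: payload keeps full width)
scale = 1                  # the scalar s_ex

exchange_params = w_q + w_k + w_v + scale
total_params = 16_714_708
share = exchange_params / total_params
print(f"{exchange_params:,} exchange parameters ({share:.2%} of the model)")
# 131,073 exchange parameters (0.78% of the model)
```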

Why Not a Transformer?

The Fock Attention PARFLM is not based on the Transformer architecture. There are no Transformer-style FFN towers. The conservative core is driven by two small scalar-potential MLPs: $V_\theta$ (~3.4M params) and $V_\phi$ (~19K params). The direct exchange force adds only ~131K parameters (Q/K/V projections).

Key structural differences from Transformers:

Property	Transformer (GPT-2 small)	Fock Attention (this model)
Architecture	Self-attention + FFN blocks	Scalar-potential gradient flow + direct exchange
Core computation	50.3M (MLP) + 28.3M (attention)	3.4M $V_\theta$ + 19K $V_\phi$ + 131K (exchange)
Runtime state per token	$O(T)$: KV-cache grows linearly	$O(T)$ state; exchange compute is $O(T^{2})$
Total parameters	124M	16.7M
Pairwise token interaction	$O (T^{2})$ dense attention	$O (T^{2})$ direct exchange + $O (T k)$ sparse PARF

Note: Because this model uses a direct token-to-token exchange force (the "Route 2" across the Conservative Obstruction), its runtime cost is $O (T^{2})$ like attention. For fully $O (1)$ inference with comparable PPL, see the register-based Fock-PARFLM v2.1 ("Route 1"), which achieves 9.30 PPL vs this model's 9.42 PPL.
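
The gap between the two pairwise terms in the table can be quantified by counting causal token pairs. The sketch below does so for the training block size of 512 and `top_k = 8`. The helper names are hypothetical, and the count assumes that each token may interact with itself and all earlier tokens in the dense case, and with at most `top_k` of them in the sparse case.

```python
def dense_causal_pairs(T):
    """Pairs (i, j) with j <= i: the O(T^2) direct-exchange cost."""
    return T * (T + 1) // 2

def sparse_causal_pairs(T, top_k):
    """Pairs when each token keeps at most top_k partners: the O(T*k) cost."""
    return sum(min(t, top_k) for t in range(1, T + 1))

T, top_k = 512, 8
dense = dense_causal_pairs(T)
sparse = sparse_causal_pairs(T, top_k)
print(dense, sparse, round(dense / sparse, 1))
# 131328 4068 32.3
```

Under these assumptions the direct exchange evaluates about 32 times as many pairs as the sparse PARF term at full context length.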

Geometric Capabilities

Note: This model uses a direct $O (T^{2})$ exchange force that is non-conservative, breaking the full Riemannian guarantee. The full Riemannian geometry — Jacobi metric, computable geodesics, curvature-based hallucination detection, native chain-of-thought — is available only in the purely conservative variants: Multi-Xi SPLM, Multi-Xi PARFLM, and Fock-PARFLM v2.1. See Section 18d and Section 23 of the paper for details.

Damped Riemannian Geometry (June 2026 update)

A Riemannian Geometry Diagnostic Battery run on all three SPLM-family checkpoints confirmed that while the force field $f = -\nabla V_\theta$ is conservative, the full dynamics are dominated by the damping term $\gamma$ in the integrator. Key findings:

Metric validity (Arm 1): The layer-dependent conformal factor $\Omega^2 = 2T_\ell \cdot m > 0$ at 100% of positions — the Riemannian metric is well-defined everywhere.
Geodesic compliance (Arm 2): Undamped geodesics predict the wrong direction (compliance $\approx -0.4$ ); the damped geodesic equation (with friction $-\gamma \dot{\gamma}^k$ ) is required.
Energy dissipation (Arm 4): Energy decays monotonically across layers (SPLM: 172% drift), consistent with the designed damping — not a numerical artefact.
Asymmetry (Arm 5): The layer map is strongly asymmetric ($R^2_{\text{sym}} \ll 0$), expected for damped dynamics + LayerNorm.

The theoretical framework has been updated from undamped Maupertuis-Jacobi to damped Riemannian geometry with a contact Hamiltonian interpretation. See the companion note for full details.
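
For orientation, the difference between the two equations compared in Arm 2 can be written out in generic form. The block below is a textbook-style sketch, not the paper's exact formulation: it writes the trajectory as $x(t)$ to keep it distinct from the damping coefficient $\gamma$, and it leaves the metric (and hence the Christoffel symbols $\Gamma^k_{ij}$) unspecified.

```latex
% Undamped geodesic of a metric with Christoffel symbols \Gamma^k_{ij}
\ddot{x}^k + \Gamma^k_{ij}\,\dot{x}^i\,\dot{x}^j = 0

% The same equation with a linear friction term of strength \gamma
\ddot{x}^k + \Gamma^k_{ij}\,\dot{x}^i\,\dot{x}^j = -\gamma\,\dot{x}^k
```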

How to Get Started

# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch

import torch
import sys
sys.path.insert(0, "parf")
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from parf.model_fock_attention import FockAttentionPARFLM, FockAttentionConfig

config = FockAttentionConfig(
    vocab_size=50257,
    d=256,
    n_layers=8,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma=0.30,
    xi_channels=4,
    v_phi_kind="structural_competitive",
    v_phi_hidden=128,
    top_k=8,
    exchange_n_heads=4,
    exchange_d_k=32,
)

model = FockAttentionPARFLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)
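
A forward pass is usually followed by text generation. The loop below is a framework-agnostic sketch of greedy decoding with context cropping to the block size of 512. The names `greedy_generate` and `toy_logits` are hypothetical, and the stand-in scoring function exists only so the example runs on its own; it is not part of the repository.

```python
def greedy_generate(next_token_logits, prompt, max_new_tokens, block_size=512):
    """Greedy decoding loop.

    next_token_logits: callable taking a list of token ids and returning a
    list of scores, one per vocabulary entry, for the next token.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        context = tokens[-block_size:]        # crop to the context window
        scores = next_token_logits(context)
        # Pick the highest-scoring token id (first one on ties).
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens

# Stand-in "model" for demonstration only: favours (last token + 1) mod 5.
def toy_logits(context, vocab_size=5):
    target = (context[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

out = greedy_generate(toy_logits, prompt=[0], max_new_tokens=6)
print(out)  # [0, 1, 2, 3, 4, 0, 1]
```

With the real model, `next_token_logits` would wrap a forward pass and return the logits at the last position. How `FockAttentionPARFLM` is called without targets is not documented on this card, so that wrapper is left out.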

Available Checkpoint

A trained checkpoint (PPL 9.42, 16k steps) is included in this repository:

File	Description
`checkpoint/model.pt`	Full model state dict (64 MB)
`training_log.jsonl`	Per-step training metrics
`loss_curve.png`	Training/validation loss plot
`training_summary.md`	Hyperparameters and final metrics

To load the checkpoint:

from huggingface_hub import hf_hub_download
import torch

# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-fock-attention",
    filename="checkpoint/model.pt",
)

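# Load into model (after creating the model as above).
# Accept either a bare state dict or a training checkpoint that wraps it
# under the "model_state_dict" key.
state = torch.load(ckpt_path, map_location="cpu")
state_dict = state["model_state_dict"] if "model_state_dict" in state else state
model.load_state_dict(state_dict)
model.eval()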

Training Details

Training Data

TinyStories -- GPT-2 BPE tokenization. Training cap: 5M tokens.

Training Procedure

Hyperparameter	Value
Optimizer	AdamW
Learning rate	5e-4 (cosine decay)
Warmup steps	400
Weight decay	0.01
Gradient clipping	1.0
Batch size	16
Block size	512
Training steps	16,000
Memory optimisation	Level-2 grad checkpoint
Hardware	A100/H100 (Google Colab)
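
The learning-rate and budget entries in the table can be turned into numbers with a short sketch. The schedule shape is an assumption: the card states only "cosine decay" with 400 warmup steps, so linear warmup and decay to zero are illustrative choices, and `lr_at` is a hypothetical helper. The token arithmetic assumes every step consumes one full batch with no gradient accumulation.

```python
import math

def lr_at(step, peak_lr=5e-4, warmup_steps=400, total_steps=16_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Token budget implied by the table: batch 16 x block 512 x 16,000 steps.
tokens_per_step = 16 * 512
tokens_seen = tokens_per_step * 16_000
passes_over_cap = tokens_seen / 5_000_000   # relative to the 5M-token cap

for step in (0, 200, 400, 8_200, 16_000):
    print(step, f"{lr_at(step):.2e}")
print(f"{tokens_seen:,} token positions, about {passes_over_cap:.1f} passes")
```

Under these assumptions training processes 131,072,000 token positions, roughly 26 passes over a 5M-token cap.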

Training Script

notebooks/conservative_arch/scaleup/train_fock_attention_scaleup.py

Colab Notebook

notebooks/conservative_arch/scaleup/colab_fock_attention_h128.ipynb — 4-arm sweep over head count and schedule with live progress display, saves results to Google Drive.

Training Results

notebooks/conservative_arch/scaleup/results/semsimula_fock_attention_h128/ — training logs, loss curves, and experiment report.

Head / d_k Sweep

Arm	Heads	d_k	Steps	PPL
direct_K4_h1_8k	1	64	8k	11.48
direct_K4_h4_8k	4	32	8k	10.93
direct_K8_h4_8k	8	32	8k	11.65
direct_K4_h4_16k	4	32	16k	9.42

Evaluation Results

TinyStories Validation Perplexity

Model	PPL	Params	Gap vs Attention
Matched Attention (baseline)	7.81	19.5M	--
Hybrid SPLM+Attn	8.50	~19.0M	+0.69
Fock-PARFLM v2.1	9.30	17.4M	+1.49
Fock Attention (this model)	9.42	16.7M	+1.61
Multi-Xi PARFLM	12.06	17.6M	+4.25
Multi-Xi SPLM	11.51	16.5M	+3.70

Fock Attention is 0.12 PPL behind the register-based Fock-PARFLM v2.1, consistent with register persistence providing a small advantage. However, Fock Attention has the second-smallest parameter count in the table above (16.7M, after Multi-Xi SPLM at 16.5M) and the simplest Fock mechanism (no creation/destruction gates, no stack discipline).
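
Perplexity differences are easier to compare on the loss scale. Assuming the reported PPL is the exponential of the mean per-token cross-entropy in nats (the usual convention, though not stated on this card), the table converts as follows. The dictionary simply restates the values above.

```python
import math

ppl = {
    "Matched Attention": 7.81,
    "Hybrid SPLM+Attn": 8.50,
    "Fock-PARFLM v2.1": 9.30,
    "Fock Attention": 9.42,
    "Multi-Xi PARFLM": 12.06,
    "Multi-Xi SPLM": 11.51,
}

baseline = ppl["Matched Attention"]
for name, value in ppl.items():
    loss = math.log(value)          # nats per token, if PPL = exp(loss)
    gap = value - baseline          # the "Gap vs Attention" column
    print(f"{name:<20} PPL {value:5.2f}  loss {loss:.3f}  gap {gap:+.2f}")

# The 0.12 PPL gap between the two Fock variants, expressed in loss units.
delta_loss = math.log(ppl["Fock Attention"]) - math.log(ppl["Fock-PARFLM v2.1"])
print(f"delta loss: {delta_loss:.4f} nats/token")
```

Under that assumption the 0.12 PPL gap corresponds to roughly 0.013 nats per token.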

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

Model	Design	Inference	HuggingFace
Multi-Xi SPLM	Pure scalar potential, K-EMA context	$O (1)$	semsimula-splm-multixi
Hybrid SPLM+Attn	Attention front-end + SPLM refinement	$O (T)$	semsimula-hybrid-splm
Multi-Xi PARFLM	Scalar potential + sparse pairwise forces	$O (1)$	semsimula-parflm-multixi
Fock-PARFLM v2.1	PARFLM + Fock register pool (mediated exchange)	$O (1)$	semsimula-fock-parflm
Fock Attention	PARFLM + direct token-to-token exchange	$O (T^{2})$	this model

Bias, Risks, and Limitations

Research checkpoint only. Proof-of-concept for direct exchange force as a non-conservative complement to conservative dynamics.
TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
English only. No multilingual capability.
Small scale. 16.7M parameters, 256-dim hidden states.
No safety training. No RLHF, DPO, or safety filtering has been applied.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

Hardware: NVIDIA A100/H100 (Google Colab)
Training time: ~8 hours (16,000 steps with gradient checkpointing)
Carbon footprint: Estimated < 3 kg CO2
