Fock Attention PARFLM (Direct Token-to-Token Exchange Force Language Model)

The Fock Attention PARFLM implements the Section 5.1 Feynman diagram as a literal non-conservative force: each token emits a virtual photon carrying a key and payload, and each token absorbs with a query. The exchange coupling is \(\alpha_{ij} = \text{softmax}_j(q_i \cdot k_j / \sqrt{d_k})\) and the force is \(F_i = \sum_j \alpha_{ij} \cdot v_j\). This is the \(\lambda = 0\) (instantaneous exchange) limit of the Fock mechanism: no registers, no persistence, no creation/destruction gates.

This "Route 2" across the Conservative Obstruction is the \(O(T^2)\) counterpart to the Fock register pool's \(O(1)\) "Route 1". It achieves 9.42 PPL on TinyStories, slightly behind the register-based Fock-PARFLM v2.1 (9.30), which is consistent with register persistence providing a small advantage once routing is fixed.

Part of the Semantic Simulation framework.

Model Details

Model Description

The Fock Attention PARFLM extends the Multi-Xi PARFLM with a direct token-to-token exchange force inspired by Feynman's virtual particle exchange diagram. Unlike the register-based Fock-PARFLM which uses persistent auxiliary state, this variant implements instantaneous exchange:

  1. Token \(j\) emits: key \(k_j = W_K h_j\), payload \(v_j = W_V h_j\)
  2. Token \(i\) absorbs: query \(q_i = W_Q h_i\)
  3. Coupling: \(\alpha_{ij} = \text{softmax}_j(q_i \cdot k_j / \sqrt{d_k})\)
  4. Exchange force on token \(i\): \(F_i = \sum_j \alpha_{ij} \cdot v_j\)

The force is injected post-Verlet step as a non-conservative addition:

\[ h \mathrel{+}= \frac{\Delta t^2}{m} \cdot \tanh(s_{\text{ex}}) \cdot F_{\text{exchange}} \]

The learned exchange scale \(\tanh(s_{\text{ex}})\) converges to approximately -0.32 (repulsive), meaning the exchange force pushes tokens apart in hidden-state space rather than attracting them.
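The four steps and the post-step injection above can be sketched in a few lines. This is a dependency-free, single-head illustration in plain Python, not the repository's PyTorch implementation: the function names `exchange_force` and `apply_exchange`, the identity projections, and the values of `dt`, `m`, and `s_ex` are all assumptions chosen for the example (`s_ex = -0.33` is picked only so that `tanh(s_ex)` is close to -0.32).

```python
import math

def exchange_force(q, k, v):
    """Causal softmax exchange: F_i = sum over j <= i of alpha_ij * v_j.

    q, k: lists of T vectors of length d_k; v: list of T payload vectors.
    """
    d_k = len(q[0])
    forces = []
    for i, q_i in enumerate(q):
        # Scaled dot-product scores against tokens j <= i only (causal mask).
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible tokens.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        alpha = [w / total for w in weights]
        # Weighted sum of payloads.
        forces.append([sum(alpha[j] * v[j][c] for j in range(i + 1))
                       for c in range(len(v[0]))])
    return forces

def apply_exchange(h, force, dt=1.0, m=1.0, s_ex=-0.33):
    """h += (dt^2 / m) * tanh(s_ex) * F, applied per token (illustrative values)."""
    scale = (dt * dt / m) * math.tanh(s_ex)
    return [[h_c + scale * f_c for h_c, f_c in zip(h_t, f_t)]
            for h_t, f_t in zip(h, force)]

# Toy example: 3 tokens, d_k = 2; W_Q, W_K, W_V are taken as identity here.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
F = exchange_force(h, h, h)
h_new = apply_exchange(h, F)
```

Because the first token can only see itself, its coupling is exactly 1 and its force equals its own payload. With a negative scale the update then moves `h` against that force, which is the sense in which the learned exchange is described as repulsive.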

  • Developed by: Dimitar P. Gueorguiev (Independent Researcher)
  • Model type: Conservative autoregressive LM with direct exchange force
  • Language: English
  • License: CC-BY-4.0

Model Sources

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K=4 channels]
       |
       +-- Single-body: V_theta([xi_1..xi_K, h]) -> R           [3-layer MLP]
       |
       +-- Sparse PARF: V_phi(h_t, h_s)                         [top-k=8 routing]
       |
       +-- Conservative force: f = -grad_h(V_theta + V_phi)     [autograd]
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1+dt*gamma); h += dt*v
       |
       +-- Direct exchange force (Feynman diagram):
       |     q_i = W_Q * h_t     [4 heads, d_k=32]
       |     k_j = W_K * h_s
       |     v_j = W_V * h_s
       |     alpha_ij = softmax_j(q_i . k_j / sqrt(d_k))        [causal mask]
       |     F_i = sum_j alpha_ij * v_j
       |     h += (dt^2/m) * tanh(s_ex) * F_i                   [non-conservative]
       |
       +-- LayerNorm(h)
       |
   Logits = h @ E^T                                            [tied embeddings]

| Parameter | Value |
| --- | --- |
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| \(V_\theta\) hidden / depth | 1024 / 3 |
| Xi channels (K) | 4 |
| \(V_\phi\) kind | structural_competitive |
| \(V_\phi\) hidden (H) | 128 |
| Sparse routing top_k | 8 |
| Exchange heads | 4 |
| Exchange d_k | 32 |
| Learned exchange scale \(\tanh(s_{\text{ex}})\) | ~ -0.32 (repulsive) |
| Mass model | logfreq |
| Damping \(\gamma\) | 0.30 (fixed) |
| Total parameters | 16,714,708 |
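The K-EMA context channels in the diagram can be illustrated with a small causal exponential moving average. This is a sketch under assumptions: the recurrence form, the initialisation `xi_0 = h_0`, and the four decay rates below are hypothetical and are not taken from the repository; only the name `causal_ema` and K = 4 come from this card.

```python
def causal_ema(h, alpha):
    """Causal exponential moving average over a sequence of vectors.

    Assumed form: xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t, with xi_0 = h_0.
    Position t depends only on h_0..h_t, so no future token leaks in.
    """
    xi = [list(h[0])]
    for h_t in h[1:]:
        prev = xi[-1]
        xi.append([alpha * p + (1.0 - alpha) * x for p, x in zip(prev, h_t)])
    return xi

# Hypothetical decay rates for K = 4 channels (trained values are not given here).
alphas = [0.5, 0.75, 0.9, 0.97]
h = [[1.0], [0.0], [0.0], [0.0]]          # an impulse at t = 0
channels = [causal_ema(h, a) for a in alphas]

# How much of the impulse each channel still carries at t = 3.
tail = [ch[3][0] for ch in channels]
```

Channels with a larger `alpha` forget more slowly, so a bank of several rates gives the single-body potential access to context at several time scales without storing the full history.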

Key Design Properties

  • Two routes across the obstruction: This model takes "Route 2" (direct exchange, \(O(T^2)\)) while the Fock-PARFLM takes "Route 1" (register-mediated, \(O(1)\)). Both achieve comparable PPL.
  • Repulsive exchange: The learned scale is negative, meaning the exchange force diversifies token representations rather than collapsing them.
  • Minimal overhead: Only ~131K parameters for the exchange mechanism (Q/K/V projections + scale scalar).
  • Otherwise conservative: The core dynamics \(V_\theta + V_\phi\) remain fully conservative; the exchange force is the only non-conservative component.

Why Not a Transformer?

The Fock Attention PARFLM is not based on the Transformer architecture. There are no Transformer-style FFN towers. The conservative core is driven by two small scalar-potential MLPs: \(V_\theta\) (3.4M params) and \(V_\phi\) (19K params). The direct exchange force adds only ~131K parameters (Q/K/V projections).
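A quick budget check shows where most of the 16.7M parameters sit. The embedding size follows directly from `vocab_size` and `d` in this card's configuration. The layout assumed for \(V_\theta\) (input of K + 1 = 5 concatenated d-dimensional vectors, three hidden layers of width 1024 with biases, scalar output) is a guess that happens to land on the stated ~3.4M; it is not confirmed by the source, and `mlp_params` is a helper introduced only for this example.

```python
vocab_size, d = 50257, 256          # from the configuration in this card
total_params = 16_714_708           # from the parameter table

# Tied embeddings: the vocab x d matrix is counted once.
embedding = vocab_size * d

def mlp_params(sizes):
    """Weights plus biases of a dense MLP with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# ASSUMPTION: V_theta reads K + 1 = 5 concatenated d-dim vectors, has three
# hidden layers of width 1024, and outputs a scalar.
k_channels, v_hidden = 4, 1024
v_theta = mlp_params([(k_channels + 1) * d, v_hidden, v_hidden, v_hidden, 1])

print(f"embedding: {embedding:,} ({embedding / total_params:.1%})")
print(f"V_theta:   {v_theta:,} ({v_theta / total_params:.1%})")
```

Under these assumptions the token embedding alone is roughly three quarters of the model and \(V_\theta\) about a fifth, which leaves only a small remainder for \(V_\phi\), the exchange projections, and everything else.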

Key structural differences from Transformers:

| Property | Transformer (GPT-2 small) | Fock Attention (this model) |
| --- | --- | --- |
| Architecture | Self-attention + FFN blocks | Scalar-potential gradient flow + direct exchange |
| Core computation | 50.3M (MLP) + 28.3M (attention) | 3.4M \(V_\theta\) + 19K \(V_\phi\) + 131K (exchange) |
| Runtime state per token | \(O(T)\); KV-cache grows linearly | \(O(T)\); exchange force is \(O(T^2)\) |
| Total parameters | 124M | 16.7M |
| Pairwise token interaction | \(O(T^2)\) dense attention | \(O(T^2)\) direct exchange + \(O(Tk)\) sparse PARF |

Note: Because this model uses a direct token-to-token exchange force (the "Route 2" across the Conservative Obstruction), its runtime cost is \(O(T^2)\) like attention. For fully \(O(1)\) inference with comparable PPL, see the register-based Fock-PARFLM v2.1 ("Route 1"), which achieves 9.30 PPL vs this model's 9.42 PPL.
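To make the \(O(T^2)\) versus \(O(Tk)\) contrast concrete, the pair counts can be tallied for one training block. The sketch assumes each token may interact with itself and its prefix (causal masking) and that sparse routing keeps at most `top_k` partners per token; whether the self-pair is included in the actual model is an assumption, and the count ignores whatever it costs to score candidates before routing.

```python
def dense_causal_pairs(T):
    """Number of (i, j) pairs with j <= i: every token sees its whole prefix."""
    return T * (T + 1) // 2

def sparse_topk_pairs(T, k):
    """Pairs kept when each token routes to at most k tokens of its prefix."""
    return sum(min(i + 1, k) for i in range(T))

T, k = 512, 8   # block size and top_k from this card
print(dense_causal_pairs(T), sparse_topk_pairs(T, k))
```

For a 512-token block the dense exchange touches about 131 thousand pairs while top-8 routing keeps about 4 thousand, which is why the direct exchange term, not the sparse PARF term, sets the asymptotic cost.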

Figure: Runtime information capacity vs sequence length.

Geometric Capabilities

Note: This model uses a direct \(O(T^2)\) exchange force that is non-conservative, breaking the full Riemannian guarantee. The full Riemannian geometry (Jacobi metric, computable geodesics, curvature-based hallucination detection, native chain-of-thought) is available only in the purely conservative variants: Multi-Xi SPLM, Multi-Xi PARFLM, and Fock-PARFLM v2.1. See Section 18d and Section 23 of the paper for details.

Damped Riemannian Geometry (June 2026 update)

A Riemannian Geometry Diagnostic Battery run on all three SPLM-family checkpoints confirmed that while the force field \(f = -\nabla V_\theta\) is conservative, the full dynamics are dominated by the damping term \(\gamma\) in the integrator. Key findings:

  • Metric validity (Arm 1): The layer-dependent conformal factor satisfies \(\Omega^2 = 2T_\ell \cdot m > 0\) at 100% of positions, so the Riemannian metric is well-defined everywhere.
  • Geodesic compliance (Arm 2): Undamped geodesics predict the wrong direction (compliance \(\approx -0.4\)); the damped geodesic equation (with friction \(-\gamma \dot{\gamma}^k\)) is required.
  • Energy dissipation (Arm 4): Energy decays monotonically across layers (SPLM: 172% drift), consistent with the designed damping rather than a numerical artefact.
  • Asymmetry (Arm 5): The layer map is strongly asymmetric (\(R^2_{\text{sym}} \ll 0\)), expected for damped dynamics + LayerNorm.

The theoretical framework has been updated from undamped Maupertuis-Jacobi to damped Riemannian geometry with a contact Hamiltonian interpretation. See the companion note for full details.
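The role of damping can be seen in a one-dimensional toy run of the integrator update listed in the Architecture diagram (`v += dt*f/m; v /= (1+dt*gamma); h += dt*v`). This is only a sketch: the quadratic potential stands in for \(V_\theta + V_\phi\), the step size `dt = 0.1` and unit mass are illustrative choices, and `damped_step`, `energy`, and `toy_force` are names introduced for this example. Only `gamma = 0.30` comes from this card.

```python
def damped_step(h, v, force, m=1.0, gamma=0.30, dt=0.1):
    """One damped update: kick from the force, implicit friction, then drift."""
    v = v + dt * force(h) / m
    v = v / (1.0 + dt * gamma)
    h = h + dt * v
    return h, v

def energy(h, v, m=1.0, k=1.0):
    """Kinetic plus potential energy for the toy potential V(h) = k * h^2 / 2."""
    return 0.5 * m * v * v + 0.5 * k * h * h

def toy_force(h):
    """Conservative force f = -dV/dh for V(h) = h^2 / 2 (hypothetical stand-in)."""
    return -h

h, v = 1.0, 0.0
e_start = energy(h, v)
for _ in range(200):
    h, v = damped_step(h, v, toy_force)
e_end = energy(h, v)

print(f"energy: {e_start:.3f} -> {e_end:.5f}")
```

Even though the force is the gradient of a potential, the friction factor removes energy at every step, so the trajectory does not follow the undamped geodesic of the same potential.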

How to Get Started

# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch

import torch
import sys
sys.path.insert(0, "parf")
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from parf.model_fock_attention import FockAttentionPARFLM, FockAttentionConfig

config = FockAttentionConfig(
    vocab_size=50257,
    d=256,
    n_layers=8,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma=0.30,
    xi_channels=4,
    v_phi_kind="structural_competitive",
    v_phi_hidden=128,
    top_k=8,
    exchange_n_heads=4,
    exchange_d_k=32,
)

model = FockAttentionPARFLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)

Available Checkpoint

A trained checkpoint (PPL 9.42, 16k steps) is included in this repository:

| File | Description |
| --- | --- |
| checkpoint/model.pt | Full model state dict (64 MB) |
| training_log.jsonl | Per-step training metrics |
| loss_curve.png | Training/validation loss plot |
| training_summary.md | Hyperparameters and final metrics |

To load the checkpoint:

from huggingface_hub import hf_hub_download
import torch

# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-fock-attention",
    filename="checkpoint/model.pt",
)

# Load into model (after creating model as above)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

Training Details

Training Data

TinyStories, tokenized with GPT-2 BPE. Training cap: 5M tokens.

Training Procedure

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 |
| Memory optimisation | Level-2 grad checkpoint |
| Hardware | A100/H100 (Google Colab) |
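The learning-rate row can be read as a warmup-then-cosine schedule. The sketch below assumes a linear warmup from zero and a cosine decay that reaches zero at the final step; the card states only the peak rate, the warmup length, and the total step count, so the warmup shape, the floor of zero, and the function name `learning_rate` are assumptions.

```python
import math

def learning_rate(step, base_lr=5e-4, warmup_steps=400, total_steps=16_000):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 200, 400, 8_200, 16_000):
    print(step, f"{learning_rate(step):.2e}")
```

With these settings the rate peaks at step 400 and falls to half its peak at step 8,200, the midpoint of the decay phase.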

Training Script

notebooks/conservative_arch/scaleup/train_fock_attention_scaleup.py

Colab Notebook

notebooks/conservative_arch/scaleup/colab_fock_attention_h128.ipynb — 4-arm sweep over head count and schedule with live progress display, saves results to Google Drive.

Training Results

notebooks/conservative_arch/scaleup/results/semsimula_fock_attention_h128/ — training logs, loss curves, and experiment report.

Head / d_k Sweep

| Arm | Heads | d_k | Steps | PPL |
| --- | --- | --- | --- | --- |
| direct_K4_h1_8k | 1 | 64 | 8k | 11.48 |
| direct_K4_h4_8k | 4 | 32 | 8k | 10.93 |
| direct_K8_h4_8k | 8 | 32 | 8k | 11.65 |
| direct_K4_h4_16k | 4 | 32 | 16k | 9.42 |

Evaluation Results

TinyStories Validation Perplexity

| Model | PPL | Params | Gap vs Attention |
| --- | --- | --- | --- |
| Matched Attention (baseline) | 7.81 | 19.5M | -- |
| Hybrid SPLM+Attn | 8.50 | ~19.0M | +0.69 |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | +1.49 |
| Fock Attention (this model) | 9.42 | 16.7M | +1.61 |
| Multi-Xi PARFLM | 12.06 | 17.6M | +4.25 |
| Multi-Xi SPLM | 11.51 | 16.5M | +3.70 |

Fock Attention is 0.12 PPL behind the register-based Fock-PARFLM v2.1, consistent with register persistence providing a small advantage. However, Fock Attention has one of the smallest parameter counts in the family (16.7M; only Multi-Xi SPLM, at 16.5M, is smaller) and the simplest Fock mechanism (no creation/destruction gates, no stack discipline).
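The gaps in the table can be re-derived from the PPL column, and it can help to view them on the loss scale as well. The sketch assumes the standard definition PPL = exp(mean token cross-entropy in nats); the card does not state which definition was used, so the conversion to nats is an assumption.

```python
import math

ppl = {
    "Matched Attention": 7.81,
    "Hybrid SPLM+Attn": 8.50,
    "Fock-PARFLM v2.1": 9.30,
    "Fock Attention": 9.42,
    "Multi-Xi PARFLM": 12.06,
    "Multi-Xi SPLM": 11.51,
}

baseline = ppl["Matched Attention"]
for name, value in ppl.items():
    gap = value - baseline
    loss = math.log(value)      # mean cross-entropy in nats if PPL = exp(loss)
    print(f"{name:20s} gap={gap:+.2f}  loss={loss:.3f} nats")

# Difference between the two Fock variants, in PPL and in nats per token.
delta_ppl = ppl["Fock Attention"] - ppl["Fock-PARFLM v2.1"]
delta_nats = math.log(ppl["Fock Attention"] / ppl["Fock-PARFLM v2.1"])
```

On that reading, the 0.12 PPL difference between the two Fock variants corresponds to roughly 0.013 nats per token.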

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

| Model | Design | Inference | HuggingFace |
| --- | --- | --- | --- |
| Multi-Xi SPLM | Pure scalar potential, K-EMA context | \(O(1)\) | semsimula-splm-multixi |
| Hybrid SPLM+Attn | Attention front-end + SPLM refinement | \(O(T)\) | semsimula-hybrid-splm |
| Multi-Xi PARFLM | Scalar potential + sparse pairwise forces | \(O(1)\) | semsimula-parflm-multixi |
| Fock-PARFLM v2.1 | PARFLM + Fock register pool (mediated exchange) | \(O(1)\) | semsimula-fock-parflm |
| Fock Attention | PARFLM + direct token-to-token exchange | \(O(T^2)\) | this model |

Bias, Risks, and Limitations

  • Research checkpoint only. Proof-of-concept for direct exchange force as a non-conservative complement to conservative dynamics.
  • TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
  • English only. No multilingual capability.
  • Small scale. 16.7M parameters, 256-dim hidden states.
  • No safety training. No RLHF, DPO, or safety filtering has been applied.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

  • Hardware: NVIDIA A100/H100 (Google Colab)
  • Training time: ~8 hours (16,000 steps with gradient checkpointing)
  • Carbon footprint: Estimated < 3 kg CO2