Multi-Xi PARFLM (Property-Attractive-Repulsive Force Language Model)

The Multi-Xi PARFLM extends the SPLM with pairwise token interaction forces derived from a second scalar potential $V_\phi$. The base SPLM's $V_\theta$ is a single-body potential (each token interacts only with a summary of its past). PARFLM adds explicit pairwise forces derived from $V_\phi(h_t, h_s)$ between tokens. This is the physics-informed analogue of attention's pairwise dot-product, but because the force is the gradient of a scalar potential, it is conservative.

The pairwise forces use Gumbel-softmax top-k sparse routing to keep the cost at $O(Tk)$ rather than $O(T^2)$. This model achieves 12.06 PPL on TinyStories, a 2.6 PPL improvement over the standalone Multi-Xi SPLM at the same step count (8k steps).

Part of the Semantic Simulation framework.

Model Details

Model Description

The Multi-Xi PARFLM combines two SPLM extensions:

  1. Multi-channel K-EMA $\xi$ (from the Multi-Xi SPLM): K=8 learnable causal exponential moving averages giving $V_\theta$ a multi-resolution summary of the past.
  2. Sparse PARF pair-interactions: A second scalar potential $V_\phi(h_t, h_s)$ adds particle-exchange forces between token pairs, routed via Gumbel-softmax top-k selection.
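
The multi-channel summary in item 1 can be pictured with a dependency-free sketch. The recurrence, the decay range (0.5 to 0.99), and the helper names `log_spaced_alphas` and `causal_ema` are assumptions made for illustration; the card only states that there are K=8 learnable causal EMAs with log-spaced initialisation.

```python
import math

def log_spaced_alphas(k, lo=0.5, hi=0.99):
    """Return k decay rates spaced evenly in log(1 - alpha) (assumed init scheme)."""
    if k == 1:
        return [lo]
    a, b = math.log(1.0 - lo), math.log(1.0 - hi)
    return [1.0 - math.exp(a + (b - a) * i / (k - 1)) for i in range(k)]

def causal_ema(h, alpha):
    """Assumed recurrence: xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t.

    Position t only sees positions <= t, so the summary is causal.
    """
    xi, out = None, []
    for h_t in h:
        if xi is None:
            xi = list(h_t)
        else:
            xi = [alpha * x + (1.0 - alpha) * v for x, v in zip(xi, h_t)]
        out.append(xi)
    return out

# Toy sequence of three 2-dimensional hidden states.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = log_spaced_alphas(8)
channels = [causal_ema(h, a) for a in alphas]  # K=8 summaries, one per decay rate
```

Small decay rates track recent tokens closely and large ones retain a long memory, which is one way to read "multi-resolution summary of the past".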

The total potential energy for token t is:

$$U_t = V_\theta(\xi_t, h_t) + \sum_{s<t} m_{ts} \cdot V_\phi(h_t, h_s)$$

and the conservative force is $f_t = -\nabla_{h_t} U_t$.
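
As a minimal illustration of the two formulas above, the sketch below evaluates $U_t$ with toy stand-ins for the two potentials and estimates $f_t$ by central finite differences. The quadratic $V_\theta$, the Gaussian-bump $V_\phi$, and the function names are hypothetical; the model itself uses learned MLPs and autograd.

```python
import math

def total_potential(h_t, xi_t, past, masses):
    """U_t = V_theta(xi_t, h_t) + sum_s m_ts * V_phi(h_t, h_s), with toy potentials."""
    # Toy single-body potential: quadratic well centred on the context summary.
    v_theta = 0.5 * sum((a - b) ** 2 for a, b in zip(h_t, xi_t))
    # Toy pairwise potential: Gaussian bump around each selected past state.
    v_phi = 0.0
    for h_s, m in zip(past, masses):
        r2 = sum((a - b) ** 2 for a, b in zip(h_t, h_s))
        v_phi += m * math.exp(-r2)
    return v_theta + v_phi

def force(h_t, xi_t, past, masses, eps=1e-5):
    """f_t = -grad_{h_t} U_t, estimated by central finite differences."""
    f = []
    for i in range(len(h_t)):
        hp = list(h_t)
        hp[i] += eps
        hm = list(h_t)
        hm[i] -= eps
        d_u = total_potential(hp, xi_t, past, masses) - total_potential(hm, xi_t, past, masses)
        f.append(-d_u / (2.0 * eps))
    return f

# With no pair partners, the force pulls h_t back toward xi_t.
f_single = force([1.0, 2.0], [0.0, 0.0], past=[], masses=[])
# With h_t == xi_t, only the pairwise bump at the origin contributes.
f_pair = force([0.5, 0.0], [0.5, 0.0], past=[[0.0, 0.0]], masses=[1.0])
```

Because the force is defined only through a scalar $U_t$, it is conservative whatever form the two potentials take.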

  • Developed by: Dimitar P. Gueorguiev (Independent Researcher)
  • Model type: Conservative autoregressive language model with pairwise forces
  • Language: English
  • License: CC-BY-4.0

Model Sources

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K=8 channels]
       |
       +-- Single-body: V_theta([xi_1..xi_K, h]) -> R           [3-layer MLP]
       |
       +-- Pair routing: score_head(h_t, h_s) -> top-k selection [Gumbel-softmax]
       |
       +-- Pair forces: V_phi(h_t, h_s) -> R                    [structural competitive]
       |
       +-- Total: U_t = V_theta + sum V_phi
       |
       +-- Conservative force: f = -grad_h U_t                  [autograd]
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
       |
       +-- LayerNorm(h)
       |
   Logits = h @ E^T                                            [tied embeddings]

| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| $V_\theta$ hidden / depth | 1024 / 3 |
| Xi channels (K) | 8 |
| Alpha init | log-spaced |
| $V_\phi$ kind | structural_competitive |
| $V_\phi$ hidden (H) | 128 |
| Sparse routing top_k | 8 |
| Gumbel tau | 1.0 -> 0.1 (annealed) |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping $\gamma$ | 0.30 (fixed) |
| Total parameters | 17,632,215 |
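
The damped Euler update in the diagram (`v += dt*f/m; v /= (1 + dt*gamma); h += dt*v`) can be tried on a toy problem. This sketch uses a quadratic potential, `dt=0.1`, unit mass, and 200 steps, all of which are assumptions for illustration; only `gamma=0.30` comes from the table, and the LayerNorm applied after each step in the model is omitted.

```python
def damped_euler_step(h, v, f, m, dt, gamma):
    """One update: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v."""
    v = [(vi + dt * fi / m) / (1.0 + dt * gamma) for vi, fi in zip(v, f)]
    h = [hi + dt * vi for hi, vi in zip(h, v)]
    return h, v

def energy(h, v, m):
    """Kinetic plus potential energy for the toy potential U(h) = 0.5 * |h|^2."""
    return 0.5 * m * sum(x * x for x in v) + 0.5 * sum(x * x for x in h)

h, v, m = [1.0, -0.5], [0.0, 0.0], 1.0
e0 = energy(h, v, m)
for _ in range(200):
    f = [-x for x in h]  # f = -grad U for the toy quadratic potential
    h, v = damped_euler_step(h, v, f, m, dt=0.1, gamma=0.30)
e_final = energy(h, v, m)
print(f"energy: {e0:.4f} -> {e_final:.6f}")
```

Dividing the velocity by `(1 + dt*gamma)` treats the friction term implicitly, so the damping stays stable for any positive `dt*gamma`.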

Key Design Properties

  • Globally conservative: Both $V_\theta$ and $V_\phi$ are scalar potentials; the total force $f = -\nabla(V_\theta + \sum V_\phi)$ is conservative by construction.
  • Sparse routing: Gumbel-softmax top-k selection keeps pairwise cost at $O(Tk)$ instead of $O(T^2)$.
  • Stage-1.5b gathered $V_\phi$: Memory-efficient implementation replacing $O(T^2)$ intermediates with $O(Tk)$.
  • Inheritance chain: MultiXiPARFLM -> SparsePARFLM -> PARFLM (all conservative).
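
The sparse-routing property can be sketched in plain Python: each token perturbs its routing scores with Gumbel noise, keeps the top k past positions, and normalises the survivors with a temperature-scaled softmax. The function names, the random scores, and the way the soft weights are formed are assumptions for illustration and are not taken from the repository.

```python
import math
import random

def gumbel_noise(rng):
    """Sample standard Gumbel noise: -log(-log(u)) with u clamped inside (0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def route_top_k(scores, k, tau, rng):
    """For each token t, pick up to k past positions s < t with soft weights.

    scores[t][s] is a routing score; only s < t is considered (causal).
    Returns one {s: weight} dict per token.
    """
    routes = []
    for t, row in enumerate(scores):
        noisy = [(row[s] + gumbel_noise(rng), s) for s in range(t)]
        chosen = sorted(noisy, reverse=True)[:k]
        if not chosen:
            routes.append({})  # the first token has no past to route to
            continue
        top = max(z for z, _ in chosen)
        exps = {s: math.exp((z - top) / tau) for z, s in chosen}
        total = sum(exps.values())
        routes.append({s: e / total for s, e in exps.items()})
    return routes

rng = random.Random(0)
T, k = 32, 8
scores = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(T)]
routes = route_top_k(scores, k, tau=1.0, rng=rng)
n_pairs = sum(len(r) for r in routes)
print(f"sparse pairs: {n_pairs}, dense causal pairs: {T * (T - 1) // 2}")
```

The number of evaluated pairs is bounded by `T * k`, while dense causal pairing grows as `T * (T - 1) / 2`.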

Why Not a Transformer?

The PARFLM is not based on the Transformer architecture. There are no attention layers, no key-value cache, and no feed-forward network towers. The model uses two small scalar-potential MLPs, $V_\theta$ (single-body, ~3.4M params) and $V_\phi$ (pairwise, ~19K params), whose gradients provide conservative forces. Pairwise interactions use Gumbel-softmax top-k sparse routing at $O(Tk)$ cost, not $O(T^2)$ attention.

Key structural differences from Transformers:

| Property | Transformer (GPT-2 small) | Multi-Xi PARFLM (this model) |
|---|---|---|
| Architecture | Self-attention + FFN blocks | Scalar-potential gradient flow + sparse pair forces |
| Core computation (parameters) | 50.3M (MLP) + 28.3M (attention) | 3.4M $V_\theta$ + 19K $V_\phi$ |
| Runtime state per token | $O(T)$: KV-cache grows linearly | $O(1)$: fixed-size $h, v, \xi$ |
| Total parameters | 124M | 17.6M |
| Pairwise token interaction | $O(T^2)$ dense attention | $O(Tk)$ sparse routing (k=8) |

Because the model carries only a fixed-size state $(h, v, \xi)$ per position, with no KV-cache, its inference memory is $O(1)$ in sequence length. The figure below illustrates the widening gap between the Transformer's linearly growing KV-cache and the SPLM's constant-size dynamic state:

Runtime information capacity vs sequence length
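
A back-of-the-envelope version of this comparison, taking the card's accounting at face value, is sketched below. The GPT-2 small shape (12 layers, 768-dim) is standard; counting floats rather than bytes, and counting a single copy of $(h, v, \xi)$ with K=8 channels, are simplifying assumptions.

```python
def kv_cache_floats(seq_len, n_layers=12, d_model=768):
    """Cached keys and values: 2 tensors per layer, d_model floats per token."""
    return 2 * n_layers * d_model * seq_len

def parflm_state_floats(d=256, xi_channels=8):
    """h and v (d floats each) plus K xi channels of d floats each."""
    return (2 + xi_channels) * d

for seq_len in (128, 512, 1024):
    kv = kv_cache_floats(seq_len)
    state = parflm_state_floats()
    print(f"T={seq_len:5d}  KV-cache={kv:>12,}  fixed state={state:,}")
```

The first column grows linearly with `T`, while the second does not depend on it.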

Geometric Capabilities of Conservative Architectures

This model is fully attention-free and conservative by construction. Because all forces derive from the gradient of a scalar potential $V_\theta$, the hidden-state manifold is endowed with a natural damped Riemannian geometry, the layer-dependent Jacobi metric $\Omega^2_\ell = 2T_\ell \cdot m$, which is categorically absent from Transformer architectures. This geometry opens the door to capabilities that cannot be replicated in attention-based models:

| Capability | Conservative SPLM | Transformer |
|---|---|---|
| Riemannian metric on hidden states | Layer-dependent Jacobi metric $\Omega^2_\ell = 2T_\ell \cdot m$ from $V_\theta$; confirmed positive at 100% of positions (diagnostic battery Arm 1) | No metric structure |
| Geodesics between semantic states | Damped geodesic equation with friction term $-\gamma v$; directional cosine similarity 0.52–0.75 (Arm 2). Geodesics are asymmetric: $d(A \to B) \neq d(B \to A)$ | Linear interpolation only |
| Controlled energy dissipation as inference signal | $\Delta E_{\text{anomaly}}(t) = \lvert\Delta E(t) - \Delta E_{\text{expected}}(t)\rvert$; monotonic damped decay with measurable anomaly signal (Arm 4) | No conserved or tracked quantity |
| Curvature as uncertainty measure | $\mathcal{K}_{\max} = \lambda_{\max}(\nabla^2 V_\theta) / 2T_\ell$; well-defined across all layers (Arm 3) | None |
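
The curvature row can be made concrete with a small sketch that computes $\lambda_{\max}$ of a symmetric Hessian by shifted power iteration and divides by $2T_\ell$. Treating $T_\ell$ as a positive scalar kinetic energy (as in the classical Jacobi metric), the toy Hessian, and the function names are assumptions; the card does not say how the diagnostic battery computes this quantity.

```python
import math

def lambda_max(a, iters=500):
    """Largest (algebraic) eigenvalue of a symmetric matrix via shifted power iteration."""
    n = len(a)
    # Gershgorin bound: adding shift * I makes every eigenvalue non-negative.
    shift = max(sum(abs(x) for x in row) for row in a)
    b = [[a[i][j] + (shift if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Non-uniform start vector, to avoid being orthogonal to the top eigenvector
    # in common symmetric cases (not guaranteed for every matrix).
    v = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient with the original (unshifted) matrix.
    av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(x * y for x, y in zip(v, av)) / sum(x * x for x in v)

def max_curvature(hessian, kinetic):
    """K_max = lambda_max(Hessian of V_theta) / (2 * T), assuming kinetic > 0."""
    return lambda_max(hessian) / (2.0 * kinetic)

toy_hessian = [[2.0, 1.0], [1.0, 2.0]]  # eigenvalues 1 and 3
k_max = max_curvature(toy_hessian, kinetic=0.5)
```

For a 256-dimensional hidden state the Hessian would normally be probed through Hessian-vector products rather than materialised, which power iteration also permits.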

These structural properties enable a set of native architectural features that are planned or under investigation (Section 18d and Section 23 of the paper):

  • Geodesic Analogical Reasoning: Analogy completion via parallel transport of directed geodesic arcs on the semantic manifold, respecting potential barriers that linear embedding arithmetic ignores. The damped geodesic equation yields 3–20% cosine-similarity improvement over undamped (diagnostic battery Arm 2). Because damped geodesics are asymmetric, analogy transport must use directed arcs.
  • Native Hallucination Detection: Energy dissipation anomalies $\Delta E_{\text{anomaly}}(t) = |\Delta E(t) - \Delta E_{\text{expected}}(t)|$ and curvature spikes $\mathcal{K}_{\max}(t)$ provide mechanistically grounded uncertainty signals computable at inference time without additional parameters. The smooth damping-induced energy decay is normal operation; deviation from the expected dissipation curve flags hallucination. For Fock models, the detector needs a per-model baseline that accounts for the known layer-1 exchange transient.
  • Geodesic Semantic Distance: A replacement for cosine similarity that encodes the model's learned energy landscape, expected to outperform cosine on polysemy and cross-basin semantic cases. The geodesic distance is inherently asymmetric: $d_{\text{geo}}(A,B) \neq d_{\text{geo}}(B,A)$; a symmetrised variant $[d(A \to B) + d(B \to A)]/2$ is available when symmetry is desired.
  • Native Chain-of-Thought (via Fock extension): The Fock-PARFLM v2.1 extends this model with register-based native CoT: reasoning steps as Fock register waypoints on damped geodesics, with zero extra token generation. The diagnostic confirms Fock register dynamics are predominantly linear, with $R^2_{\text{full}} \approx 0.83$ (Arm 5), supporting the geodesic-waypoint interpretation.
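
Two of the formulas in the list above reduce to a few lines of arithmetic. The sketch below computes the energy-dissipation anomaly and the symmetrised distance; the threshold, the sample values, and the function names are hypothetical, since these features are described as planned or under investigation.

```python
def energy_anomaly(delta_e, delta_e_expected):
    """Per-position |dE(t) - dE_expected(t)|."""
    return [abs(a - b) for a, b in zip(delta_e, delta_e_expected)]

def flag_anomalies(delta_e, delta_e_expected, threshold):
    """Positions whose anomaly score exceeds a (hypothetical) threshold."""
    scores = energy_anomaly(delta_e, delta_e_expected)
    return [t for t, s in enumerate(scores) if s > threshold]

def symmetrised_distance(d_ab, d_ba):
    """[d(A -> B) + d(B -> A)] / 2 for an asymmetric geodesic distance."""
    return 0.5 * (d_ab + d_ba)

# Made-up energy changes: position 2 dissipates far more than expected.
observed = [-0.10, -0.09, -0.50, -0.08]
expected = [-0.10, -0.10, -0.09, -0.08]
flagged = flag_anomalies(observed, expected, threshold=0.1)
```

In practice the expected dissipation curve would have to be calibrated per model, as the list notes for the Fock variants.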

The conservative constraint imposes a PPL cost relative to attention, but the price buys geometric structure and interpretability that attention-based architectures are structurally incapable of providing.

How to Get Started

# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch

import torch
import sys
sys.path.insert(0, "parf")
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from parf.model_parf_multixi import MultiXiPARFLM, MultiXiPARFConfig

config = MultiXiPARFConfig(
    vocab_size=50257,
    d=256,
    n_layers=8,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma=0.30,
    xi_channels=8,
    xi_alpha_inits="log_spaced",
    v_phi_kind="structural_competitive",
    v_phi_hidden=128,
    top_k=8,
)

model = MultiXiPARFLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)

Available Checkpoint

A trained checkpoint (PPL 12.06, 8k steps) is included in this repository:

| File | Description |
|---|---|
| checkpoint/model.pt | Full model state dict (67 MB) |
| training_log.jsonl | Per-step training metrics |
| loss_curve.png | Training/validation loss plot |
| training_summary.md | Hyperparameters and final metrics |

To load the checkpoint:

from huggingface_hub import hf_hub_download
import torch

# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-parflm-multixi",
    filename="checkpoint/model.pt",
)

# Load into model (after creating model as above)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

Training Details

Training Data

TinyStories -- GPT-2 BPE tokenization. Training cap: 5M tokens.

Training Procedure

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 8,000 |
| Memory optimisation | Level-2 grad checkpoint + Stage-1.5b gathered $V_\phi$ |
| Hardware | A100 40GB (Google Colab) |
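
The schedule implied by these hyperparameters can be sketched as follows. Linear warmup and decay to zero are assumptions (the table states only "cosine decay" and 400 warmup steps), and the token count assumes a batch of 16 sequences of 512 tokens with no gradient accumulation.

```python
import math

def learning_rate(step, peak=5e-4, warmup=400, total=8000):
    """Assumed schedule: linear warmup to peak, then cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    progress = min((step - warmup) / (total - warmup), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

tokens_per_step = 16 * 512             # batch size x block size
total_tokens = tokens_per_step * 8000  # tokens processed over the whole run

for step in (0, 200, 400, 4200, 8000):
    print(f"step {step:5d}  lr = {learning_rate(step):.2e}")
print(f"tokens processed: {total_tokens:,}")
```

Under these assumptions the run processes far more tokens than the 5M-token training cap, so the capped dataset is revisited many times.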

Training Script

notebooks/conservative_arch/scaleup/train_parf_multixi_scaleup.py

Colab Notebook

notebooks/conservative_arch/scaleup/colab_parf_multixi_h128.ipynb — 6-arm sweep over channel count, α-init, top-k, and V_φ kind with live progress display, saves results to Google Drive.

Training Results

notebooks/conservative_arch/scaleup/results/semsimula_parf_multixi_h128/ — training logs, loss curves, and experiment report.

Evaluation Results

TinyStories Validation Perplexity

| Model | PPL | Params | Gap vs Attention |
|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | -- |
| Hybrid SPLM+Attn | 8.50 | ~19.0M | +0.69 |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | +1.49 |
| Fock Attention | 9.42 | 16.7M | +1.61 |
| Multi-Xi PARFLM (this model) | 12.06 | 17.6M | +4.25 |
| Multi-Xi SPLM | 11.51 | 16.5M | +3.70 |

At the same step count (8k steps), adding sparse pairwise forces $V_\phi$ improves PPL by about 2.6 over the standalone Multi-Xi SPLM, reaching 12.06. Note that the table is not step-matched: the Multi-Xi SPLM row reports 11.51 PPL after 16k steps, whereas the PARFLM was trained for only 8k steps. The step-matched gain supports the view that pairwise token interactions complement the single-body potential. The remaining gap to attention is closed further by the Fock register mechanism.
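
The "Gap vs Attention" column is a plain difference in perplexity against the matched-attention baseline, which the short check below reproduces from the PPL values in the table. The conversion to nats per token uses the standard relation PPL = exp(mean cross-entropy); the dictionary layout is only for illustration.

```python
import math

ppl = {
    "Matched Attention (baseline)": 7.81,
    "Hybrid SPLM+Attn": 8.50,
    "Fock-PARFLM v2.1": 9.30,
    "Fock Attention": 9.42,
    "Multi-Xi PARFLM": 12.06,
    "Multi-Xi SPLM": 11.51,
}

baseline = ppl["Matched Attention (baseline)"]
gaps = {name: round(p - baseline, 2) for name, p in ppl.items()}
nats_per_token = {name: math.log(p) for name, p in ppl.items()}

for name in ppl:
    print(f"{name:30s} PPL {ppl[name]:6.2f}  gap {gaps[name]:+.2f}  loss {nats_per_token[name]:.3f}")
```

Viewing the results as cross-entropy makes the differences between models easier to compare, since perplexity is exponential in the loss.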

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

| Model | Design | Inference | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM | Pure scalar potential, K-EMA context | $O(1)$ | semsimula-splm-multixi |
| Hybrid SPLM+Attn | Attention front-end + SPLM refinement | $O(T)$ | semsimula-hybrid-splm |
| Multi-Xi PARFLM | Scalar potential + sparse pairwise forces | $O(1)$ | this model |
| Fock-PARFLM v2.1 | PARFLM + Fock register pool (mediated exchange) | $O(1)$ | semsimula-fock-parflm |
| Fock Attention | PARFLM + direct token-to-token exchange | $O(T^2)$ | semsimula-fock-attention |

Bias, Risks, and Limitations

  • Research checkpoint only. Proof-of-concept for the conservative pairwise-force architecture.
  • TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
  • English only. No multilingual capability.
  • Small scale. 17.6M parameters, 256-dim hidden states.
  • No safety training. No RLHF, DPO, or safety filtering has been applied.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

  • Hardware: NVIDIA A100 40GB (Google Colab)
  • Training time: ~6 hours (8,000 steps with gradient checkpointing)
  • Carbon footprint: Estimated < 2 kg CO2