Multi-Xi PARFLM (Property-Attractive-Repulsive Force Language Model)

The Multi-Xi PARFLM extends the SPLM with pairwise token-interaction forces derived from a second scalar potential $V_\phi$. The base SPLM's $V_\theta$ is a single-body potential (each token interacts only with a summary of its past); PARFLM adds an explicit pairwise potential $V_\phi(h_t, h_s)$ between tokens. This is the physics-informed analogue of attention's pairwise dot product, except that the interaction is the gradient of a scalar potential and is therefore conservative.

The pairwise forces use Gumbel-softmax top-k sparse routing to keep the cost at $O(Tk)$ rather than $O(T^2)$. This model achieves 12.06 PPL on TinyStories after 8k training steps, a 2.6 PPL improvement over the standalone Multi-Xi SPLM at the same step count.

Part of the Semantic Simulation framework.

Model Details
Architecture
Why Not a Transformer?
Geometric Capabilities of Conservative Architectures
How to Get Started
Training Details
Evaluation Results
SPLM Family Overview
Bias, Risks, and Limitations
Citation
Environmental Impact

Model Details

Model Description

The Multi-Xi PARFLM combines two SPLM extensions:

Multi-channel K-EMA $\xi$ (from the Multi-Xi SPLM): K=8 learnable causal exponential moving averages giving $V_\theta$ a multi-resolution summary of the past.
Sparse PARF pair-interactions: A second scalar potential $V_\phi(h_t, h_s)$ adds particle-exchange forces between token pairs, routed via Gumbel-softmax top-k selection.
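
The multi-channel K-EMA summary can be illustrated with a minimal sketch. The recurrence $\xi^{(k)}_t = \alpha_k \xi^{(k)}_{t-1} + (1-\alpha_k) h_t$, the alpha range, and the helper names `log_spaced_alphas` and `causal_ema` are assumptions made for illustration; the repository's implementation (where the alphas are learnable) may differ.

```python
import math
import torch

def log_spaced_alphas(k: int, lo: float = 0.5, hi: float = 0.99) -> torch.Tensor:
    """K decay rates whose time constants 1/(1-alpha) are log-spaced (assumed range)."""
    taus = torch.logspace(math.log10(1.0 / (1.0 - lo)), math.log10(1.0 / (1.0 - hi)), k)
    return 1.0 - 1.0 / taus

def causal_ema(h: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """h: (T, d) hidden states -> xi: (K, T, d); xi[k, t] depends only on h[:t+1]."""
    T, d = h.shape
    xi = torch.zeros(alphas.numel(), T, d, dtype=h.dtype)
    state = torch.zeros(alphas.numel(), d, dtype=h.dtype)
    a = alphas.to(h.dtype).unsqueeze(1)          # (K, 1), broadcasts over d
    for t in range(T):
        state = a * state + (1.0 - a) * h[t]     # one recurrence step per channel
        xi[:, t] = state
    return xi

h = torch.randn(16, 4)
xi = causal_ema(h, log_spaced_alphas(8))
print(xi.shape)  # torch.Size([8, 16, 4])
```

Channels with alpha near 1 average over a long history, while channels with small alpha track recent tokens, which is one way to read the "multi-resolution summary" above.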

The total potential energy for token t is:

$U_t = V_\theta(\xi_t, h_t) + \sum_{s<t} m_{ts} \cdot V_\phi(h_t, h_s)$

and the conservative force is $f_t = -\nabla_{h_t} U_t$.
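
The relation $f_t = -\nabla_{h_t} U_t$ maps directly onto autograd. The sketch below uses a toy stand-in potential (a quadratic well around a fixed vector `c`); `conservative_force` and `toy_potential` are hypothetical names, and the real $V_\theta$ and $V_\phi$ are MLPs that are not reproduced here.

```python
import torch

def conservative_force(potential, h: torch.Tensor) -> torch.Tensor:
    """Return f = -dU/dh for a position-wise scalar potential U(h), via autograd."""
    h = h.detach().requires_grad_(True)
    u = potential(h).sum()            # potential acts per position, so row t of the grad is dU_t/dh_t
    (grad,) = torch.autograd.grad(u, h)
    return -grad

# Toy stand-in: U_t = 0.5 * ||h_t - c||^2, a well centred on a fixed vector c.
c = torch.tensor([1.0, -2.0, 0.5])
toy_potential = lambda h: 0.5 * ((h - c) ** 2).sum(dim=-1)

h = torch.zeros(4, 3)
f = conservative_force(toy_potential, h)
print(f[0])  # equals c: the force points toward the bottom of the well
```

This version detaches `h`, which suits inspection at inference time. For training, the force itself has to stay differentiable, which in PyTorch means calling `torch.autograd.grad` with `create_graph=True`.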

Developed by: Dimitar P. Gueorguiev (Independent Researcher)
Model type: Conservative autoregressive language model with pairwise forces
Language: English
License: CC-BY-4.0

Model Sources

Paper: Semantic Simulation: A Prescriptive Lagrangian Framework for Efficient Semantic Inference
Repository: github.com/dimitarpg13/semsimula-paper
Model source code: notebooks/conservative_arch/parf/model_parf_multixi.py

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K=8 channels]
       |
       +-- Single-body: V_theta([xi_1..xi_K, h]) -> R           [3-layer MLP]
       |
       +-- Pair routing: score_head(h_t, h_s) -> top-k selection [Gumbel-softmax]
       |
       +-- Pair forces: V_phi(h_t, h_s) -> R                    [structural competitive]
       |
       +-- Total: U_t = V_theta + sum V_phi
       |
       +-- Conservative force: f = -grad_h U_t                  [autograd]
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
       |
       +-- LayerNorm(h)
       |
   Logits = h @ E^T                                            [tied embeddings]

Parameter	Value
Hidden dim (d)	256
Layers (L)	8
$V_\theta$ hidden / depth	1024 / 3
Xi channels (K)	8
Alpha init	log-spaced
$V_\phi$ kind	structural_competitive
$V_\phi$ hidden (H)	128
Sparse routing top_k	8
Gumbel tau	1.0 -> 0.1 (annealed)
Mass model	logfreq (frozen surprisal lookup)
Damping $\gamma$	0.30 (fixed)
Total parameters	17,632,215
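
The damped Euler update from the diagram (`v += dt*f/m; v /= (1 + dt*gamma); h += dt*v`) can be sketched as follows. The step size `dt`, the scalar mass, and the harmonic toy force are assumptions chosen for illustration; only `gamma=0.30` and the 8 integration steps come from the table above.

```python
import torch

def damped_euler_step(h, v, force_fn, mass, dt=1.0, gamma=0.30):
    """One step of: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v."""
    f = force_fn(h)
    v = (v + dt * f / mass) / (1.0 + dt * gamma)   # kick, then implicit damping
    h = h + dt * v                                  # drift with the updated velocity
    return h, v

# Toy harmonic well U(h) = 0.5*||h||^2, so the force is simply -h.
h = torch.tensor([[1.0, 0.0]])
v = torch.zeros_like(h)
for _ in range(8):                                  # L = 8 integration steps
    h, v = damped_euler_step(h, v, lambda x: -x, mass=1.0, dt=0.5)
print(h.norm().item())                              # smaller than the initial norm of 1.0
```

Dividing by `1 + dt*gamma` treats the friction term implicitly, which keeps the velocity update stable for any positive damping.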

Key Design Properties

Globally conservative: Both $V_\theta$ and $V_\phi$ are scalar potentials; the total force $f = -\nabla(V_\theta + \sum V_\phi)$ is conservative by construction.
Sparse routing: Gumbel-softmax top-k selection keeps pairwise cost at $O(Tk)$ instead of $O(T^2)$.
Stage-1.5b gathered $V_\phi$: Memory-efficient implementation replacing $O(T^2)$ intermediates with $O(Tk)$.
Inheritance chain: MultiXiPARFLM -> SparsePARFLM -> PARFLM (all conservative).
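
A rough sketch of causal Gumbel top-k routing and the gathered pair lookup is given below. It is an illustration under stated assumptions, not the repository's code: `gumbel_topk_route` and `gather_partners` are hypothetical names, the softmax weights stand in for the routing coefficients, and the dense `(T, T)` score matrix is used only for clarity (it is itself $O(T^2)$, so it does not show how the actual implementation reaches $O(Tk)$).

```python
import torch

def gumbel_topk_route(scores: torch.Tensor, k: int, tau: float = 1.0):
    """scores: (T, T) pair scores -> (idx, weights), each (T, k), over past tokens s < t."""
    T = scores.shape[0]
    u = torch.rand_like(scores).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u))                    # Gumbel(0, 1) noise
    perturbed = (scores + gumbel) / tau
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)  # True where s < t
    perturbed = perturbed.masked_fill(~causal, float("-inf"))
    top_vals, idx = perturbed.topk(min(k, T), dim=-1)
    # Masked slots get weight 0; the first row has no predecessors and becomes all zeros.
    weights = torch.softmax(top_vals, dim=-1).nan_to_num(0.0)
    return idx, weights

def gather_partners(h: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """h: (T, d), idx: (T, k) -> (T, k, d), so pair terms touch only T*k states."""
    return h[idx]

torch.manual_seed(0)
T, k, d = 12, 4, 3
idx, w = gumbel_topk_route(torch.randn(T, T), k, tau=0.5)
partners = gather_partners(torch.randn(T, d), idx)
print(idx.shape, w.shape, partners.shape)
```

Lower `tau` sharpens the weights toward the single best-scoring partner, which matches the direction of the 1.0 -> 0.1 annealing listed in the parameter table.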

Why Not a Transformer?

The PARFLM is not based on the Transformer architecture. There are no attention layers, no key-value cache, and no feed-forward network towers. The model uses two small scalar-potential MLPs: $V_\theta$ (single-body, ~3.4M params) and $V_\phi$ (pairwise, ~19K params) whose gradients provide conservative forces. Pairwise interactions use Gumbel-softmax top-k sparse routing at $O (T k)$ cost — not $O (T^{2})$ attention.

Key structural differences from Transformers:

Property	Transformer (GPT-2 small)	Multi-Xi PARFLM (this model)
Architecture	Self-attention + FFN blocks	Scalar-potential gradient flow + sparse pair forces
Core computation (params)	56.6M (MLP) + 28.3M (attention)	3.4M $V_\theta$ + 19K $V_\phi$
Runtime state per token	$O (T)$ — KV-cache grows linearly	$O (1)$ — fixed-size $h, v, \xi$
Total parameters	124M	17.6M
Pairwise token interaction	$O (T^{2})$ dense attention	$O (T k)$ sparse routing (k=8)

Because the model carries only a fixed-size state $(h, v, \xi)$ per position, with no KV-cache, its inference memory is $O(1)$ in sequence length. The figure below illustrates the widening memory gap between the Transformer's linearly growing KV-cache and the SPLM's constant-size dynamic state:
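
The memory comparison can be made concrete with back-of-the-envelope arithmetic. The sketch below takes the constant-state claim at face value and assumes a standard GPT-2 small shape (12 layers, width 768) with an fp16 cache and an fp32 state made of $h$, $v$ and K=8 $\xi$ channels of width 256; it does not account for any memory the pair routing may need for past states.

```python
def kv_cache_bytes(seq_len, n_layers=12, d_model=768, bytes_per_value=2):
    """Keys and values for every layer and every past token (GPT-2 small shape, fp16)."""
    return 2 * n_layers * seq_len * d_model * bytes_per_value

def fixed_state_bytes(d=256, xi_channels=8, bytes_per_value=4):
    """One copy each of h and v plus K xi channels, all of width d (fp32)."""
    return (2 + xi_channels) * d * bytes_per_value

for seq_len in (128, 1024, 8192):
    print(seq_len, kv_cache_bytes(seq_len), fixed_state_bytes())
```

Under these assumptions the cache grows in proportion to the sequence length while the dynamic state stays at a few kilobytes.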

Geometric Capabilities of Conservative Architectures

This model is fully attention-free and conservative by construction. Because all forces derive from the gradient of a scalar potential $V_\theta$, the hidden-state manifold is endowed with a natural damped Riemannian geometry (the layer-dependent Jacobi metric $\Omega^2_\ell = 2T_\ell \cdot m$), which is categorically absent from Transformer architectures. This geometry opens the door to capabilities that cannot be replicated in attention-based models:

Capability	Conservative SPLM	Transformer
Riemannian metric on hidden states	Layer-dependent Jacobi metric $\Omega^2_\ell = 2T_\ell \cdot m$ from $V_\theta$ ; confirmed positive at 100% of positions (diagnostic battery Arm 1)	No metric structure
Geodesics between semantic states	Damped geodesic equation with friction term $-\gamma v$ ; directional cosine similarity 0.52–0.75 (Arm 2). Geodesics are asymmetric: $d(A \to B) \neq d(B \to A)$	Linear interpolation only
Controlled energy dissipation as inference signal	$\Delta E_{\text{anomaly}}(t) = |\Delta E(t) - \Delta E_{\text{expected}}(t)|$; monotonic damped decay with measurable anomaly signal (Arm 4)	No conserved or tracked quantity
Curvature as uncertainty measure	$\mathcal{K}_{\max} = \lambda_{\max}(\nabla^2 V_\theta) / 2T_\ell$ ; well-defined across all layers (Arm 3)	None
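
The curvature measure $\mathcal{K}_{\max} = \lambda_{\max}(\nabla^2 V_\theta) / 2T_\ell$ from the table can be sketched for a toy potential. Reading $T_\ell$ as a kinetic energy passed in by the caller is an assumption, as are the names `max_curvature` and `toy_v`; forming the full Hessian is only practical for small hidden sizes.

```python
import torch
from torch.autograd.functional import hessian

def max_curvature(potential, h: torch.Tensor, kinetic: float) -> float:
    """K_max = lambda_max(Hessian of V at h) / (2 * T_l), with T_l supplied as `kinetic`."""
    hess = hessian(potential, h)                 # (d, d) for a 1-D input h
    lam_max = torch.linalg.eigvalsh(hess)[-1]    # eigenvalues come back in ascending order
    return (lam_max / (2.0 * kinetic)).item()

# Toy anisotropic well V(h) = 0.5 * sum_i a_i * h_i^2, whose Hessian is diag(a).
a = torch.tensor([1.0, 3.0, 0.5])
toy_v = lambda h: 0.5 * (a * h ** 2).sum()
k_max = max_curvature(toy_v, torch.randn(3), kinetic=0.75)
print(k_max)   # approximately 3.0 / 1.5 = 2.0
```

At realistic widths, the largest eigenvalue would more plausibly be estimated with power iteration on Hessian-vector products than with a dense Hessian.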

These structural properties enable a set of native architectural features that are planned or under investigation (Section 18d and Section 23 of the paper):

Geodesic Analogical Reasoning: Analogy completion via parallel transport of directed geodesic arcs on the semantic manifold, respecting potential barriers that linear embedding arithmetic ignores. The damped geodesic equation yields 3–20% cosine-similarity improvement over undamped (diagnostic battery Arm 2). Because damped geodesics are asymmetric, analogy transport must use directed arcs.
Native Hallucination Detection: Energy dissipation anomalies $\Delta E_{\text{anomaly}}(t) = |\Delta E(t) - \Delta E_{\text{expected}}(t)|$ and curvature spikes $\mathcal{K}_{\max}(t)$ provide mechanistically grounded uncertainty signals computable at inference time without additional parameters. The smooth damping-induced energy decay is normal operation; deviation from the expected dissipation curve flags hallucination. For Fock models, the detector needs a per-model baseline that accounts for the known layer-1 exchange transient.
Geodesic Semantic Distance: A replacement for cosine similarity that encodes the model's learned energy landscape, expected to outperform cosine on polysemy and cross-basin semantic cases. The geodesic distance is inherently asymmetric: $d_{\text{geo}}(A,B) \neq d_{\text{geo}}(B,A)$ ; a symmetrised variant $[d(A \to B) + d(B \to A)]/2$ is available when symmetry is desired.
Native Chain-of-Thought (via Fock extension): The Fock-PARFLM v2.1 extends this model with register-based native CoT — reasoning steps as Fock register waypoints on damped geodesics, with zero extra token generation. The diagnostic confirms Fock register dynamics are predominantly linear — $R^2_{\text{full}} \approx 0.83$ (Arm 5), supporting the geodesic-waypoint interpretation.
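
The dissipation-anomaly signal $|\Delta E(t) - \Delta E_{\text{expected}}(t)|$ described above reduces to a few lines once an energy trace is available. In this sketch the trace, the expected-dissipation baseline, the threshold, and the function names are all hypothetical; how the baseline is obtained and whether $t$ indexes layers or tokens is left to the paper.

```python
def energy_anomaly(energies, expected_deltas):
    """Per-step |dE(t) - dE_expected(t)| from a trace of total energies E(0..n)."""
    deltas = [e1 - e0 for e0, e1 in zip(energies, energies[1:])]
    return [abs(d - d_exp) for d, d_exp in zip(deltas, expected_deltas)]

def flag_anomalies(energies, expected_deltas, threshold):
    """Indices of steps whose dissipation deviates from the baseline by more than `threshold`."""
    scores = energy_anomaly(energies, expected_deltas)
    return [t for t, s in enumerate(scores) if s > threshold]

# Hypothetical trace: smooth decay, except step 2 where the energy rises instead of falling.
trace = [10.0, 8.0, 6.5, 7.5, 6.5]
baseline = [-2.0, -1.5, -1.0, -1.0]        # assumed expected dissipation per step
print(flag_anomalies(trace, baseline, threshold=1.0))   # [2]
```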

The conservative constraint imposes a PPL cost relative to attention, but the price buys geometric structure and interpretability that attention-based architectures are structurally incapable of providing.

How to Get Started

# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch

import torch
import sys
sys.path.insert(0, "parf")
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from parf.model_parf_multixi import MultiXiPARFLM, MultiXiPARFConfig

config = MultiXiPARFConfig(
    vocab_size=50257,
    d=256,
    n_layers=8,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma=0.30,
    xi_channels=8,
    xi_alpha_inits="log_spaced",
    v_phi_kind="structural_competitive",
    v_phi_hidden=128,
    top_k=8,
)

model = MultiXiPARFLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)
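
A greedy decoding loop built on the same call pattern might look like the sketch below. It assumes the returned logits have shape `(batch, T, vocab)`, which is not confirmed by the snippet above; `greedy_generate` and `StandInModel` are hypothetical, and the stand-in exists only so the example runs without the repository.

```python
import torch

class StandInModel(torch.nn.Module):
    """Hypothetical placeholder that mimics the (logits, loss) return convention."""
    def __init__(self, vocab_size: int = 50, d: int = 8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)

    def forward(self, x, targets=None):
        logits = self.emb(x) @ self.emb.weight.T       # (B, T, V), tied embeddings
        return logits, None

def greedy_generate(model, prompt_ids, n_new, block_size=512):
    """Append n_new greedily chosen token ids to prompt_ids of shape (1, T0)."""
    ids = prompt_ids
    for _ in range(n_new):
        window = ids[:, -block_size:]                  # crop to the configured block size
        logits, _ = model(window, targets=window)      # same call pattern as the forward pass above
        next_id = logits[:, -1, :].detach().argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

out = greedy_generate(StandInModel(), torch.randint(0, 50, (1, 5)), n_new=3)
print(out.shape)   # torch.Size([1, 8])
```

The loop deliberately avoids `torch.no_grad()`: the forward pass derives forces by autograd, so it may need gradient tracking enabled even at inference. Whether the repository handles this internally is not verified here.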

Available Checkpoint

A trained checkpoint (PPL 12.06, 8k steps) is included in this repository:

File	Description
`checkpoint/model.pt`	Full model state dict (67 MB)
`training_log.jsonl`	Per-step training metrics
`loss_curve.png`	Training/validation loss plot
`training_summary.md`	Hyperparameters and final metrics

To load the checkpoint:

from huggingface_hub import hf_hub_download
import torch

# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-parflm-multixi",
    filename="checkpoint/model.pt",
)

# Load into model (after creating model as above)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

Training Details

Training Data

TinyStories -- GPT-2 BPE tokenization. Training cap: 5M tokens.

Training Procedure

Hyperparameter	Value
Optimizer	AdamW
Learning rate	5e-4 (cosine decay)
Warmup steps	400
Weight decay	0.01
Gradient clipping	1.0
Batch size	16
Block size	512
Training steps	8,000
Memory optimisation	Level-2 grad checkpoint + Stage-1.5b gathered $V_\phi$
Hardware	A100 40GB (Google Colab)
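
The learning-rate schedule implied by the table (400 warmup steps, cosine decay from 5e-4 over 8,000 steps) can be sketched as below. The linear warmup shape and the final learning rate of zero are assumptions, and `lr_at` is a hypothetical helper rather than the training script's function.

```python
import math

def lr_at(step, base_lr=5e-4, warmup=400, total=8000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr at `total` steps."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 399, 400, 4200, 8000):
    print(s, f"{lr_at(s):.2e}")
```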

Training Script

notebooks/conservative_arch/scaleup/train_parf_multixi_scaleup.py

Colab Notebook

notebooks/conservative_arch/scaleup/colab_parf_multixi_h128.ipynb — 6-arm sweep over channel count, α-init, top-k, and V_φ kind with live progress display, saves results to Google Drive.

Training Results

notebooks/conservative_arch/scaleup/results/semsimula_parf_multixi_h128/ — training logs, loss curves, and experiment report.

Evaluation Results

TinyStories Validation Perplexity

Model	PPL	Params	Gap vs Attention
Matched Attention (baseline)	7.81	19.5M	--
Hybrid SPLM+Attn	8.50	~19.0M	+0.69
Fock-PARFLM v2.1	9.30	17.4M	+1.49
Fock Attention	9.42	16.7M	+1.61
Multi-Xi SPLM (16k steps)	11.51	16.5M	+3.70
Multi-Xi PARFLM (this model, 8k steps)	12.06	17.6M	+4.25

At a matched step count (8k), adding the sparse pairwise forces from $V_\phi$ improves PPL by about 2.6 over the standalone Multi-Xi SPLM, reaching 12.06. The 11.51 PPL listed for the Multi-Xi SPLM comes from a longer 16k-step run, whereas this PARFLM was trained for only 8k steps, so those two table rows are not step-matched. The matched-step result suggests that pairwise token interactions complement the single-body potential. The remaining gap to attention is narrowed further by the Fock register mechanism (Fock-PARFLM v2.1, 9.30 PPL).
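
For reference, the perplexity and gap columns can be reproduced with a few lines. The sketch assumes the reported PPL is the exponential of the mean per-token cross-entropy in nats, which is the usual convention but is not stated explicitly here; `perplexity` and `gap_vs_baseline` are hypothetical helper names.

```python
import math

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def gap_vs_baseline(ppl, baseline_ppl=7.81):
    """Absolute PPL gap to the matched-attention baseline, as in the table's last column."""
    return round(ppl - baseline_ppl, 2)

print(gap_vs_baseline(12.06))          # 4.25
print(round(math.log(12.06), 3))       # mean NLL in nats implied by PPL 12.06
```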

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

Model	Design	Inference	HuggingFace
Multi-Xi SPLM	Pure scalar potential, K-EMA context	$O (1)$	semsimula-splm-multixi
Hybrid SPLM+Attn	Attention front-end + SPLM refinement	$O (T)$	semsimula-hybrid-splm
Multi-Xi PARFLM	Scalar potential + sparse pairwise forces	$O (1)$	this model
Fock-PARFLM v2.1	PARFLM + Fock register pool (mediated exchange)	$O (1)$	semsimula-fock-parflm
Fock Attention	PARFLM + direct token-to-token exchange	$O (T^{2})$	semsimula-fock-attention

Bias, Risks, and Limitations

Research checkpoint only. Proof-of-concept for the conservative pairwise-force architecture.
TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
English only. No multilingual capability.
Small scale. 17.6M parameters, 256-dim hidden states.
No safety training. No RLHF, DPO, or safety filtering has been applied.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

Hardware: NVIDIA A100 40GB (Google Colab)
Training time: ~6 hours (8,000 steps with gradient checkpointing)
Carbon footprint: Estimated < 2 kg CO2
