Multi-Xi PARFLM (Property-Attractive-Repulsive Force Language Model)
The Multi-Xi PARFLM extends the SPLM with pairwise token interaction forces derived from a second scalar potential $V_\phi$. While the base SPLM's $V_\theta$ provides a single-body potential (each token interacts only with a summary of its past), PARFLM adds explicit pairwise forces between tokens. These are the physics-informed analogue of attention's pairwise dot product, but because they are the gradient of a scalar potential they are conservative.
The pairwise forces use Gumbel-softmax top-k sparse routing to keep the cost at $O(T \cdot k)$ rather than $O(T^2)$. This model reaches 12.06 PPL on TinyStories, a 2.6 PPL improvement over the standalone Multi-Xi SPLM at the same step count.
Part of the Semantic Simulation framework.
Table of Contents
- Model Details
- Architecture
- Why Not a Transformer?
- Geometric Capabilities of Conservative Architectures
- How to Get Started
- Training Details
- Evaluation Results
- SPLM Family Overview
- Bias, Risks, and Limitations
- Citation
- Environmental Impact
Model Details
Model Description
The Multi-Xi PARFLM combines two SPLM extensions:
- Multi-channel K-EMA (from the Multi-Xi SPLM): K=8 learnable causal exponential moving averages giving a multi-resolution summary of the past.
- Sparse PARF pair-interactions: A second scalar potential adds particle-exchange forces between token pairs, routed via Gumbel-softmax top-k selection.
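The multi-channel K-EMA in the first bullet can be pictured with a small, dependency-free sketch. The recurrence convention (`xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t`), the time-constant range used for the log-spaced initialisation, and all function names are assumptions made for illustration. In the model the decay rates are learnable, and the repository's implementation may differ.

```python
def log_spaced_alphas(k, tau_min=2.0, tau_max=100.0):
    """K decay rates whose time constants 1/(1 - alpha) are log-spaced (assumed init)."""
    ratio = tau_max / tau_min
    taus = [tau_min * ratio ** (i / (k - 1)) for i in range(k)]
    return [1.0 - 1.0 / tau for tau in taus]

def causal_ema(h, alpha):
    """h is a list of T vectors; xi[t] depends only on h[0..t], so the summary is causal."""
    xi, state = [], [0.0] * len(h[0])
    for h_t in h:
        state = [alpha * s + (1.0 - alpha) * x for s, x in zip(state, h_t)]
        xi.append(state)
    return xi

def multi_channel_ema(h, alphas):
    """One EMA trace per channel: fast channels track recent tokens, slow ones the long past."""
    return [causal_ema(h, a) for a in alphas]

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T = 3 toy hidden states of width 2
alphas = log_spaced_alphas(8)              # K = 8 channels
channels = multi_channel_ema(h, alphas)
print(len(channels), len(channels[0]), len(channels[0][0]))
```

Each channel yields one summary vector per position, so the single-body potential sees K summaries plus the current hidden state.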
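The total potential energy for token $t$ is

$$U_t = V_\theta\big(\xi^{(1)}_t, \dots, \xi^{(K)}_t, h_t\big) + \sum_{s \in \text{top-}k(t)} V_\phi(h_t, h_s),$$

and the conservative force is $f_t = -\nabla_{h_t} U_t$.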
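To make "force as the negative gradient of a scalar potential" concrete, here is a toy sketch. The two potentials below (a quadratic bowl and a Gaussian pair term) are hypothetical stand-ins, not the model's learned MLPs, and central differences replace the autograd call the model uses so that the example needs no dependencies.

```python
import math

def v_theta(h):
    """Toy single-body potential (assumption): a quadratic bowl."""
    return 0.5 * sum(x * x for x in h)

def v_phi(h_t, h_s):
    """Toy pair potential (assumption): Gaussian interaction between two states."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(h_t, h_s)))

def total_potential(h_t, partners):
    """U_t = single-body term + sum of pair terms over the routed partners."""
    return v_theta(h_t) + sum(v_phi(h_t, h_s) for h_s in partners)

def force(h_t, partners, eps=1e-5):
    """f = -grad_h U by central differences (the model uses autograd instead)."""
    f = []
    for i in range(len(h_t)):
        up, down = list(h_t), list(h_t)
        up[i] += eps
        down[i] -= eps
        slope = (total_potential(up, partners) - total_potential(down, partners)) / (2 * eps)
        f.append(-slope)
    return f

h_t = [1.0, -0.5]
partners = [[0.0, 0.0], [1.0, 1.0]]
f = force(h_t, partners)
print(f)
```

Because every force component comes from the same scalar `total_potential`, the force field is conservative regardless of how the potentials are parameterised.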
- Developed by: Dimitar P. Gueorguiev (Independent Researcher)
- Model type: Conservative autoregressive language model with pairwise forces
- Language: English
- License: CC-BY-4.0
Model Sources
- Paper: Semantic Simulation: A Prescriptive Lagrangian Framework for Efficient Semantic Inference
- Repository: github.com/dimitarpg13/semsimula-paper
- Model source code: notebooks/conservative_arch/parf/model_parf_multixi.py
Architecture
Input tokens x_1, ..., x_T
|
Embedding E[x] + positional encoding
|
For each of L=8 integration steps:
|
+-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k) [K=8 channels]
|
+-- Single-body: V_theta([xi_1..xi_K, h]) -> R [3-layer MLP]
|
+-- Pair routing: score_head(h_t, h_s) -> top-k selection [Gumbel-softmax]
|
+-- Pair forces: V_phi(h_t, h_s) -> R [structural competitive]
|
+-- Total: U_t = V_theta + sum V_phi
|
+-- Conservative force: f = -grad_h U_t [autograd]
|
+-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
|
+-- LayerNorm(h)
|
Logits = h @ E^T [tied embeddings]
| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| $V_\theta$ hidden / depth | 1024 / 3 |
| Xi channels (K) | 8 |
| Alpha init | log-spaced |
| $V_\phi$ kind | structural_competitive |
| $V_\phi$ hidden (H) | 128 |
| Sparse routing top_k | 8 |
| Gumbel tau | 1.0 -> 0.1 (annealed) |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping | 0.30 (fixed) |
| Total parameters | 17,632,215 |
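The damped Euler step from the architecture diagram (`v += dt*f/m; v /= (1 + dt*gamma); h += dt*v`) can be sketched on a one-dimensional toy potential. The damping value 0.30 comes from the table above. The step size `dt=0.1`, the unit mass, and the quadratic potential are assumptions for illustration; the model uses a frozen log-frequency mass and learned potentials, and applies LayerNorm after each step, which is omitted here.

```python
def potential(h):
    """Toy potential (assumption): U(h) = h^2 / 2."""
    return 0.5 * h * h

def force(h):
    """Conservative force f = -dU/dh for the toy potential."""
    return -h

def damped_euler_step(h, v, mass=1.0, gamma=0.30, dt=0.1):
    v = v + dt * force(h) / mass    # kick: accelerate along the force
    v = v / (1.0 + dt * gamma)      # damping applied implicitly, always stable
    h = h + dt * v                  # drift: move the state with the new velocity
    return h, v

h, v = 1.0, 0.0
energies = [potential(h) + 0.5 * v * v]
for _ in range(8):                  # L = 8 integration steps, as in the table
    h, v = damped_euler_step(h, v)
    energies.append(potential(h) + 0.5 * v * v)

print([round(e, 4) for e in energies])
```

Dividing by `1 + dt*gamma` rather than multiplying by `1 - dt*gamma` keeps the velocity update well behaved for any positive step size, which is one plausible reason for writing the damping this way.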
Key Design Properties
- Globally conservative: Both $V_\theta$ and $V_\phi$ are scalar potentials; the total force is conservative by construction.
- Sparse routing: Gumbel-softmax top-k selection keeps pairwise cost at $O(T \cdot k)$ instead of $O(T^2)$.
- Stage-1.5b gathered $V_\phi$: Memory-efficient implementation that replaces dense all-pairs intermediates with tensors gathered over the top-k selected pairs.
- Inheritance chain: MultiXiPARFLM -> SparsePARFLM -> PARFLM (all conservative).
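The sparse-routing property can be illustrated by counting pair evaluations. The sketch below uses the hard Gumbel top-k trick (perturb the scores with Gumbel noise, keep the k largest), which is a swapped-in simplification of the annealed Gumbel-softmax relaxation the model trains with. The dot-product score, the causal restriction to `s < t`, and the function names are assumptions. Note that this toy still scores every candidate pair; only the pair-potential evaluations are sparse.

```python
import math
import random

def gumbel_topk(scores, k, rng):
    """Indices of the k largest Gumbel-perturbed scores (hard top-k trick)."""
    perturbed = []
    for idx, score in enumerate(scores):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep both logs finite
        perturbed.append((score - math.log(-math.log(u)), idx))
    perturbed.sort(reverse=True)
    return sorted(idx for _, idx in perturbed[:k])

def route_pairs(states, k, rng):
    """For each position t, pick up to k partners among earlier positions s < t."""
    routes = []
    for t, h_t in enumerate(states):
        scores = [sum(a * b for a, b in zip(h_t, states[s])) for s in range(t)]
        routes.append(gumbel_topk(scores, k, rng))
    return routes

rng = random.Random(0)
T, d, k = 64, 4, 8
states = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
routes = route_pairs(states, k, rng)

n_sparse = sum(len(r) for r in routes)   # pair-potential evaluations with routing
n_dense = T * (T - 1) // 2               # evaluations if every causal pair were used
print(n_sparse, n_dense)
```

With k fixed, the routed count grows linearly in T while the dense count grows quadratically, which is the gap the bullet refers to.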
Why Not a Transformer?
The PARFLM is not based on the Transformer architecture. There are no attention layers, no key-value cache, and no feed-forward network towers. The model uses two small scalar-potential MLPs, $V_\theta$ (single-body, ~3.4M params) and $V_\phi$ (pairwise, ~19K params), whose gradients provide conservative forces. Pairwise interactions use Gumbel-softmax top-k sparse routing at $O(T \cdot k)$ cost, not attention.
Key structural differences from Transformers:
| Property | Transformer (GPT-2 small) | Multi-Xi PARFLM (this model) |
|---|---|---|
| Architecture | Self-attention + FFN blocks | Scalar-potential gradient flow + sparse pair forces |
| Core computation | 50.3M (MLP) + 28.3M (attention) | 3.4M ($V_\theta$) + 19K ($V_\phi$) |
| Runtime state per token | KV-cache grows linearly | Fixed-size |
| Total parameters | 124M | 17.6M |
| Pairwise token interaction | dense attention | sparse routing (k=8) |
Because the model carries only a fixed-size state per position, with no KV-cache, its per-position inference memory does not grow with sequence length. The accompanying figure illustrates the widening memory gap between the Transformer's linearly growing KV-cache and the SPLM's constant-size dynamic state.
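A rough back-of-the-envelope comparison makes the scale of the difference visible. The KV-cache count uses GPT-2 small's published shape (12 layers, width 768, one key and one value vector per layer per token). What the PARFLM keeps per position is an assumption here (hidden state, velocity, and K=8 EMA channels, each of width 256); the repository may store more or less.

```python
def kv_cache_values(seq_len, n_layers=12, d_model=768):
    """Values in a Transformer KV-cache: one key and one value vector per layer per token."""
    return 2 * n_layers * d_model * seq_len

def parflm_state_values_per_position(d=256, xi_channels=8):
    """Assumed per-position state: h, v, and K EMA channels, each of width d."""
    return (2 + xi_channels) * d

for seq_len in (128, 512, 1024):
    print(seq_len, kv_cache_values(seq_len), parflm_state_values_per_position())
```

The first count scales with the number of cached tokens, while the second is a constant set by the model width and channel count.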
Geometric Capabilities of Conservative Architectures
This model is fully attention-free and conservative by construction. Because all forces derive from the gradient of a scalar potential $U$, the hidden-state manifold is endowed with a natural damped Riemannian geometry, the layer-dependent Jacobi metric, which is categorically absent from Transformer architectures. This geometry opens the door to capabilities that cannot be replicated in attention-based models:
| Capability | Conservative SPLM | Transformer |
|---|---|---|
| Riemannian metric on hidden states | Layer-dependent Jacobi metric; confirmed positive at 100% of positions (diagnostic battery Arm 1) | No metric structure |
| Geodesics between semantic states | Damped geodesic equation with a friction term; directional cosine similarity 0.52–0.75 (Arm 2). Geodesics are asymmetric: $d(a, b) \neq d(b, a)$ | Linear interpolation only |
| Controlled energy dissipation as inference signal | Monotonic damped decay with measurable anomaly signal (Arm 4) | No conserved or tracked quantity |
| Curvature as uncertainty measure | Well-defined across all layers (Arm 3) | None |
These structural properties enable a set of native architectural features that are planned or under investigation (Section 18d and Section 23 of the paper):
- Geodesic Analogical Reasoning: Analogy completion via parallel transport of directed geodesic arcs on the semantic manifold, respecting potential barriers that linear embedding arithmetic ignores. The damped geodesic equation yields 3–20% cosine-similarity improvement over undamped (diagnostic battery Arm 2). Because damped geodesics are asymmetric, analogy transport must use directed arcs.
- Native Hallucination Detection: Energy dissipation anomalies and curvature spikes provide mechanistically grounded uncertainty signals computable at inference time without additional parameters. The smooth damping-induced energy decay is normal operation; deviation from the expected dissipation curve flags hallucination. For Fock models, the detector needs a per-model baseline that accounts for the known layer-1 exchange transient.
- Geodesic Semantic Distance: A replacement for cosine similarity that encodes the model's learned energy landscape, expected to outperform cosine on polysemy and cross-basin semantic cases. The geodesic distance is inherently asymmetric, $d(a, b) \neq d(b, a)$; a symmetrised variant is available when symmetry is desired.
- Native Chain-of-Thought (via Fock extension): The Fock-PARFLM v2.1 extends this model with register-based native CoT: reasoning steps become Fock register waypoints on damped geodesics, with zero extra token generation. The diagnostic confirms that Fock register dynamics are predominantly linear (Arm 5), supporting the geodesic-waypoint interpretation.
The conservative constraint imposes a PPL cost relative to attention, but the price buys geometric structure and interpretability that attention-based architectures are structurally incapable of providing.
How to Get Started
# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch
import torch
import sys
sys.path.insert(0, "parf")
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")
from parf.model_parf_multixi import MultiXiPARFLM, MultiXiPARFConfig
config = MultiXiPARFConfig(
    vocab_size=50257,
    d=256,
    n_layers=8,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma=0.30,
    xi_channels=8,
    xi_alpha_inits="log_spaced",
    v_phi_kind="structural_competitive",
    v_phi_hidden=128,
    top_k=8,
)
model = MultiXiPARFLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)
Available Checkpoint
A trained checkpoint (PPL 12.06, 8k steps) is included in this repository:
| File | Description |
|---|---|
| checkpoint/model.pt | Full model state dict (67 MB) |
| training_log.jsonl | Per-step training metrics |
| loss_curve.png | Training/validation loss plot |
| training_summary.md | Hyperparameters and final metrics |
To load the checkpoint:
from huggingface_hub import hf_hub_download
import torch
# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-parflm-multixi",
    filename="checkpoint/model.pt",
)
# Load into model (after creating model as above)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()
Training Details
Training Data
TinyStories -- GPT-2 BPE tokenization. Training cap: 5M tokens.
Training Procedure
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 8,000 |
| Memory optimisation | Level-2 grad checkpoint + Stage-1.5b gathered $V_\phi$ |
| Hardware | A100 40GB (Google Colab) |
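The learning-rate row (5e-4 peak, cosine decay, 400 warmup steps, 8,000 total steps) corresponds to a schedule along the following lines. The linear warmup shape and the decay floor of zero are assumptions, since the table does not state them, and `lr_at` is a hypothetical helper name rather than a function from the training script.

```python
import math

def lr_at(step, peak=5e-4, warmup=400, total=8000, floor=0.0):
    """Linear warmup to `peak`, then cosine decay to `floor` at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

for step in (0, 200, 400, 4200, 8000):
    print(step, f"{lr_at(step):.2e}")
```

Such a function can be wrapped in a per-step scheduler callback for AdamW; the training script listed below is the authoritative reference for what was actually run.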
Training Script
notebooks/conservative_arch/scaleup/train_parf_multixi_scaleup.py
Colab Notebook
notebooks/conservative_arch/scaleup/colab_parf_multixi_h128.ipynb — 6-arm sweep over channel count, α-init, top-k, and V_φ kind with live progress display, saves results to Google Drive.
Training Results
notebooks/conservative_arch/scaleup/results/semsimula_parf_multixi_h128/ — training logs, loss curves, and experiment report.
Evaluation Results
TinyStories Validation Perplexity
| Model | PPL | Params | Gap vs Attention |
|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | -- |
| Hybrid SPLM+Attn | 8.50 | ~19.0M | +0.69 |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | +1.49 |
| Fock Attention | 9.42 | 16.7M | +1.61 |
| Multi-Xi PARFLM (this model, 8k steps) | 12.06 | 17.6M | +4.25 |
| Multi-Xi SPLM (16k steps) | 11.51 | 16.5M | +3.70 |
At the same step count (8k), adding sparse pairwise forces improves PPL by about 2.6 over the standalone Multi-Xi SPLM, reaching 12.06. The 11.51 PPL listed for the Multi-Xi SPLM is its 16k-step result, whereas the PARFLM was trained for only 8k steps, so those two table rows are not step-matched. The step-matched improvement supports the view that pairwise token interactions complement the single-body potential. The remaining gap to attention is closed further by the Fock register mechanism.
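The "Gap vs Attention" column is a plain difference of perplexities, and each perplexity maps to a per-token cross-entropy through the standard relation PPL = exp(loss). The short script below recomputes both from the table values; it assumes the reported numbers are per-token perplexities with the loss in nats.

```python
import math

ppl = {
    "Matched Attention": 7.81,
    "Hybrid SPLM+Attn": 8.50,
    "Fock-PARFLM v2.1": 9.30,
    "Fock Attention": 9.42,
    "Multi-Xi PARFLM": 12.06,
    "Multi-Xi SPLM": 11.51,
}
baseline = ppl["Matched Attention"]

gaps = {name: round(value - baseline, 2) for name, value in ppl.items()}
losses = {name: math.log(value) for name, value in ppl.items()}  # nats per token

for name in ppl:
    print(f"{name:20s} PPL={ppl[name]:5.2f} gap={gaps[name]:+.2f} loss={losses[name]:.3f}")
```

Comparing losses rather than perplexities is often easier to reason about, because differences in loss are additive while perplexity differences compound.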
SPLM Family Overview
This model is part of the Semantic Simulation SPLM family:
| Model | Design | Inference | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM | Pure scalar potential, K-EMA context | | semsimula-splm-multixi |
| Hybrid SPLM+Attn | Attention front-end + SPLM refinement | | semsimula-hybrid-splm |
| Multi-Xi PARFLM | Scalar potential + sparse pairwise forces | | this model |
| Fock-PARFLM v2.1 | PARFLM + Fock register pool (mediated exchange) | | semsimula-fock-parflm |
| Fock Attention | PARFLM + direct token-to-token exchange | | semsimula-fock-attention |
Bias, Risks, and Limitations
- Research checkpoint only. Proof-of-concept for the conservative pairwise-force architecture.
- TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
- English only. No multilingual capability.
- Small scale. 17.6M parameters, 256-dim hidden states.
- No safety training. No RLHF, DPO, or safety filtering has been applied.
Citation
@misc{Gueorguiev2026SemSim,
author = {Gueorguiev, Dimitar P.},
title = {Semantic Simulation: A Prescriptive Lagrangian Framework
for Efficient Semantic Inference --- A Conservative-by-
Construction Language Model and the Shared-Potential
Separator, with a Correspondence to Joint Embedding
Predictive Architectures},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19712427},
url = {https://doi.org/10.5281/zenodo.19712427},
note = {Version v15 (Jun 7, 2026).
Companion code repository (DOI 10.5281/zenodo.20579561):
\url{https://github.com/dimitarpg13/semsimula-paper}}
}
Environmental Impact
- Hardware: NVIDIA A100 40GB (Google Colab)
- Training time: ~6 hours (8,000 steps with gradient checkpointing)
- Carbon footprint: Estimated < 2 kg CO2
