Multi-Xi SPLM (Scalar-Potential Language Model)
The Multi-Xi SPLM is a conservative-by-construction autoregressive language model whose inference is a damped Euler--Lagrange flow on a single learned scalar energy field. Unlike attention-based transformers, the model's next-token dynamics derive entirely from the gradient of a shared scalar potential $V_\theta$, making the inference trajectory globally conservative and endowing the hidden-state manifold with a natural Riemannian geometry (the Jacobi metric).
This is the standalone (no attention) variant from the Semantic Simulation framework.
Table of Contents
- Model Details
- Architecture
- Why Not a Transformer?
- Geometric Capabilities of Conservative Architectures
- How to Get Started
- Training Details
- Evaluation Results
- SPLM Family Overview
- Bias, Risks, and Limitations
- Citation
- Environmental Impact
Model Details
Model Description
The Multi-Xi SPLM replaces the single causal cumulative-mean context summary of the baseline SPLM with K causal exponential moving averages (K-EMA) at multiple learnable decay scales. This gives the scalar potential a multi-resolution summary of the past, addressing the rank-1 information bottleneck that limits baseline SPLM performance.
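To make the contrast concrete, the following minimal sketch compares a causal cumulative mean with K causal EMAs. The card does not state the exact EMA convention, so the recurrence used here (`xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t`, starting from zero) and the helper names `causal_cummean` and `causal_ema` are assumptions for illustration, not the repository's implementation.

```python
import torch

def causal_cummean(h: torch.Tensor) -> torch.Tensor:
    """Causal cumulative mean over time. h: (B, T, d) -> (B, T, d)."""
    T = h.shape[1]
    counts = torch.arange(1, T + 1, dtype=h.dtype, device=h.device).view(1, T, 1)
    return h.cumsum(dim=1) / counts

def causal_ema(h: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """K causal EMAs. h: (B, T, d), alphas: (K,) -> (B, T, K, d).

    Assumed recurrence: xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t, xi_{-1} = 0.
    """
    B, T, d = h.shape
    a = alphas.view(1, -1, 1)                      # (1, K, 1), broadcasts over B and d
    xi = torch.zeros(B, alphas.numel(), d, dtype=h.dtype, device=h.device)
    out = []
    for t in range(T):
        xi = a * xi + (1.0 - a) * h[:, t].unsqueeze(1)   # position t sees only h[:, :t+1]
        out.append(xi)
    return torch.stack(out, dim=1)

torch.manual_seed(0)
h = torch.randn(2, 6, 4)
alphas = torch.tensor([0.25, 0.5, 0.75, 0.95])     # the init values listed in this card
xi = causal_ema(h, alphas)                          # four summaries per position
cm = causal_cummean(h)                              # one summary per position
print(xi.shape, cm.shape)
```

Each channel forgets the past at a different rate, which is what gives the potential several time scales to work with instead of a single running average.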
At each integration step, the model computes:
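- K causal EMA channels: $\xi^{(k)}_t = \mathrm{causal\_ema}(h, \alpha_k)$ for $k = 1, \dots, K$ (learnable $\alpha_k$)
- Scalar potential: $V_\theta([\xi^{(1)}_t, \dots, \xi^{(K)}_t, h_t]) \in \mathbb{R}$ (wide MLP)
- Conservative force: $f = -\nabla_h V_\theta$
- Damped dynamics: semi-implicit Euler with per-token mass $m$ and fixed damping $\gamma$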
- Developed by: Dimitar P. Gueorguiev (Independent Researcher)
- Model type: Conservative autoregressive language model (Lagrangian dynamics)
- Language: English
- License: CC-BY-4.0
Model Sources
- Paper: Semantic Simulation: A Prescriptive Lagrangian Framework for Efficient Semantic Inference
- Repository: github.com/dimitarpg13/semsimula-paper
- Model source code:
notebooks/conservative_arch/multixi/model_multixi.py
Architecture
Input tokens x_1, ..., x_T
|
Embedding E[x] + positional encoding
|
For each of L=8 integration steps:
|
+-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k) [K=4 channels]
|
+-- Scalar potential: V_theta([xi_1..xi_K, h]) -> R [3-layer MLP, hidden=1024]
|
+-- Conservative force: f = -grad_h V_theta [autograd]
|
+-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
|
+-- LayerNorm(h) [LN-after-step]
|
Logits = h @ E^T [tied embeddings]
| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| $V_\theta$ hidden / depth | 1024 / 3 |
| Xi channels (K) | 4 |
| Alpha init | [0.25, 0.5, 0.75, 0.95] (learned from uniform) |
| $V_\theta$ input dim | $(K+1)\,d = 1280$ |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping | 0.30 (fixed) |
| Total parameters | 16,539,911 |
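The integration loop in the architecture diagram can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the repository's code: `PotentialMLP`, `damped_step`, the GELU activation, the step size `dt=1.0`, unit masses, and the toy dimensions are all hypothetical. Only the update order (autograd force, velocity kick, damping division, position update, LayerNorm) is taken from the diagram. For brevity the context channels `xi` are held fixed, whereas the diagram recomputes them from `h` at every step.

```python
import torch
import torch.nn as nn

class PotentialMLP(nn.Module):
    """Hypothetical stand-in for V_theta: maps [xi_1..xi_K, h] to one scalar per token."""
    def __init__(self, d: int, k: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear((k + 1) * d, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xi: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # xi: (B, T, K, d), h: (B, T, d) -> energies of shape (B, T)
        z = torch.cat([xi.flatten(start_dim=2), h], dim=-1)
        return self.net(z).squeeze(-1)

def damped_step(V, ln, xi, h, v, m, dt=1.0, gamma=0.30):
    """v += dt*f/m; v /= (1 + dt*gamma); h += dt*v; then LayerNorm(h)."""
    if not h.requires_grad:
        h = h.detach().requires_grad_(True)
    energy = V(xi, h).sum()
    # create_graph=True keeps the force differentiable, so a loss can train V.
    (grad_h,) = torch.autograd.grad(energy, h, create_graph=True)
    f = -grad_h                                   # conservative force
    v = (v + dt * f / m) / (1.0 + dt * gamma)     # kick, then implicit damping
    h = ln(h + dt * v)                            # position update + LN-after-step
    return h, v

torch.manual_seed(0)
B, T, d, K = 2, 5, 8, 4
V, ln = PotentialMLP(d, K, hidden=16), nn.LayerNorm(d)
h = torch.randn(B, T, d, requires_grad=True)
v = torch.zeros(B, T, d)
m = torch.ones(B, T, 1)            # placeholder masses; the model uses a surprisal lookup
xi = torch.randn(B, T, K, d)       # placeholder context channels
for _ in range(8):                 # L = 8 integration steps, one shared V
    h, v = damped_step(V, ln, xi, h, v, m)
print(h.shape, v.shape)
```

Dividing the velocity by `1 + dt*gamma` treats the damping term implicitly, which keeps the update stable for any non-negative `gamma`.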
Key Design Properties
- Globally conservative: All forces derive from $-\nabla_h V_\theta$ -- the inference trajectory conserves a well-defined Hamiltonian (up to controlled damping).
- Single shared potential: One $V_\theta$ is shared across all integration steps, preserving the "single energy field" interpretation.
- Autograd-native: Forces are computed via PyTorch autograd, not manual Jacobians.
- No attention primitive: Token interactions occur only through the causal EMA context channels, not through pairwise attention.
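One way to see what the conservative property means in practice: a force field that is the gradient of a scalar has a symmetric Jacobian (the negative Hessian of the potential), whereas a generic learned vector field does not. The sketch below checks this numerically on toy networks. The small Tanh MLPs and the dimension are illustrative assumptions and are unrelated to the trained checkpoint.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

torch.manual_seed(0)
d = 6
# Toy smooth potential R^d -> R (Tanh keeps second derivatives well defined).
potential = nn.Sequential(nn.Linear(d, 16), nn.Tanh(), nn.Linear(16, 1)).double()
# Generic vector field R^d -> R^d with no potential behind it, for comparison.
generic_field = nn.Sequential(nn.Linear(d, 16), nn.Tanh(), nn.Linear(16, d)).double()

def conservative_force(h: torch.Tensor) -> torch.Tensor:
    """f(h) = -grad_h V(h), computed with autograd."""
    energy = potential(h).sum()
    (grad_h,) = torch.autograd.grad(energy, h, create_graph=True)
    return -grad_h

h0 = torch.randn(d, dtype=torch.double)
J_cons = jacobian(conservative_force, h0)   # (d, d), equals -Hessian of V
J_gen = jacobian(generic_field, h0)         # (d, d), no symmetry constraint

asym_cons = (J_cons - J_cons.T).abs().max().item()
asym_gen = (J_gen - J_gen.T).abs().max().item()
print(f"conservative field asymmetry: {asym_cons:.2e}")
print(f"generic field asymmetry:      {asym_gen:.2e}")
```

The first number should sit at floating-point noise level, while the second is of ordinary magnitude.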
Why Not a Transformer?
The SPLM family is not based on the Transformer architecture. There are no attention layers, no key-value cache, and no feed-forward network towers. Instead, the entire model dynamics are driven by a single small scalar-potential MLP, $V_\theta$ (3-layer, 1024-hidden, ~3.4M parameters), whose gradient provides the conservative force at every integration step.
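The ~3.4M figure can be reproduced with a short calculation. It assumes the potential takes the concatenation of the K context channels and the hidden state as input (width (K+1)·d = 1280) and stacks three 1024-wide hidden layers before a scalar output. That layout is inferred from the numbers in this card, not read from the source code, and `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(in_dim: int, hidden: int, n_hidden: int, out_dim: int = 1) -> int:
    """Weights plus biases of a fully connected MLP with n_hidden hidden layers."""
    dims = [in_dim] + [hidden] * n_hidden + [out_dim]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))

d, K = 256, 4
v_params = mlp_param_count(in_dim=(K + 1) * d, hidden=1024, n_hidden=3)
embed_params = 50257 * d            # tied token embedding, GPT-2 BPE vocabulary

print(f"V_theta MLP:     {v_params:,}")      # 3,411,969
print(f"token embedding: {embed_params:,}")  # 12,865,792
```

Under these assumptions the embedding table, not the potential, accounts for most of the 16.5M total.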
Key structural differences from Transformers:
| Property | Transformer (GPT-2 small) | Multi-Xi SPLM (this model) |
|---|---|---|
| Architecture | Self-attention + FFN blocks | Scalar-potential gradient flow |
| Core computation | 50.3M (MLP) + 28.3M (attention) | 3.4M MLP |
| Runtime state per token | KV-cache grows linearly with sequence length | Fixed-size |
| Total parameters | 124M | 16.5M |
| Pairwise token interaction | attention | None (causal EMA summary only) |
Because the model carries only a fixed-size state per position, with no KV-cache, its per-position inference memory does not grow with sequence length. The accompanying figure illustrates the widening memory gap between the Transformer's linearly growing KV-cache and the SPLM's constant-size dynamic state.
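As a back-of-the-envelope illustration of that gap, the sketch below compares the KV-cache of a GPT-2-small-shaped Transformer (12 layers, width 768) with a fixed-size SPLM state. What counts as SPLM state here (`h`, `v`, and K EMA channels, each of width d) and the 2-byte element size are assumptions made for the example, not figures from the paper.

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 12, d_model: int = 768,
                   bytes_per_elem: int = 2) -> int:
    """Keys plus values for every layer and every cached position."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem

def splm_state_bytes(d: int = 256, k_channels: int = 4, bytes_per_elem: int = 2) -> int:
    """Assumed running state: h, v and K EMA channels (independent of seq_len)."""
    return (2 + k_channels) * d * bytes_per_elem

for T in (128, 1024, 8192):
    print(f"T={T:5d}  KV-cache={kv_cache_bytes(T) / 1e6:8.2f} MB  "
          f"SPLM state={splm_state_bytes() / 1e3:.2f} kB")
```

The KV-cache figure doubles whenever the context doubles; the second column stays put by construction of the assumption.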
Geometric Capabilities of Conservative Architectures
This model is fully attention-free and conservative by construction. Because all forces derive from the gradient of a scalar potential $V_\theta$, the hidden-state manifold is endowed with a natural damped Riemannian geometry (the layer-dependent Jacobi metric), which is categorically absent from Transformer architectures. This geometry opens the door to capabilities that cannot be replicated in attention-based models:
| Capability | Conservative SPLM | Transformer |
|---|---|---|
| Riemannian metric on hidden states | Layer-dependent Jacobi metric derived from $V_\theta$; confirmed positive at 100% of positions (diagnostic battery Arm 1) | No metric structure |
| Geodesics between semantic states | Damped geodesic equation with a friction term; directional cosine similarity 0.52–0.75 (Arm 2). Geodesics are asymmetric: $d(a, b) \neq d(b, a)$ | Linear interpolation only |
| Controlled energy dissipation as inference signal | Monotonic damped decay with measurable anomaly signal (Arm 4) | No conserved or tracked quantity |
| Curvature as uncertainty measure | Well-defined across all layers (Arm 3) | None |
These structural properties enable a set of native architectural features that are planned or under investigation (Section 18d and Section 23 of the paper):
- Geodesic Analogical Reasoning: Analogy completion via parallel transport of directed geodesic arcs on the semantic manifold, respecting potential barriers that linear embedding arithmetic ignores. The damped geodesic equation yields 3–20% cosine-similarity improvement over undamped (diagnostic battery Arm 2). Because damped geodesics are asymmetric, analogy transport must use directed arcs.
- Native Hallucination Detection: Energy dissipation anomalies and curvature spikes provide mechanistically grounded uncertainty signals computable at inference time without additional parameters. The smooth damping-induced energy decay is normal operation; deviation from the expected dissipation curve flags hallucination. For Fock models, the detector needs a per-model baseline that accounts for the known layer-1 exchange transient.
- Geodesic Semantic Distance: A replacement for cosine similarity that encodes the model's learned energy landscape, expected to outperform cosine on polysemy and cross-basin semantic cases. The geodesic distance is inherently asymmetric: $d(a, b) \neq d(b, a)$; a symmetrised variant is available when symmetry is desired.
- Native Chain-of-Thought (via Fock extension): The Fock-PARFLM v2.1 extends this model with register-based native CoT: reasoning steps as Fock register waypoints on damped geodesics, with zero extra token generation. The diagnostic confirms that Fock register dynamics are predominantly linear (Arm 5), supporting the geodesic-waypoint interpretation.
The conservative constraint imposes a PPL cost relative to attention, but the price buys geometric structure and interpretability that attention-based architectures are structurally incapable of providing.
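The dissipation-based detection idea can be illustrated with a toy energy trace. The sketch below integrates the damped update on a hand-written quadratic potential, records the total energy (kinetic plus potential) after each step, and flags any step whose energy rises above the previous one. The quadratic potential, step size, tolerance, synthetic "kick", and helper names are all assumptions for illustration; the paper's detector and its expected-dissipation baseline are not reproduced here.

```python
import torch

def potential(h: torch.Tensor) -> torch.Tensor:
    """Toy V(h) = 0.5 * |h|^2 (NOT the learned V_theta)."""
    return 0.5 * (h ** 2).sum()

def energy_trace(h, v, n_steps, dt=0.1, gamma=0.30, m=1.0, kick_at=None):
    """Run damped semi-implicit Euler steps and return the energy after each step."""
    energies = []
    for step in range(n_steps):
        h = h.detach().requires_grad_(True)
        (grad_h,) = torch.autograd.grad(potential(h), h)
        v = (v - dt * grad_h / m) / (1.0 + dt * gamma)
        if step == kick_at:
            v = v + 2.0 * torch.ones_like(v)     # synthetic disturbance
        h = h.detach() + dt * v
        energies.append((0.5 * m * (v ** 2).sum() + potential(h)).item())
    return energies

def flag_anomalies(energies, tol=1e-6):
    """Indices of steps where energy increased instead of dissipating."""
    return [t for t in range(1, len(energies)) if energies[t] > energies[t - 1] + tol]

h0, v0 = torch.ones(4), torch.zeros(4)
clean = energy_trace(h0, v0, n_steps=40)
kicked = energy_trace(h0, v0, n_steps=40, kick_at=5)
print(flag_anomalies(clean), flag_anomalies(kicked))   # [] [5]
```

On this toy problem the undisturbed trajectory loses energy at every step, so the only flagged index is the step where energy was injected.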
How to Get Started
# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch
import torch
import sys
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")
from multixi.model_multixi import (
ScalarPotentialLMSARFMassLNMultiXi,
SPLMSARFMassLNMultiXiConfig,
)
config = SPLMSARFMassLNMultiXiConfig(
vocab_size=50257, # GPT-2 BPE
d=256,
n_layers=8,
v_hidden=1024,
v_depth=3,
max_len=1024,
block_size=512,
gamma=0.30,
xi_channels=4,
xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
xi_learnable=True,
)
model = ScalarPotentialLMSARFMassLNMultiXi(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)
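For text generation, a greedy decoding loop might look like the sketch below. It assumes only what the forward-pass example shows: that calling the model with `targets` returns a `(logits, loss)` pair, with logits of shape (batch, time, vocab). `greedy_generate` and `_ToyLM` are hypothetical helpers, not part of the repository. Gradient tracking is left enabled on purpose, since forces are computed by autograd and the card does not say whether the model re-enables gradients internally under `torch.no_grad()`.

```python
import torch

def greedy_generate(model, idx: torch.Tensor, max_new_tokens: int,
                    block_size: int = 512) -> torch.Tensor:
    """Append max_new_tokens greedily chosen tokens to idx of shape (B, T)."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                 # crop to the context window
        logits, _ = model(idx_cond, targets=idx_cond)   # same call form as the example above
        next_id = logits[:, -1, :].detach().argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

class _ToyLM(torch.nn.Module):
    """Stand-in with the assumed (logits, loss) interface, for a dry run only."""
    def __init__(self, vocab_size: int = 11, d: int = 8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)

    def forward(self, x, targets=None):
        logits = self.emb(x) @ self.emb.weight.T        # tied embeddings
        return logits, None

prompt = torch.zeros(1, 3, dtype=torch.long)
out = greedy_generate(_ToyLM(), prompt, max_new_tokens=5)
print(out.shape)   # torch.Size([1, 8])
```

With the real model, `_ToyLM()` would be replaced by the loaded `model`, and the prompt by GPT-2 BPE token ids.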
Available Checkpoints
Two trained checkpoints are included in this repository:
| File | Steps | PPL | Description |
|---|---|---|---|
| checkpoint/model_16k.pt | 16,000 | 11.51 | Extended schedule (best) |
| checkpoint/model_8k.pt | 8,000 | 12.49 | Standard scaleup schedule |
Additional artifacts:
| File | Description |
|---|---|
| training_log_16k.jsonl / training_log_8k.jsonl | Per-step training metrics |
| loss_curve_16k.png / loss_curve_8k.png | Training/validation loss plots |
| training_summary_16k.md / training_summary_8k.md | Hyperparameters and final metrics |
| convergence_curves.png | 8k vs 16k convergence comparison |
| alpha_evolution.png | Learned alpha channel evolution |
| experiment_report.json | Full experiment report (JSON) |
To load the best checkpoint (16k steps, PPL 11.51):
from huggingface_hub import hf_hub_download
import torch
ckpt_path = hf_hub_download(
repo_id="dimitarpg13/semsimula-splm-multixi",
filename="checkpoint/model_16k.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()
Training Details
Training Data
TinyStories -- a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.
Training Procedure
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 (extended) / 8,000 (scaleup) |
| Hardware | A100 40GB (Google Colab) |
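The table translates into roughly the following PyTorch setup. This is a sketch, not the training script: the linear warmup shape, the decay to zero at the final step, and the placeholder `nn.Linear` model are assumptions, since the card only states "cosine decay" and 400 warmup steps.

```python
import math
import torch
import torch.nn as nn

TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 16_000, 400, 5e-4

def lr_scale(step: int) -> float:
    """Multiplier on PEAK_LR: linear warmup, then cosine decay (assumed to reach 0)."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

model = nn.Linear(8, 8)  # placeholder for the SPLM
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)

# One illustrative update with gradient clipping at 1.0.
loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad(set_to_none=True)
```

If batch size counts sequences, each step covers 16 × 512 = 8,192 tokens.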
Training Script
notebooks/conservative_arch/scaleup/train_splm_em_ln_multixi_scaleup.py
Colab Notebook
notebooks/conservative_arch/scaleup/colab_splm_multixi_rerun.ipynb — runs both 8k and 16k arms with live progress display, saves results to Google Drive.
Training Results
notebooks/conservative_arch/scaleup/results/splm_multixi_rerun/ — training logs, loss curves, alpha evolution plots, convergence comparison, and full experiment report.
Evaluation Results
TinyStories Validation Perplexity
| Model | PPL | Params | Gap vs Attention |
|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | -- |
| Hybrid SPLM+Attn | 8.50 | ~19.0M | +0.69 |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | +1.49 |
| Fock Attention | 9.42 | 16.7M | +1.61 |
| Multi-Xi PARFLM | 12.06 | 17.6M | +4.25 |
| Multi-Xi SPLM (this model, 16k) | 11.51 | 16.5M | +3.70 |
| Multi-Xi SPLM (8k) | 12.49 | 16.5M | +4.68 |
The Multi-Xi SPLM is the simplest model in the family (no pair forces, no registers, no attention). Extending training from the pilot (4k steps, 14.69 PPL) to 16k steps lowers perplexity to 11.51, narrowing the gap to the attention baseline from 6.88 to 3.70 PPL. The remaining gap is attributable to the conservative constraint and the absence of pairwise token interactions -- precisely the structural limitations that the PARFLM and Fock extensions are designed to address.
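Perplexity is the exponential of the mean next-token cross-entropy (in nats), so 11.51 PPL corresponds to a validation loss of about ln 11.51 ≈ 2.443. A minimal sketch of the computation follows, assuming logits of shape (batch, time, vocab) and the usual one-position shift between inputs and targets. `perplexity` is an illustrative helper, not the repository's evaluation code.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, tokens: torch.Tensor) -> float:
    """exp(mean cross-entropy) of predicting tokens[:, 1:] from logits[:, :-1]."""
    vocab = logits.size(-1)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab),   # predictions for positions 1..T-1
        tokens[:, 1:].reshape(-1),           # the tokens actually observed there
    )
    return math.exp(loss.item())

# Sanity check: uniform logits give a perplexity close to the vocabulary size.
torch.manual_seed(0)
tokens = torch.randint(0, 100, (2, 16))
uniform_logits = torch.zeros(2, 16, 100)
ppl_uniform = perplexity(uniform_logits, tokens)
print(ppl_uniform)
```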
SPLM Family Overview
This model is part of the Semantic Simulation SPLM family -- five conservative-by-construction language model variants exploring different points in the design space between pure scalar-potential dynamics and attention:
| Model | Design | Inference | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM | Pure scalar potential, K-EMA context | | this model |
| Hybrid SPLM+Attn | Attention front-end + SPLM refinement | | semsimula-hybrid-splm |
| Multi-Xi PARFLM | Scalar potential + sparse pairwise forces | | semsimula-parflm-multixi |
| Fock-PARFLM v2.1 | PARFLM + Fock register pool (mediated exchange) | | semsimula-fock-parflm |
| Fock Attention | PARFLM + direct token-to-token exchange | | semsimula-fock-attention |
Bias, Risks, and Limitations
- Research checkpoint only. This model is a proof-of-concept for the conservative language model architecture, not a production system.
- TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
- English only. No multilingual capability.
- Small scale. 16.5M parameters, 256-dim hidden states. The architecture's scaling properties to larger dimensions/datasets are unexplored.
- No safety training. No RLHF, DPO, or safety filtering has been applied.
Citation
@misc{Gueorguiev2026SemSim,
author = {Gueorguiev, Dimitar P.},
title = {Semantic Simulation: A Prescriptive Lagrangian Framework
for Efficient Semantic Inference --- A Conservative-by-
Construction Language Model and the Shared-Potential
Separator, with a Correspondence to Joint Embedding
Predictive Architectures},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19712427},
url = {https://doi.org/10.5281/zenodo.19712427},
note = {Version v15 (Jun 7, 2026).
Companion code repository (DOI 10.5281/zenodo.20579561):
\url{https://github.com/dimitarpg13/semsimula-paper}}
}
Environmental Impact
- Hardware: NVIDIA A100 40GB (Google Colab)
- Training time: ~1.1 hours (16,000 steps)
- Carbon footprint: Estimated < 1 kg CO2 (short training run on cloud GPU)
