Transformer 70M (enwik8 byte-level) — structural-comparison baseline

A 70M-parameter byte-level Transformer language model trained on enwik8 for structural comparison with Memory-NLS at matched architectural shape.

This model exists for structural differentiation, not benchmark competition. It is included in the qrv0/mnsm repository as the contrast against which the Memory-NLS architecture's structural anti-collapse property is empirically demonstrated.

What this model exhibits

During the 50,000-step training run, this model:

Reached a low validation minimum (val_ppl 2.54 at step 22,500, 45% of training)
Collapsed catastrophically between steps 28,000 and 34,000: validation perplexity spiked from 3.10 to 27.17 (an 8.8× degradation within 5,000 steps)
Recovered partially through the remaining steps but never returned to its pre-crash minimum
Ended at val_ppl 4.87 — worse than its mid-training minimum and worse than the matched-shape Memory-NLS model (val_ppl 4.27)

The collapse is consistent with the structural-realist prediction: architectures without an explicit anti-collapse mechanism are vulnerable to catastrophic loss of representational capacity during sustained training. Engineering patches (skip connections, layer normalization, gradient clipping, learning rate scheduling) defer this failure but do not remove it.

See results/08-optimization-collapse-empirical.md for the full structural finding.

Architecture

Property	Value
Parameters	71,863,296
`d_model`	768
`n_layers`	10
`n_heads`	12
`ffn_mult`	4
`max_seq_len`	1024
`vocab_size`	256 (byte-level)

Standard pre-norm Transformer with multi-head causal self-attention and feedforward MLP blocks. It uses no rotary positional embeddings, RMSNorm, SwiGLU, or other modern architectural refinements, so that it stays architecturally parallel to the Memory-NLS model it is compared against.
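As a sanity check on the table above, the parameter count can be reproduced from the listed dimensions. The sketch below assumes several details this card does not state: learned absolute position embeddings, biases on every linear layer, two LayerNorms per block plus a final LayerNorm, and an output head tied to the token embedding. `count_params` is a hypothetical helper, not part of modeling.py, which remains the authoritative definition.

```python
def count_params(d_model=768, n_layers=10, ffn_mult=4,
                 max_seq_len=1024, vocab_size=256):
    """Parameter count under assumed layout (see lead-in); illustrative only."""
    d_ff = ffn_mult * d_model
    # Q, K, V and output projections, each d_model x d_model with a bias
    attn = 4 * (d_model * d_model + d_model)
    # Two-layer MLP: d_model -> d_ff -> d_model, both with biases
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with weight and bias
    norms = 2 * 2 * d_model
    per_block = attn + mlp + norms
    # Token embedding (assumed tied with the output head) + learned positions
    embeddings = vocab_size * d_model + max_seq_len * d_model
    final_norm = 2 * d_model
    return n_layers * per_block + embeddings + final_norm


print(f"{count_params():,}")  # 71,863,296
```

Under these assumptions the arithmetic matches the reported 71,863,296 exactly. An untied output head would add another 196,608 parameters, so the match suggests (but does not prove) weight tying.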

Training

Identical infrastructure to Memory-NLS:

Dataset: enwik8 (~100MB Wikipedia byte stream)
Steps: 50,000
Sequence length: 1024
Batch size: 8
Optimizer: AdamW, β=(0.9, 0.95), weight decay 0.01
Learning rate: cosine schedule 3e-4 → 3e-5, 500 warmup steps
Precision: bfloat16 mixed
Hardware: NVIDIA RTX 4060 Laptop GPU
Wall time: 3.2 hours
Random seed: 42
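The learning-rate entry above can be sketched as a function of the step. This is one common reading of "cosine schedule with warmup" and is an assumption, not the repository's code: linear warmup from zero over the first 500 steps, then cosine decay from 3e-4 to 3e-5 over the remaining steps. The name `lr_at` is hypothetical.

```python
import math


def lr_at(step, total_steps=50_000, warmup_steps=500,
          lr_max=3e-4, lr_min=3e-5):
    """Learning rate at a given step (assumed warmup + cosine shape)."""
    if step < warmup_steps:
        # Linear warmup from 0 to lr_max
        return lr_max * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    # Cosine decay from lr_max (progress=0) to lr_min (progress=1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Inspect the schedule at a few points, including the reported collapse window
for s in (0, 500, 22_500, 28_000, 34_000, 50_000):
    print(s, f"{lr_at(s):.3e}")
```

Evaluating the schedule at the collapse window shows how large the learning rate still was when validation perplexity spiked, under the assumed schedule shape.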

Usage

import json
import importlib.util
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "qrv0/mnsm-transformer-70m-enwik8"

config_path = hf_hub_download(REPO, "config.json")
weights_path = hf_hub_download(REPO, "model.safetensors")
modeling_path = hf_hub_download(REPO, "modeling.py")

# modeling.py ships with the checkpoint; import it dynamically from the downloaded file.
spec = importlib.util.spec_from_file_location("modeling", modeling_path)
modeling = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling)

with open(config_path) as f:
    config_dict = json.load(f)

model = modeling.TransformerLanguageModel(modeling.TransformerConfig(**config_dict))
state = load_file(weights_path)
model.load_state_dict(state)
model.eval()

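# Byte-level model: the prompt is encoded as raw UTF-8 bytes, one token per byte.
prompt = "The history of "
input_ids = torch.tensor([list(prompt.encode("utf-8"))])
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=200, temperature=0.8, top_k=40)
# errors="replace" guards against output that ends in the middle of a multi-byte character.
print(bytes(out[0].tolist()).decode("utf-8", errors="replace"))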
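Because the vocabulary is the 256 possible byte values, tokenization is plain UTF-8 encoding, which is what the prompt handling above relies on. The standalone sketch below (no model required) shows the round trip and why decoding with `errors="replace"` is a sensible default: a generation can stop partway through a multi-byte character.

```python
text = "na\u00efve"  # "naive" with a diaeresis on the i; that character is 2 bytes in UTF-8

# One token id per byte, each in the range 0..255
ids = list(text.encode("utf-8"))
assert all(0 <= i < 256 for i in ids)

# Decoding the full byte sequence recovers the original string
assert bytes(ids).decode("utf-8") == text
print(len(text), len(ids))  # 5 6

# Cutting after 3 bytes splits the 2-byte character in half
truncated = ids[:3]
decoded = bytes(truncated).decode("utf-8", errors="replace")
print(repr(decoded))  # 'na\ufffd' (U+FFFD is the replacement character)
```

Without `errors="replace"`, decoding the truncated sequence would raise `UnicodeDecodeError`.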

Final evaluation

Metric	Value
Final validation perplexity	4.87
Min validation perplexity	2.54 (at step 22,500, 45% of training, pre-crash)
Final train loss	1.5121
Final val loss	1.5825
Catastrophic collapse	Step 28,000–34,000, peak val_ppl 27.17
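The perplexity and loss rows above are two views of the same quantity, assuming the reported losses are mean cross-entropy in nats per byte (the usual convention for PyTorch cross-entropy, but an assumption here). A small sketch of the conversions, including bits per byte, the metric most often quoted for enwik8; the helper names are illustrative:

```python
import math


def ppl_from_loss(loss_nats):
    """Per-byte perplexity from mean cross-entropy in nats."""
    return math.exp(loss_nats)


def bits_per_byte(loss_nats):
    """Convert nats per byte to bits per byte."""
    return loss_nats / math.log(2)


final_val_loss = 1.5825
print(round(ppl_from_loss(final_val_loss), 2))  # 4.87, matching the table
print(round(bits_per_byte(final_val_loss), 3))

# Size of the collapse: peak perplexity relative to the pre-spike value
print(round(27.17 / 3.10, 1))  # 8.8
```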

Citation

@misc{mnsm,
  title  = {Memory-Nonlinear State Models: A Memory-Augmented Nonlinear Schrödinger
            Field Equation with State Space Model Correspondence},
  author = {qrv0},
  year   = {2026},
  url    = {https://github.com/qrv0/mnsm},
  note   = {Three structural principles, one equation, seven cross-domain instantiations.}
}

Full repository: https://github.com/qrv0/mnsm
Companion Memory-NLS model: https://huggingface.co/qrv0/mnsm-memnls-70m-enwik8
Structural finding documentation: https://github.com/qrv0/mnsm/blob/main/results/08-optimization-collapse-empirical.md
License: MIT (code) + CC BY 4.0 (documentation)

Safetensors: 71.9M params, tensor type F32
