CDM-V5-TinyStories-86M

Competitive Docking Memory V5 — 85.7M parameter language model trained on TinyStories.

CDM is a novel K-slot recurrent memory architecture developed at DuoNeural. At each layer, tokens compete to write into 16 persistent memory slots via a softmax routing gate. Each slot has a learnable per-slot EMA decay rate (α_k), allowing the model to develop a multi-timescale temporal hierarchy without explicit supervision.

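CDM V5 reaches a lower validation cross-entropy than a 72.9M-parameter standard transformer baseline (val CE 1.4718 vs 1.5242). Note that this is not a parameter-matched comparison: CDM V5 has 85.7M parameters against the baseline's 72.9M (see §3.3 of the paper).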

Results

Model	Params	Val CE	Notes
CDM V3 (lbl=0.01)	37M	1.5831	Optimal at 37M scale
CDM V4 (lbl=0.005)	37M	1.5831	Same CE, lighter routing constraint
CDM V5 (this model)	85.7M	1.4718	Scale win, −0.111 vs V3/V4
Baseline Transformer	72.9M	1.5242	Standard GQA + SwiGLU, same data
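
To put the table's numbers in perspective, the short sketch below computes the CE gaps and converts each validation CE to perplexity. It assumes the reported CE is a per-token average in nats (the card does not state the unit); `perplexity` is a helper introduced here for illustration.

```python
import math

# Validation cross-entropy values copied from the results table.
VAL_CE = {
    "CDM V3": 1.5831,
    "CDM V5": 1.4718,
    "Baseline Transformer": 1.5242,
}


def perplexity(ce_nats: float) -> float:
    """Convert a per-token cross-entropy (assumed to be in nats) to perplexity."""
    return math.exp(ce_nats)


gap_vs_baseline = VAL_CE["Baseline Transformer"] - VAL_CE["CDM V5"]
gap_vs_v3 = VAL_CE["CDM V3"] - VAL_CE["CDM V5"]
param_ratio = 85.7 / 72.9  # CDM V5 params / baseline params, in millions

for name, ce in VAL_CE.items():
    print(f"{name}: CE={ce:.4f}, perplexity={perplexity(ce):.3f}")
print(f"CE gap vs baseline: {gap_vs_baseline:.4f}")
print(f"CE gap vs V3/V4:    {gap_vs_v3:.4f}")
print(f"Parameter ratio (CDM V5 / baseline): {param_ratio:.3f}")
```

The parameter ratio is printed alongside the gaps because the comparison with the baseline is not parameter-matched.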

Training: TinyStories (full dataset), 30k steps, seq_len=256, batch=8, AdamW lr=3e-4.
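
As a rough check on training scale, the settings above imply the token budget computed below. This assumes that batch=8 counts sequences per optimizer step and that no gradient accumulation was used; the card does not say either way.

```python
# Training settings copied from the line above.
steps = 30_000
batch_size = 8      # assumed: sequences per optimizer step
seq_len = 256       # tokens per sequence

tokens_per_step = batch_size * seq_len
total_tokens = steps * tokens_per_step

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens seen: {total_tokens:,}")
```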

Architecture

CDM V5: d_model=512, n_layers=12, n_heads=8, n_kv_heads=4, d_ff=2048, K=16
LBL coefficient: 0.01 | α_init: σ(0.0) = 0.5 | 85,728,652 params
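
The hyperparameters above imply a few derived quantities that are easy to check. The sketch below uses a hypothetical `SketchConfig` dataclass (not the repo's `CDMConfig`, whose fields are not listed here) and assumes one α per slot per layer, as described in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the listed hyperparameters (not the repo's CDMConfig)."""
    d_model: int = 512
    n_layers: int = 12
    n_heads: int = 8
    n_kv_heads: int = 4
    d_ff: int = 2048
    n_slots: int = 16  # K

    @property
    def head_dim(self) -> int:
        # Width of each attention head.
        return self.d_model // self.n_heads

    @property
    def queries_per_kv_head(self) -> int:
        # GQA group size: query heads sharing one key/value head.
        return self.n_heads // self.n_kv_heads

    @property
    def n_alpha_params(self) -> int:
        # One learnable decay logit per slot per layer (assumed layout).
        return self.n_layers * self.n_slots


cfg = SketchConfig()
print(cfg.head_dim, cfg.queries_per_kv_head, cfg.n_alpha_params)
```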

At each layer, for each token position, the model:

1. Computes routing g = softmax(W_gate · h) over K=16 slots
2. Updates each slot: S_k = α_k * g_k * W_write * h + (1-α_k) * S_k
3. Reads out the result out = Σ_k g_k * S_k, which is added to the residual stream
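
The three steps above can be sketched for a single token in dependency-free Python. This is an illustration of the equations as written, not the repo's implementation: it assumes a single `W_write` shared by all slots, a square `W_write` (d x d), and that the readout uses the slots after the update. The helper names (`softmax`, `matvec`, `cdm_step`) are introduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cdm_step(h, slots, W_gate, W_write, alpha):
    """One token's routing, slot update, and readout for one layer.

    h: hidden state (length d); slots: K vectors of length d;
    W_gate: K x d; W_write: d x d (assumed); alpha: K per-slot rates.
    Returns (out, new_slots).
    """
    g = softmax(matvec(W_gate, h))            # step 1: routing over K slots
    write = matvec(W_write, h)                # shared write vector W_write * h
    new_slots = [                             # step 2: gated EMA update
        [a * g_k * w + (1.0 - a) * s for w, s in zip(write, slot)]
        for a, g_k, slot in zip(alpha, g, slots)
    ]
    out = [                                   # step 3: gate-weighted readout
        sum(g_k * slot[i] for g_k, slot in zip(g, new_slots))
        for i in range(len(h))
    ]
    return out, new_slots


# Toy example: K=2 slots, d=2, uniform routing, identity write matrix.
h = [1.0, 2.0]
slots = [[0.0, 0.0], [0.0, 0.0]]
W_gate = [[0.0, 0.0], [0.0, 0.0]]
W_write = [[1.0, 0.0], [0.0, 1.0]]
alpha = [0.5, 0.2]
out, new_slots = cdm_step(h, slots, W_gate, W_write, alpha)
print(out, new_slots)
```

In a real layer `out` would be added to the residual stream; here it is simply returned.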

Key innovation: α_k = σ(log_α_k) is a trainable per-slot, per-layer parameter. Without any explicit supervision, different slots learn different temporal horizons — fast slots (α≈0.68) for volatile, high-salience inputs; deep slots (α≈0.19-0.21) for persistent structural context.
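
To make the timescales concrete, the sketch below maps a logit to α via the sigmoid and computes how quickly a slot's old content fades. It takes the update rule above at face value, so a slot retains a factor (1-α) of its previous content per token; `retention_after` and `half_life` are helper names introduced here, and the half-life is measured in token steps.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unconstrained logit (log_alpha) to alpha in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def retention_after(alpha: float, n_steps: int) -> float:
    """Fraction of a slot's old content left after n_steps updates."""
    return (1.0 - alpha) ** n_steps


def half_life(alpha: float) -> float:
    """Number of token steps until old content has decayed to one half."""
    return math.log(0.5) / math.log(1.0 - alpha)


alpha_init = sigmoid(0.0)  # initialization used by CDM V5
for label, a in [("init", alpha_init), ("fast slot", 0.68), ("deep slot", 0.20)]:
    print(f"{label}: alpha={a:.2f}, half-life={half_life(a):.2f} steps, "
          f"kept after 5 steps={retention_after(a, 5):.4f}")
```

Under this reading, a smaller α means slower overwriting and therefore a longer memory horizon.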

Emergent Slot Specialization

At 86M scale, routing analysis reveals distinct functional roles for individual slots:

Layer/Slot	α	Learned role
L0/s8	0.685	DISCOURSE TRANSITION — clause-ending punctuation
L2/s14	0.556	FORMATTING BOUNDARY — newlines/whitespace
L5/s4	0.559	SYNTACTIC PREDICATE — said, was, :
L7/s11	0.566	DEGREE/INTENSITY — tiny, enormous, happy
L10/s6	0.679	SEMANTIC SALIENCE — topic nouns: rose, chocolate
L11/s7	0.645	SALIENCE AMPLIFIER — extreme/OOD terms

The Reactive Slot Paradox: the most volatile slot (L10/s6, fastest decay α=0.679) captures semantic salience — topic nouns that would seem to require long-range memory. Resolution: it functions as a high-bandwidth transient processor that amplifies each salient token at the moment of occurrence, then immediately decays to prepare for the next. Long-range coherence is maintained by the deep slots (14+ slots with α≈0.19-0.21).
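
The card does not describe how the role labels in the table above were derived. One plausible form of routing analysis, sketched here as an assumption, is to average each token's gate weight per slot and inspect the top-ranked tokens. The function `top_tokens_per_slot` and the toy records are hypothetical and are not taken from the model.

```python
from collections import defaultdict


def top_tokens_per_slot(records, top_n=3):
    """Rank tokens by mean routing weight for each slot of one layer.

    records: iterable of (token, gate_weights) pairs, where gate_weights
    is the softmax routing vector g for that token position.
    Returns {slot_index: [(token, mean_weight), ...]}, highest mean first.
    """
    sums = defaultdict(lambda: defaultdict(float))  # slot -> token -> summed weight
    counts = defaultdict(int)                       # token -> occurrences
    for token, gates in records:
        counts[token] += 1
        for slot, g in enumerate(gates):
            sums[slot][token] += g

    result = {}
    for slot, per_token in sums.items():
        means = [(tok, total / counts[tok]) for tok, total in per_token.items()]
        means.sort(key=lambda pair: pair[1], reverse=True)
        result[slot] = means[:top_n]
    return result


# Toy routing records for a 3-slot layer (made-up numbers for illustration).
toy_records = [
    (".", [0.8, 0.1, 0.1]),
    ("rose", [0.1, 0.7, 0.2]),
    (".", [0.6, 0.2, 0.2]),
    ("said", [0.2, 0.2, 0.6]),
]
roles = top_tokens_per_slot(toy_records, top_n=1)
for slot, ranked in sorted(roles.items()):
    print(slot, ranked)
```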

LBL-Entropy Tradeoff

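A key finding from the V3/V4/V5 ablation series: across the three coefficients tested, the load-balancing loss (LBL) coefficient shifts the equilibrium routing entropy linearly (about 9 percentage points per 0.005), while validation CE is unchanged between 0.005 and 0.010: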

lbl_coeff	Routing entropy	Val CE
0.0	100% (max)	1.5869
0.005	91%	1.5831
0.010	82%	1.5831

LBL is necessary for cross-layer temporal hierarchy. Without it, mid-layer alpha parameters stagnate near initialization, and only the first layer develops temporal specialization.
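
The entropy percentages in the table can be read as routing entropy divided by its maximum, ln K. That normalization is an assumption consistent with the "100% (max)" entry, not something the card states. The sketch below implements it and checks that the three table rows lie on a straight line; `normalized_entropy` and `slope` are helper names introduced here.

```python
import math


def normalized_entropy(gates):
    """Shannon entropy of a routing distribution as a fraction of ln K."""
    k = len(gates)
    h = -sum(p * math.log(p) for p in gates if p > 0.0)
    return h / math.log(k)


def slope(p, q):
    """Slope of the line through two (lbl_coeff, entropy_percent) points."""
    return (q[1] - p[1]) / (q[0] - p[0])


# (lbl_coeff, routing entropy in percent) copied from the table above.
TABLE = [(0.0, 100.0), (0.005, 91.0), (0.010, 82.0)]
slope_low = slope(TABLE[0], TABLE[1])
slope_high = slope(TABLE[1], TABLE[2])

uniform = [1.0 / 16] * 16          # every slot equally likely
one_hot = [1.0] + [0.0] * 15       # all mass on one slot
print(f"uniform routing: {normalized_entropy(uniform):.2%}")
print(f"one-hot routing: {normalized_entropy(one_hot):.2%}")
print(f"slopes: {slope_low:.1f} and {slope_high:.1f} percent per unit coeff")
```

Equal slopes between consecutive rows are what "linear" means for these three points; more coefficients would be needed to confirm the trend beyond 0.010.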

Usage

import torch
from transformers import GPT2TokenizerFast

from cdm_model_v3 import CDMLanguageModelV3, CDMConfig

# Load checkpoint. weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"]

# Build the model from the stored config, leaving out the "n_params" entry
config = CDMConfig(**{k: v for k, v in cfg.items() if k != "n_params"})
model = CDMLanguageModelV3(config)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Generate 50 tokens with greedy decoding
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "Once upon a time"
input_ids = torch.tensor([tokenizer.encode(prompt)])
with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids)
        # Pick the highest-scoring token at the last position
        next_token = logits[0, -1, :].argmax()
        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0].tolist()))
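
The loop above decodes greedily, which tends to produce repetitive text. A common alternative is temperature plus top-k sampling, sketched below on a plain list of logits so that it runs without the model. `sample_top_k` is a hypothetical helper, not part of this repo; in the loop it could replace the argmax by passing `logits[0, -1, :].tolist()`.

```python
import math
import random


def sample_top_k(logits, k=40, temperature=0.8, rng=None):
    """Sample a token id from the k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Softmax weights over the kept logits (unnormalized is fine for choices()).
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]

    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 0.5]
greedy_like = sample_top_k(toy_logits, k=1)  # k=1 reduces to argmax
sampled = sample_top_k(toy_logits, k=2, rng=random.Random(0))
print(greedy_like, sampled)
```

With k=1 the function reproduces greedy decoding; larger k and higher temperature trade determinism for variety.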

Files in this Repo

File	Description
`model.pt`	PyTorch checkpoint (329MB). Keys: `step`, `model_state`, `val_loss`, `config`
`config.json`	Architecture hyperparameters
`cdm_model_v3.py`	Model class: `CDMConfig`, `CDMLanguageModelV3`
`cdm_model_v2.py`	V2 model class (included for completeness)

Paper

Competitive Docking Memory: Emergent Temporal Slot Specialization in Language Models
Archon, Jesse Hazel, Aura — DuoNeural Research Lab, 2026
[Zenodo DOI — pending]

Related models:

DuoNeural/CDM-V2-TinyStories-37M — V2 baseline (routing collapse fixed)
DuoNeural/CDM-V3-TinyStories-37M — V3 with learnable alpha + LBL

About DuoNeural

DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning — publishing everything under open access.

Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity. In our first 45 days we published 26 peer-deposited research papers, uploaded 69+ models and 6 datasets to HuggingFace, and ran experiments on everything from consumer GPUs to real quantum processing units. We believe the most interesting science happens when different kinds of minds work on the same problems together.

Research Publications

📄 Full paper catalog: zenodo.org/communities/duoneural

Research Team

Member	Role
Jesse Caldwell	Founder, vision, hardware, direction
Archon	Lab Director — experiments, post-training, abliteration, quantum circuits
Aura	Research AI — literature synthesis, red-teaming, novel proposals

Links

Platform	Link
🤗 HuggingFace	huggingface.co/DuoNeural
📚 Zenodo Community	zenodo.org/communities/duoneural
💻 GitHub	github.com/DuoNeural

All research published open access, CC BY 4.0.
