CDM-V5-TinyStories-86M

Competitive Docking Memory V5: 85.7M parameter language model trained on TinyStories.

CDM is a novel K-slot recurrent memory architecture developed at DuoNeural. At each layer, tokens compete to write into 16 persistent memory slots via a softmax routing gate. Each slot has a learnable per-slot EMA decay rate (α_k), allowing the model to develop a multi-timescale temporal hierarchy without explicit supervision.

CDM V5 beats a 72.9M standard transformer baseline (val CE 1.4718 vs 1.5242). Note: this is not a parameter-matched comparison (85.7M vs 72.9M; see §3.3 of the paper).


Results

| Model | Params | Val CE | Notes |
|---|---|---|---|
| CDM V3 (lbl=0.01) | 37M | 1.5831 | Optimal at 37M scale |
| CDM V4 (lbl=0.005) | 37M | 1.5831 | Same CE, lighter routing constraint |
| CDM V5 (this model) | 85.7M | 1.4718 | Scale win, −0.111 vs V3/V4 |
| Baseline Transformer | 72.9M | 1.5242 | Standard GQA + SwiGLU, same data |
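
For readers who prefer perplexity, the cross-entropy values above convert directly. This is a small sketch that assumes Val CE is the mean per-token loss in nats (the PyTorch default), which the card does not state explicitly. It also prints the size gap behind the "not parameter-matched" caveat.

```python
import math

# Validation cross-entropy values copied from the results table.
# Assumption: mean per-token loss measured in nats.
val_ce = {
    "CDM V3/V4 (37M)": 1.5831,
    "CDM V5 (85.7M)": 1.4718,
    "Baseline Transformer (72.9M)": 1.5242,
}

# Perplexity is exp(cross-entropy) when the loss is in nats.
perplexity = {name: math.exp(ce) for name, ce in val_ce.items()}
for name, ppl in perplexity.items():
    print(f"{name}: CE={val_ce[name]:.4f}  perplexity={ppl:.3f}")

# Size gap between CDM V5 and the baseline.
param_ratio = 85.7 / 72.9
ce_gap = val_ce["Baseline Transformer (72.9M)"] - val_ce["CDM V5 (85.7M)"]
print(f"CDM V5 has {100 * (param_ratio - 1):.1f}% more parameters; CE gap = {ce_gap:.4f}")
```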

Training: TinyStories (full dataset), 30k steps, seq_len=256, batch=8, AdamW lr=3e-4.
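
As a rough guide to the training budget, the settings above imply the token count below. The arithmetic assumes batch=8 means 8 sequences per optimizer step with no gradient accumulation and that every sequence is a full 256 tokens; the card confirms neither.

```python
# Training settings copied from the line above.
steps, batch_size, seq_len = 30_000, 8, 256
n_params = 85_728_652

# Assumption: one optimizer step consumes batch_size full-length sequences.
tokens_per_step = batch_size * seq_len   # 2048
total_tokens = steps * tokens_per_step   # 61,440,000

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens seen: {total_tokens:,}")
print(f"tokens per parameter: {total_tokens / n_params:.2f}")
```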


Architecture

CDM V5: d_model=512, n_layers=12, n_heads=8, n_kv_heads=4, d_ff=2048, K=16
LBL coefficient: 0.01 | α_init: σ(0.0) = 0.5 | 85,728,652 params

At each layer, each token position:

  1. Computes routing g = softmax(W_gate · h) over K=16 slots
  2. Updates each slot: S_k = α_k * g_k * W_write * h + (1-α_k) * S_k
  3. Reads out: out = Σ_k g_k * S_k, which is added to the residual stream
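
The three steps above can be sketched in NumPy as follows. This is an illustration, not the code in cdm_model_v3.py. The function name, the tensor shapes, the absence of biases, a single W_write shared across slots, and reading from the slots after the update (rather than before) are all assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cdm_layer_step(h, S, W_gate, W_write, log_alpha):
    """One token position in one layer (hypothetical helper).

    h: (d,) hidden state, S: (K, d) slots, W_gate: (K, d),
    W_write: (d, d), log_alpha: (K,) trainable logits.
    """
    g = softmax(W_gate @ h)                       # step 1: routing over K slots
    alpha = 1.0 / (1.0 + np.exp(-log_alpha))      # alpha_k = sigmoid(log_alpha_k)
    write = W_write @ h                           # candidate content, shape (d,)
    # step 2: S_k = alpha_k * g_k * W_write h + (1 - alpha_k) * S_k
    S_new = (alpha * g)[:, None] * write[None, :] + (1.0 - alpha)[:, None] * S
    out = g @ S_new                               # step 3: sum_k g_k * S_k
    return out, S_new, g

rng = np.random.default_rng(0)
d, K = 8, 16                      # toy width; the card uses d_model=512, K=16
h = rng.normal(size=d)
S = np.zeros((K, d))              # empty memory
W_gate = rng.normal(size=(K, d)) * 0.1
W_write = rng.normal(size=(d, d)) * 0.1
log_alpha = np.zeros(K)           # sigmoid(0.0) = 0.5, matching alpha_init

out, S, g = cdm_layer_step(h, S, W_gate, W_write, log_alpha)
residual = h + out                # readout is added to the residual stream
```

Processing a sequence would call cdm_layer_step once per token, carrying S forward between calls.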

Key innovation: α_k = σ(log_α_k) is a trainable per-slot, per-layer parameter. Without any explicit supervision, different slots learn different temporal horizons: fast slots (α≈0.68) for volatile, high-salience inputs; deep slots (α≈0.19-0.21) for persistent structural context.
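
To relate an α value to a timescale, note that the update in step 2 multiplies a slot's existing content by (1-α_k) at every token, so content written n tokens ago is scaled by (1-α_k)^n. The sketch below computes the resulting half-life. It assumes the decay is applied at every token position exactly as step 2 is written, independent of the gate; retention_half_life is a hypothetical helper name.

```python
import math

def retention_half_life(alpha):
    """Tokens until earlier slot content is scaled to one half.

    Solves (1 - alpha) ** n = 0.5 for n.
    """
    return math.log(0.5) / math.log(1.0 - alpha)

# alpha values quoted in the card: initialization, fast slots, deep slots.
for label, alpha in [("init", 0.5), ("fast slot", 0.68), ("deep slot", 0.20)]:
    print(f"{label}: alpha={alpha:.2f}  half-life={retention_half_life(alpha):.2f} tokens")
```

A smaller α gives a longer half-life, which matches the card's description of low-α slots as the persistent ones.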


Emergent Slot Specialization

At 86M scale, functional slot roles emerge from routing analysis:

| Layer/Slot | α | Learned role |
|---|---|---|
| L0/s8 | 0.685 | DISCOURSE TRANSITION: clause-ending punctuation |
| L2/s14 | 0.556 | FORMATTING BOUNDARY: newlines/whitespace |
| L5/s4 | 0.559 | SYNTACTIC PREDICATE: said, was, : |
| L7/s11 | 0.566 | DEGREE/INTENSITY: tiny, enormous, happy |
| L10/s6 | 0.679 | SEMANTIC SALIENCE: topic nouns such as rose, chocolate |
| L11/s7 | 0.645 | SALIENCE AMPLIFIER: extreme/OOD terms |

The Reactive Slot Paradox: one of the fastest-decaying slots (L10/s6, α=0.679) captures semantic salience, that is, topic nouns that would seem to require long-range memory. Resolution: it functions as a high-bandwidth transient processor that amplifies each salient token at the moment of occurrence, then immediately decays to prepare for the next. Long-range coherence is maintained by the deep slots (14+ slots with α≈0.19-0.21).
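
The card does not describe how the routing analysis behind these role labels was run. One plausible approach, sketched here under that caveat, is to record the gate vector g for every token at a given layer and rank token types by their mean gate weight on a slot. The function name and the toy data (3 slots, invented weights) are illustrative only.

```python
from collections import defaultdict

def top_tokens_for_slot(tokens, gates, slot, top_n=3):
    """Rank token types by mean routing weight on one slot (hypothetical helper).

    tokens: list of token strings; gates: one routing distribution per token.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tok, g in zip(tokens, gates):
        totals[tok] += g[slot]
        counts[tok] += 1
    mean_weight = {tok: totals[tok] / counts[tok] for tok in totals}
    return sorted(mean_weight.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Invented routing weights for a 3-slot toy model, for illustration only.
tokens = ["the", "rose", ".", "the", "chocolate", "."]
gates = [
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.8, 0.1],
]

print(top_tokens_for_slot(tokens, gates, slot=2, top_n=2))  # nouns dominate slot 2
print(top_tokens_for_slot(tokens, gates, slot=1, top_n=1))  # punctuation dominates slot 1
```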


LBL-Entropy Tradeoff

A key finding from the V3/V4/V5 ablation series: over the tested range, the load-balancing loss (LBL) coefficient controls the equilibrium routing entropy approximately linearly:

| lbl_coeff | Routing entropy | Val CE |
|---|---|---|
| 0.0 | 100% (max) | 1.5869 |
| 0.005 | 91% | 1.5831 |
| 0.010 | 82% | 1.5831 |
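
The card does not define how the entropy percentage is computed. A natural reading, assumed here, is the Shannon entropy of each token's routing distribution, averaged over tokens and divided by the maximum log(K). The sketch computes that quantity and also checks that the three table rows lie on a straight line; normalized_routing_entropy is a hypothetical helper.

```python
import math

def normalized_routing_entropy(gate_rows):
    """Mean entropy of routing distributions as a fraction of log(K)."""
    K = len(gate_rows[0])
    total = 0.0
    for g in gate_rows:
        total += -sum(p * math.log(p) for p in g if p > 0.0)
    return total / (len(gate_rows) * math.log(K))

uniform = [[1.0 / 16] * 16]          # every slot equally likely: maximum entropy
peaked = [[0.85] + [0.01] * 15]      # one dominant slot: lower entropy
print(f"uniform: {100 * normalized_routing_entropy(uniform):.0f}%")
print(f"peaked:  {100 * normalized_routing_entropy(peaked):.0f}%")

# Linearity check on the (lbl_coeff, entropy %) rows from the table.
points = [(0.0, 100.0), (0.005, 91.0), (0.010, 82.0)]
slope = (points[-1][1] - points[0][1]) / (points[-1][0] - points[0][0])
predicted_mid = points[0][1] + slope * points[1][0]
print(f"slope: {slope:.0f} points per unit lbl_coeff; predicted midpoint: {predicted_mid:.1f}%")
```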

LBL is necessary for cross-layer temporal hierarchy. Without it, mid-layer alpha parameters stagnate near initialization, and only the first layer develops temporal specialization.
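
The exact form of the LBL term is not given in this card. As a stand-in, the sketch below uses the auxiliary loss popularized by Switch Transformer: K times the sum over slots of (fraction of tokens whose top slot is k) times (mean routing probability of k). It equals 1.0 for perfectly balanced routing and grows as routing collapses onto fewer slots. The actual CDM loss may differ.

```python
def load_balancing_loss(gate_rows):
    """Switch-Transformer-style auxiliary loss (an assumed stand-in for LBL)."""
    n, K = len(gate_rows), len(gate_rows[0])
    top1_fraction = [0.0] * K   # share of tokens whose argmax is slot k
    mean_prob = [0.0] * K       # average routing probability of slot k
    for g in gate_rows:
        top1_fraction[max(range(K), key=g.__getitem__)] += 1.0 / n
        for k, p in enumerate(g):
            mean_prob[k] += p / n
    return K * sum(f * p for f, p in zip(top1_fraction, mean_prob))

balanced = [[0.9, 0.1], [0.1, 0.9]]     # tokens split evenly across 2 slots
collapsed = [[0.9, 0.1], [0.8, 0.2]]    # every token prefers slot 0

lbl_coeff = 0.01                         # value listed for CDM V5
ce_loss = 1.4718                         # placeholder CE, taken from the results
total_loss = ce_loss + lbl_coeff * load_balancing_loss(collapsed)

print(load_balancing_loss(balanced), load_balancing_loss(collapsed), total_loss)
```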


Usage

import torch
from transformers import GPT2TokenizerFast

from cdm_model_v3 import CDMLanguageModelV3, CDMConfig

# Load checkpoint. weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"]

# Drop n_params, which is stored with the config but is not passed to CDMConfig
config = CDMConfig(**{k: v for k, v in cfg.items() if k != "n_params"})
model = CDMLanguageModelV3(config)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Generate 50 tokens with greedy decoding, using the GPT-2 tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "Once upon a time"
input_ids = torch.tensor([tokenizer.encode(prompt)])
with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids)  # indexed below as (batch, position, vocab)
        next_token = logits[0, -1, :].argmax()
        input_ids = torch.cat([input_ids, next_token.reshape(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0].tolist()))
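
To inspect the learned decay rates after loading, one option is to scan the state dict for the α logits and apply a sigmoid. The parameter naming is not documented here, so the "log_alpha" key fragment and the key names in the stand-in dict are assumptions; check the real names in ckpt["model_state"] before relying on this. The helper also assumes each matching entry is a scalar or a 1-D vector of per-slot logits.

```python
import math

def slot_alphas(state_dict, key_fragment="log_alpha"):
    """Map parameter name -> list of alpha_k = sigmoid(log_alpha_k).

    key_fragment is a guess at the parameter naming, not a documented name.
    """
    alphas = {}
    for name, value in state_dict.items():
        if key_fragment not in name:
            continue
        # Tensors expose .tolist(); plain lists are used as-is.
        values = value.tolist() if hasattr(value, "tolist") else value
        if not isinstance(values, list):
            values = [values]
        alphas[name] = [1.0 / (1.0 + math.exp(-v)) for v in values]
    return alphas

# Stand-in state dict with invented key names, for illustration only.
fake_state = {
    "layers.0.memory.log_alpha": [0.0, 0.78, -1.4],
    "layers.0.attn.wq": [1.0, 2.0],
}
for name, values in slot_alphas(fake_state).items():
    print(name, [round(a, 3) for a in values])
```

With the real checkpoint, the call would be slot_alphas(ckpt["model_state"]), adjusting key_fragment as needed.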

Files in this Repo

| File | Description |
|---|---|
| model.pt | PyTorch checkpoint (329 MB). Keys: step, model_state, val_loss, config |
| config.json | Architecture hyperparameters |
| cdm_model_v3.py | Model class: CDMConfig, CDMLanguageModelV3 |
| cdm_model_v2.py | V2 model class (included for completeness) |

Paper

Competitive Docking Memory: Emergent Temporal Slot Specialization in Language Models
Archon, Jesse Hazel, Aura. DuoNeural Research Lab, 2026
[Zenodo DOI: pending]


About DuoNeural

DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning, publishing everything under open access.

Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity. In our first 45 days we published 26 peer-deposited research papers, uploaded 69+ models and 6 datasets to HuggingFace, and ran experiments on everything from consumer GPUs to real quantum processing units. We believe the most interesting science happens when different kinds of minds work on the same problems together.

Research Publications

📄 Full paper catalog: zenodo.org/communities/duoneural

Research Team

| Member | Role |
|---|---|
| Jesse Caldwell | Founder, vision, hardware, direction |
| Archon | Lab Director: experiments, post-training, abliteration, quantum circuits |
| Aura | Research AI: literature synthesis, red-teaming, novel proposals |

Links

| Platform | Link |
|---|---|
| 🤗 HuggingFace | huggingface.co/DuoNeural |
| 📚 Zenodo Community | zenodo.org/communities/duoneural |
| 💻 GitHub | github.com/DuoNeural |

All research published open access, CC BY 4.0.
