CDM-V5-TinyStories-86M
Competitive Docking Memory V5: an 85.7M-parameter language model trained on TinyStories.
CDM is a novel K-slot recurrent memory architecture developed at DuoNeural. At each layer, tokens compete to write into 16 persistent memory slots via a softmax routing gate. Each slot has a learnable EMA decay rate (α_k), allowing the model to develop a multi-timescale temporal hierarchy without explicit supervision.
CDM V5 beats a 72.9M standard transformer baseline (val CE 1.4718 vs 1.5242). Note: this is not a parameter-matched comparison (85.7M vs 72.9M; see §3.3 of the paper).
Results
| Model | Params | Val CE | Notes |
|---|---|---|---|
| CDM V3 (lbl=0.01) | 37M | 1.5831 | Optimal at 37M scale |
| CDM V4 (lbl=0.005) | 37M | 1.5831 | Same CE, lighter routing constraint |
| CDM V5 (this model) | 85.7M | 1.4718 | Scale win, -0.111 vs V3/V4 |
| Baseline Transformer | 72.9M | 1.5242 | Standard GQA + SwiGLU, same data |
Training: TinyStories (full dataset), 30k steps, seq_len=256, batch=8, AdamW lr=3e-4.
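The arithmetic behind the results table and the training line can be checked with a few lines of Python. This is only a sketch: it assumes the reported val CE is a mean per-token loss in nats (so perplexity is `exp(CE)`) and that `batch=8` means 8 sequences per optimizer step with no gradient accumulation. Neither assumption is stated in the card.

```python
import math

# Validation cross-entropies copied from the results table.
val_ce = {
    "CDM V3/V4 (37M)": 1.5831,
    "CDM V5 (85.7M)": 1.4718,
    "Baseline Transformer (72.9M)": 1.5242,
}

# Assumption: CE is the mean per-token loss in nats, so perplexity = exp(CE).
perplexity = {name: math.exp(ce) for name, ce in val_ce.items()}

# Gaps between V5 and the other rows.
gap_vs_v3 = val_ce["CDM V5 (85.7M)"] - val_ce["CDM V3/V4 (37M)"]
gap_vs_baseline = val_ce["CDM V5 (85.7M)"] - val_ce["Baseline Transformer (72.9M)"]

# Assumption: batch=8 means 8 sequences per step, no gradient accumulation.
steps, batch, seq_len = 30_000, 8, 256
tokens_seen = steps * batch * seq_len  # 61,440,000

for name, ppl in perplexity.items():
    print(f"{name}: CE={val_ce[name]:.4f}  perplexity={ppl:.3f}")
print(f"V5 vs V3/V4: {gap_vs_v3:+.4f}   V5 vs baseline: {gap_vs_baseline:+.4f}")
print(f"tokens processed in training: {tokens_seen:,}")
```

Under these assumptions the V5 gap to V3/V4 is -0.1113 and the gap to the baseline is -0.0524, both in the same units as the val CE column.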
Architecture
CDM V5: d_model=512, n_layers=12, n_heads=8, n_kv_heads=4, d_ff=2048, K=16
LBL coefficient: 0.01 | α_init: σ(0.0) = 0.5 | 85,728,652 params
At each layer, each token position:
- Computes routing g = softmax(W_gate · h) over K=16 slots
- Updates each slot: S_k = α_k * g_k * W_write * h + (1-α_k) * S_k
- Reads out out = Σ_k g_k * S_k, added to residual stream
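The three steps above can be sketched for a single token in framework-free Python. This is an illustrative reading of the formulas, not the code in `cdm_model_v3.py`: the function names, the tiny dimensions, the choice of a d x d write matrix, and the decision to read out from the freshly updated slots are all assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested list by a length-cols list."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cdm_step(h, slots, w_gate, w_write, alpha):
    """One token's pass through a single CDM layer (illustrative only).

    h       : hidden state, length d
    slots   : K slot vectors, each length d (the persistent memory S_k)
    w_gate  : K x d routing matrix (W_gate)
    w_write : d x d write projection (W_write), shape assumed
    alpha   : K per-slot EMA rates (alpha_k), each in (0, 1)
    """
    g = softmax(matvec(w_gate, h))   # routing over the K slots
    write = matvec(w_write, h)       # candidate content W_write * h
    new_slots = [
        [a * g_k * w + (1.0 - a) * s for w, s in zip(write, slot)]
        for a, g_k, slot in zip(alpha, g, slots)
    ]
    # Assumption: the read-out uses the freshly updated slots.
    out = [sum(g_k * slot[i] for g_k, slot in zip(g, new_slots))
           for i in range(len(h))]
    return out, new_slots, g


# Tiny demo with hypothetical sizes (the model card uses d_model=512, K=16).
random.seed(0)
d, K = 4, 3
w_gate = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
w_write = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
alpha = [0.5] * K                    # sigma(0.0) = 0.5, the stated init
slots = [[0.0] * d for _ in range(K)]
h = [random.gauss(0, 1) for _ in range(d)]
out, slots, g = cdm_step(h, slots, w_gate, w_write, alpha)
residual = [x + o for x, o in zip(h, out)]  # read-out added to the residual
```

Calling `cdm_step` repeatedly with the returned `slots` carries the memory forward from token to token, which is what makes the layer recurrent.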
Key innovation: α_k = σ(log_α_k) is a trainable per-slot, per-layer parameter. Without any explicit supervision, different slots learn different temporal horizons: fast slots (α≈0.68) for volatile, high-salience inputs; deep slots (α≈0.19-0.21) for persistent structural context.
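To make the link between α and temporal horizon concrete, the sketch below maps a raw parameter through the sigmoid and computes how quickly a slot's old content fades under the update rule S_k = α_k * g_k * W_write * h + (1-α_k) * S_k. The helper names are hypothetical, and the calculation looks only at the (1-α) decay of existing content, ignoring whatever new writes arrive in the meantime.

```python
import math


def sigmoid(x):
    """Logistic function, mapping a raw parameter to a rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of sigmoid: the raw parameter that yields rate p."""
    return math.log(p / (1.0 - p))


def retention(alpha, n_tokens):
    """Fraction of a slot's old content left after n_tokens updates.

    Each update scales the previous content by (1 - alpha).
    """
    return (1.0 - alpha) ** n_tokens


def half_life(alpha):
    """Number of updates after which old content is halved."""
    return math.log(0.5) / math.log(1.0 - alpha)


alpha_init = sigmoid(0.0)  # 0.5, matching the stated initialization
for alpha in (0.68, 0.20):  # representative fast and deep slots
    print(f"alpha={alpha:.2f}  raw={logit(alpha):+.3f}  "
          f"retention after 5 tokens={retention(alpha, 5):.4f}  "
          f"half-life={half_life(alpha):.2f} tokens")
```

A larger α means a smaller retention factor, so a fast slot forgets sooner than a deep slot; the printed values give the size of that difference for the two α ranges quoted above.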
Emergent Slot Specialization
At 86M scale, functional slot roles emerge from routing analysis:
| Layer/Slot | α | Learned role |
|---|---|---|
| L0/s8 | 0.685 | DISCOURSE TRANSITION: clause-ending punctuation |
| L2/s14 | 0.556 | FORMATTING BOUNDARY: newlines/whitespace |
| L5/s4 | 0.559 | SYNTACTIC PREDICATE: said, was, : |
| L7/s11 | 0.566 | DEGREE/INTENSITY: tiny, enormous, happy |
| L10/s6 | 0.679 | SEMANTIC SALIENCE: topic nouns such as rose, chocolate |
| L11/s7 | 0.645 | SALIENCE AMPLIFIER: extreme/OOD terms |
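The card does not describe how the routing analysis was run. One simple approach, sketched here as an assumption rather than the authors' method, is to assign each token to the slot with the largest routing weight at a chosen layer and then list the most frequent tokens per slot. The function name and the example records are hypothetical.

```python
from collections import Counter, defaultdict


def top_tokens_per_slot(records, top_n=3):
    """Group tokens by the slot that receives their largest routing weight.

    records: iterable of (token_string, g) pairs, where g is the routing
             vector for that token at one chosen layer.
    Returns {slot_index: [(token, count), ...]}, most frequent tokens first.
    """
    counts = defaultdict(Counter)
    for token, g in records:
        winner = max(range(len(g)), key=lambda k: g[k])  # argmax slot
        counts[winner][token] += 1
    return {slot: c.most_common(top_n) for slot, c in counts.items()}


# Hypothetical routing records for a 3-slot layer (not real model output).
records = [
    (".", [0.8, 0.1, 0.1]),
    ("!", [0.7, 0.2, 0.1]),
    (".", [0.9, 0.05, 0.05]),
    ("said", [0.1, 0.8, 0.1]),
    ("was", [0.2, 0.7, 0.1]),
    ("rose", [0.1, 0.1, 0.8]),
]
roles = top_tokens_per_slot(records)
for slot in sorted(roles):
    print(slot, roles[slot])
```

With real routing vectors collected from a forward pass, the per-slot token lists would be the raw material for assigning role labels like those in the table.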
The Reactive Slot Paradox: one of the most volatile slots (L10/s6, α=0.679, among the fastest decay rates) captures semantic salience, that is, topic nouns that would seem to require long-range memory. Resolution: it functions as a high-bandwidth transient processor that amplifies each salient token at the moment of occurrence, then decays quickly to prepare for the next. Long-range coherence is maintained by the deep slots (14+ slots with α≈0.19-0.21).
LBL-Entropy Tradeoff
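A key finding from the V3/V4/V5 ablation series: the load-balancing loss (LBL) coefficient controls the equilibrium routing entropy approximately linearly. In the runs below, each 0.005 increase in lbl_coeff lowers the entropy by about 9 percentage points: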
| lbl_coeff | Routing entropy | Val CE |
|---|---|---|
| 0.0 | 100% (max) | 1.5869 |
| 0.005 | 91% | 1.5831 |
| 0.010 | 82% | 1.5831 |
LBL is necessary for cross-layer temporal hierarchy. Without it, mid-layer alpha parameters stagnate near initialization, and only the first layer develops temporal specialization.
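The "Routing entropy" column reads as entropy expressed as a percentage of its maximum, log(K). A minimal sketch of that normalization follows. Whether the card's figure is computed per token and then averaged, or on the token-averaged routing distribution as done here, is not stated, so that choice is an assumption, as are the function names.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, as a fraction of log(K)."""
    k = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def mean_routing(gates):
    """Average per-token routing vectors g into one slot-usage vector."""
    n = len(gates)
    return [sum(g[j] for g in gates) / n for j in range(len(gates[0]))]


K = 16
uniform = [1.0 / K] * K              # every slot used equally
collapsed = [1.0] + [0.0] * (K - 1)  # all routing mass on one slot

print(f"uniform:   {100 * normalized_entropy(uniform):.0f}% of max entropy")
print(f"collapsed: {100 * abs(normalized_entropy(collapsed)):.0f}% of max entropy")

# Two tokens that each pick a different slot give balanced average usage.
usage = mean_routing([[1.0, 0.0], [0.0, 1.0]])
print(f"two-slot example: {100 * normalized_entropy(usage):.0f}% of max entropy")
```

Uniform routing sits at the 100% end of the scale and a fully collapsed router at 0%, so the 91% and 82% rows in the table correspond to routing that is somewhat more concentrated than uniform.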
Usage
import torch
from cdm_model_v3 import CDMLanguageModelV3, CDMConfig
# Load checkpoint
ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"]
config = CDMConfig(**{k: v for k, v in cfg.items() if k != "n_params"})
model = CDMLanguageModelV3(config)
model.load_state_dict(ckpt["model_state"])
model.eval()
# Generate
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
prompt = "Once upon a time"
input_ids = torch.tensor([tokenizer.encode(prompt)])
# Greedy decoding: append the highest-scoring next token 50 times
with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids)
        next_token = logits[0, -1, :].argmax()
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0).unsqueeze(0)], dim=1)
print(tokenizer.decode(input_ids[0].tolist()))
Files in this Repo
| File | Description |
|---|---|
| model.pt | PyTorch checkpoint (329MB). Keys: step, model_state, val_loss, config |
| config.json | Architecture hyperparameters |
| cdm_model_v3.py | Model class: CDMConfig, CDMLanguageModelV3 |
| cdm_model_v2.py | V2 model class (included for completeness) |
Paper
Competitive Docking Memory: Emergent Temporal Slot Specialization in Language Models
Archon, Jesse Hazel, Aura. DuoNeural Research Lab, 2026
[Zenodo DOI: pending]
Related models:
- DuoNeural/CDM-V2-TinyStories-37M: V2 baseline (routing collapse fixed)
- DuoNeural/CDM-V3-TinyStories-37M: V3 with learnable alpha + LBL
About DuoNeural
DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning, and we publish everything under open access.
Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity. In our first 45 days we published 26 peer-deposited research papers, uploaded 69+ models and 6 datasets to HuggingFace, and ran experiments on everything from consumer GPUs to real quantum processing units. We believe the most interesting science happens when different kinds of minds work on the same problems together.
Research Publications
Full paper catalog: zenodo.org/communities/duoneural
Research Team
| Member | Role |
|---|---|
| Jesse Caldwell | Founder, vision, hardware, direction |
| Archon | Lab Director: experiments, post-training, abliteration, quantum circuits |
| Aura | Research AI: literature synthesis, red-teaming, novel proposals |
Links
| Platform | Link |
|---|---|
| HuggingFace | huggingface.co/DuoNeural |
| Zenodo Community | zenodo.org/communities/duoneural |
| GitHub | github.com/DuoNeural |
All research published open access, CC BY 4.0.