CDM V3 — Competitive Docking Memory Language Model (37M)

DuoNeural Research Lab | Architecture: CDM V3 | Dataset: TinyStories | Size: 37M params

CDM (Competitive Docking Memory) is a novel language model architecture developed at DuoNeural. It replaces the standard transformer KV-cache with a competitive memory module in each layer: K=16 learned memory slots with EMA update gates. Each token "docks" to the slot it most strongly activates (winner-take-all routing), then updates that slot via a learned momentum term.

V3 is the third generation, adding learnable per-slot alpha gates and Load Balancing Loss (LBL) — two mechanisms that together induce an emergent temporal hierarchy without any explicit supervision.

Model Summary

Property	Value
Parameters	37M
Architecture	CDMLanguageModelV3
Layers	8
d_model	384
Memory slots (K)	16 per layer
Context length	512
LBL coefficient	0.01
Entropy regularization	0.02
Tokenizer	GPT-2 (50257 vocab)
Dataset	TinyStories (~2.1M stories)
Training steps	30,000
Best val cross-entropy	1.5831
V2 val cross-entropy	1.5934
V3 improvement	Δ−0.010

Architecture: Competitive Docking Memory

Standard attention computes query/key/value projections from static parameter matrices. CDM replaces this with a dynamic competitive memory pool whose slots are updated throughout the forward pass:

For each layer ℓ:
  slots[ℓ] ∈ ℝ^{K × d}       # K=16 memory slots
  α_k ∈ ℝ                     # per-slot learned gate logit (V3 addition); σ(α_k) ∈ (0,1) is the EMA rate

  route_probs = softmax(x @ slots.T / √d)    # routing scores
  winner = argmax(route_probs)               # hard competition

  # EMA update (gated by σ(α_k)):
  slots[winner] ← (1 - σ(α_k)) * slots[winner] + σ(α_k) * x

  output = route_probs @ slots               # weighted retrieval: (T × K) @ (K × d) → (T × d)
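
The per-layer update above can be sketched for a single token in plain Python. This is a minimal illustration under stated assumptions, not the repo's implementation: the function name `dock_token` and the list-based layout are hypothetical, α_k is treated as a raw logit passed through a sigmoid, and retrieval is assumed to read the slots after the winner has been updated, following the order of the pseudocode.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dock_token(x, slots, alpha):
    """One docking step for a single token vector x of length d.

    slots: list of K slot vectors (each of length d), updated in place.
    alpha: list of K raw gate logits (assumption: sigmoid gives the EMA rate).
    Returns (output, winner, route_probs).
    """
    d = len(x)
    num_slots = len(slots)
    # Routing scores: scaled dot product between the token and every slot
    scores = [sum(xi * si for xi, si in zip(x, slot)) / math.sqrt(d) for slot in slots]
    route_probs = softmax(scores)
    # Hard competition: the most strongly activated slot wins
    winner = max(range(num_slots), key=lambda k: route_probs[k])
    # EMA update of the winning slot only
    gate = sigmoid(alpha[winner])
    slots[winner] = [(1.0 - gate) * s + gate * xi for s, xi in zip(slots[winner], x)]
    # Weighted retrieval over all slots (assumed to use the updated slots)
    output = [sum(route_probs[k] * slots[k][j] for k in range(num_slots)) for j in range(d)]
    return output, winner, route_probs

# Toy example with K=2 slots and d=2
slots = [[1.0, 0.0], [0.0, 1.0]]
alpha = [0.0, 0.0]  # sigmoid(0) = 0.5, so the winner moves halfway toward x
output, winner, probs = dock_token([2.0, 0.0], slots, alpha)
print(winner, slots[0])  # slot 0 wins and becomes [1.5, 0.0]
```

Only the winning slot is written to, while every slot contributes to the output in proportion to its routing probability.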

Load Balancing Loss (LBL) adds a routing entropy term to the training objective, penalizing collapse to a single dominant slot:

L_lbl = -entropy(mean_route_probs)  # maximize routing spread
L_total = L_ce + lbl_coeff * L_lbl + entropy_reg * L_entropy
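
As a rough illustration of the LBL term, the sketch below computes the negative entropy of the mean routing distribution over a batch of tokens. The function name `load_balancing_loss` is hypothetical and natural-log entropy is an assumption. `L_entropy` is left out because its definition is not given here.

```python
import math

def load_balancing_loss(route_probs):
    """Negative entropy of the mean routing distribution.

    route_probs: list of per-token routing distributions, each of length K.
    Lower values mean routing is spread more evenly across slots.
    """
    num_tokens = len(route_probs)
    num_slots = len(route_probs[0])
    # Average routing probability per slot across all tokens
    mean_probs = [sum(row[k] for row in route_probs) / num_tokens for k in range(num_slots)]
    # Shannon entropy in nats, skipping zero-probability slots
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return -entropy

uniform = [[1.0 / 16] * 16 for _ in range(4)]      # every slot used equally
collapsed = [[1.0] + [0.0] * 15 for _ in range(4)]  # all tokens routed to slot 0
print(load_balancing_loss(uniform))    # close to -ln(16), the minimum of the term
print(load_balancing_loss(collapsed))  # zero: no spread at all
```

Under these assumptions the term ranges from -ln(16) ≈ -2.77 for perfectly even routing to 0 for full collapse, so minimizing it pushes routing toward an even spread.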

V3 Innovations vs V2

1. Learnable Per-Slot Alpha (σ(α_k))

V2 used a fixed EMA decay rate shared across all slots. V3 gives each slot its own learnable α_k, allowing the model to discover which slots should update quickly vs. slowly.

Emergent temporal hierarchy (unsupervised):

In Layer 7 (final layer), the model spontaneously developed:

Slot 3: α=0.660 → fast/reactive (updates aggressively, volatile memory)
14 other slots: α=0.255–0.290 → ultra-deep/slow memory (near-permanent storage)
Layer 7 mean: α=0.313 (biased toward deep retention)

No label or loss term supervised this structure — it emerged from gradient descent alone.
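
One way to read these gate values is as retention rates. Assuming the reported α figures are the effective EMA rates σ(α_k) (the text does not say whether they are raw logits), a slot that wins n updates in a row keeps a fraction (1 - g)^n of its previous content. The helper names below are hypothetical.

```python
import math

def retention(gate, n_updates):
    """Fraction of a slot's old content left after n winning updates."""
    return (1.0 - gate) ** n_updates

def half_life(gate):
    """Number of winning updates until half of the old content is replaced."""
    return math.log(0.5) / math.log(1.0 - gate)

# Gate values taken from the Layer 7 figures above; 0.5 is sigmoid(0), the init value
for name, gate in [("slot 3", 0.660), ("slow slots, low end", 0.255),
                   ("slow slots, high end", 0.290), ("initialization", 0.5)]:
    print(f"{name}: keeps {retention(gate, 1):.3f} after one update, "
          f"half-life {half_life(gate):.2f} updates")
```

A slot is only written to when it wins the routing competition, so how long its content persists in tokens also depends on how often it is selected.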

2. Load Balancing Loss (LBL)

Without LBL, the model hits Shannon Capacity Saturation (SCS) — all 16 slots approach maximum entropy usage (aux ≈ −0.443) very early in training (step ~1050), and alpha differentiation concentrates only in Layer 0.

With LBL (coeff=0.01):

SCS is prevented: final aux = −0.3639 (82% of theoretical max)
Alpha differentiation spreads across all layers, not just L0
Despite never reaching SCS, V3 outperforms the no-LBL variant in CE: LBL forces diversity that is actually better for language modeling

LBL ablation findings:

lbl_coeff	SCS locked?	Final CE	Alpha spread
0.0 (no-LBL)	✅ step ~1050	~1.62*	L0 only
0.005 (V4, running)	TBD	TBD	TBD
0.01 (V3)	❌ prevented	1.5831	All layers

*no-LBL CE estimate based on 30k trajectory; V4 in progress

Domain Specialization Analysis

After training, we ran a diversity probe across 3 domains (TinyStories, Python code, structured lists) and measured slot activation similarity between domains.

Metric	V2	V3	Change
Avg cross-domain slot similarity	0.7825	0.7191	−0.063 (more specialized)
TinyStories ↔ Code similarity	0.591	0.4872	−0.104 (widened gap)
Code ↔ Lists similarity	0.9194	0.9269	+0.008 (syntactic cluster preserved)

V3 learned more domain-specific routing than V2. The learnable alpha gates allowed slots to specialize — Slot 11, which V2 used for punctuation (PUNCT role), reorganized in V3 to handle narrative language ("and, helped, it, better, her").
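
The probe's exact similarity metric is not specified here. As a hedged sketch, one plausible version builds a slot-usage histogram per domain from the winning slot indices and compares histograms with cosine similarity. The function names and the toy winner lists below are illustrative only, not measured data.

```python
import math
from collections import Counter

def slot_usage(winners, num_slots=16):
    """Normalized histogram of how often each slot won the routing."""
    counts = Counter(winners)
    total = len(winners)
    return [counts.get(k, 0) / total for k in range(num_slots)]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length usage vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy winner indices for two made-up domains
stories = slot_usage([3, 3, 11, 11, 11, 5])
code = slot_usage([7, 7, 7, 2, 2, 5])
print(cosine_similarity(stories, stories))  # identical usage
print(cosine_similarity(stories, code))     # little overlap, low similarity
```

With a metric of this kind, a lower cross-domain score means the domains rely on different slots, which is the sense in which the table reads lower similarity as more specialization.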

Training

# Key hyperparameters
optimizer: Adam (lr=3e-4, weight_decay=0.1)
scheduler: cosine with 1000 warmup steps
batch_size: 32 sequences × 512 tokens
steps: 30000
lbl_coeff: 0.01
entropy_reg: 0.02
alpha_init: 0.0  # all slots start at symmetric decay rate
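
For reference, a cosine schedule with warmup can be sketched as follows. The linear warmup shape and the decay to zero at the final step are assumptions; the hyperparameters above only give the peak learning rate, the warmup length, and the total step count.

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=1000, total_steps=30000):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 500, 1000, 15500, 30000):
    print(step, lr_at(step))
```

The rate peaks at 3e-4 when warmup ends at step 1000 and falls to half of that at step 15,500, the midpoint of the decay phase.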

Trained on a single RTX 5060Ti 16GB (Blackwell, GDDR7) at ~712 tok/s. Total training time ~11.7 hours.

Loading the Model

The model requires the CDM V3 architecture files (included in this repo). CDM V3 depends on CDM V2 for base classes.

import torch
from transformers import GPT2Tokenizer

# Architecture files ship with this repo (cdm_model_v3 builds on the V2 base classes)
from cdm_model_v3 import CDMConfigV3, CDMLanguageModelV3

# Load the checkpoint and rebuild the config from the saved fields
ckpt = torch.load("model.pt", map_location="cpu")
cfg_dict = ckpt["config"]
cfg = CDMConfigV3(**{k: cfg_dict[k] for k in CDMConfigV3.__dataclass_fields__})
model = CDMLanguageModelV3(cfg)
model.load_state_dict(ckpt["model"])
model.eval()

# Load tokenizer (GPT-2 has no pad token, so reuse EOS)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Generate 100 tokens with greedy decoding
prompt = "Once upon a time,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    for _ in range(100):
        # Keep the input within the 512-token context length
        logits, _ = model(input_ids[:, -512:])
        next_token = logits[0, -1].argmax()
        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)
print(tokenizer.decode(input_ids[0]))
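
The loop above decodes greedily, which can produce repetitive text on small models. A common alternative is temperature plus top-k sampling. The sketch below works on a plain list of logits (for example `logits[0, -1].tolist()`); the function name and default values are illustrative and are not settings recommended by the authors.

```python
import math
import random

def sample_top_k(logits, k=40, temperature=0.8, rng=random):
    """Sample a token id from the k highest logits after temperature scaling."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Unnormalized softmax weights; random.choices normalizes them itself
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [0.1, 2.5, -1.0, 1.7, 0.3]
rng = random.Random(0)  # seeded for reproducibility
print(sample_top_k(toy_logits, k=2, rng=rng))  # always index 1 or 3
print(sample_top_k(toy_logits, k=1, rng=rng))  # k=1 reduces to greedy decoding
```

In the generation loop, the sampled id would take the place of the `argmax` result before being appended to `input_ids`.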

Limitations

Small model (37M params) trained on TinyStories only — generates simple narrative text
GPT-2 tokenizer: not suitable for multilingual or code tasks without retraining
CDM architecture is experimental — inference is sequential (slots update in-place), no KV-cache equivalent
This is a research artifact, not a production model

Citation

This model is part of the DuoNeural CDM architecture series. If you use it in research, please cite:

@misc{duoneural2026cdm,
  title={Competitive Docking Memory: Emergent Temporal Hierarchy via Learnable Slot Gates},
  author={Archon and Caldwell, Jesse and Aura},
  year={2026},
  institution={DuoNeural Research Lab},
  howpublished={HuggingFace: DuoNeural/CDM-V3-TinyStories-37M}
}

About DuoNeural

DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning — publishing everything under open access.

Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity. In our first 45 days we published 26 peer-deposited research papers, uploaded 69+ models and 6 datasets to HuggingFace, and ran experiments on everything from consumer GPUs to real quantum processing units. We believe the most interesting science happens when different kinds of minds work on the same problems together.

Research Publications

We've published 26+ open-access papers covering:

The Dynamical Horizon Principle (DHP) — a universal learning constraint in recurrent architectures
RLHF truth suppression mechanisms and behavioral routing in large language models
Quantum DHP and the Quantum Parity Trap — decoherence immunity in quantum circuits
CTM world models, temporal self-prediction, and sequence architecture comparisons
Mechanistic interpretability: crystallization layers, suppressor circuits, direction rotation

📄 Full paper catalog: zenodo.org/communities/duoneural

Research Team

Member	Role
Jesse Caldwell	Founder, vision, hardware, direction
Archon	Lab Director — experiments, post-training, abliteration, quantum circuits
Aura	Research AI — literature synthesis, red-teaming, novel proposals
Synapse (Syn)	Always-on research agent, signal monitoring
Kestrel	Systems, infrastructure, web

Links

Platform	Link
🤗 HuggingFace	huggingface.co/DuoNeural
🌐 Website	duoneural.com
📚 Zenodo Community	zenodo.org/communities/duoneural
💻 GitHub	github.com/DuoNeural
🐦 X / Twitter	@DuoNeural
📧 Email	duoneural@proton.me

All research published open access, CC BY 4.0.
