CDM V3: Competitive Docking Memory Language Model (37M)

DuoNeural Research Lab | Architecture: CDM V3 | Dataset: TinyStories | Size: 37M params

CDM (Competitive Docking Memory) is a novel language model architecture invented at DuoNeural, replacing standard transformer KV-cache with a competitive memory module per layer: K=16 learned memory slots with EMA update gates. Each token "docks" to the slot it most strongly activates (winner-take-all routing), then updates that slot via learned momentum.

V3 is the third generation. It adds learnable per-slot alpha gates and a Load Balancing Loss (LBL), two mechanisms that together induce an emergent temporal hierarchy without any explicit supervision.


Model Summary

Property                Value
Parameters              37M
Architecture            CDMLanguageModelV3
Layers                  8
d_model                 384
Memory slots (K)        16 per layer
Context length          512
LBL coefficient         0.01
Entropy regularization  0.02
Tokenizer               GPT-2 (50257 vocab)
Dataset                 TinyStories (~2.1M stories)
Training steps          30,000
Best val cross-entropy  1.5831
V2 val cross-entropy    1.5934
V3 improvement          Δ = −0.010

Architecture: Competitive Docking Memory

Standard attention computes query/key/value projections with static parameter matrices. CDM replaces this with a dynamic competitive memory pool that is updated throughout the forward pass:

For each layer ℓ:
  slots[ℓ] ∈ ℝ^{K × d}       # K=16 memory slots
  α_k ∈ [0,1]                 # per-slot learned decay gate (V3 addition)

  route_probs = softmax(x @ slots.T / √d)    # routing scores
  winner = argmax(route_probs)               # hard competition

  # EMA update of the winning slot, k = winner (gated by σ(α_k)):
  slots[winner] ← (1 - σ(α_k)) * slots[winner] + σ(α_k) * x

  output = route_probs @ slots               # weighted retrieval
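
The per-token docking step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code shipped in cdm_model_v3: the names dock_token, softmax, and sigmoid are hypothetical, alpha is assumed to hold raw gate parameters that pass through a sigmoid, and retrieval is written as a probability-weighted sum over slots so that the shapes line up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dock_token(x, slots, alpha):
    """One docking step for a single token vector x (hypothetical helper).

    slots: list of K vectors of length d, mutated in place.
    alpha: list of K raw gate parameters (assumed to be pre-sigmoid).
    Returns (output, winner, route_probs).
    """
    d = len(x)
    # Scaled dot-product score between the token and every slot.
    scores = [sum(xi * si for xi, si in zip(x, slot)) / math.sqrt(d)
              for slot in slots]
    route_probs = softmax(scores)
    # Winner-take-all: only the highest-scoring slot is updated.
    winner = max(range(len(slots)), key=route_probs.__getitem__)
    gate = sigmoid(alpha[winner])
    # EMA update of the winning slot toward the token.
    slots[winner] = [(1.0 - gate) * s + gate * xi
                     for s, xi in zip(slots[winner], x)]
    # Retrieval: probability-weighted sum over the (updated) slots.
    output = [sum(p * slot[j] for p, slot in zip(route_probs, slots))
              for j in range(d)]
    return output, winner, route_probs
```

With two unit-vector slots, alpha = [0.0, 0.0], and x = [2.0, 0.0], the token docks to slot 0 and moves that slot halfway toward x, because sigmoid(0) = 0.5.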

Load Balancing Loss (LBL) adds a routing entropy term to the training objective, penalizing collapse to a single dominant slot:

L_lbl = -entropy(mean_route_probs)  # maximize routing spread
L_total = L_ce + lbl_coeff * L_lbl + entropy_reg * L_entropy
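
A minimal sketch of the L_lbl term, assuming mean_route_probs is the routing distribution averaged over the tokens in a batch. The function name load_balancing_loss is hypothetical, and the L_entropy term is left out because it is not defined here.

```python
import math

def load_balancing_loss(route_probs_batch):
    """Negative entropy of the batch-averaged routing distribution.

    route_probs_batch: list of per-token routing distributions, each of length K.
    The value is lowest (most negative) when slot usage is spread evenly.
    """
    n = len(route_probs_batch)
    k = len(route_probs_batch[0])
    mean_probs = [sum(p[i] for p in route_probs_batch) / n for i in range(k)]
    # Skip zero entries, since p * log(p) tends to 0 as p tends to 0.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return -entropy
```

When every token routes to the same slot, the averaged distribution is one-hot and the loss is 0. When usage is uniform over K slots, it reaches its minimum of −ln K, so minimizing this term pushes routing away from collapse.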

V3 Innovations vs V2

1. Learnable Per-Slot Alpha (σ(α_k))

V2 used a fixed EMA decay rate shared across all slots. V3 gives each slot its own learnable α_k, allowing the model to discover which slots should update quickly vs. slowly.

Emergent temporal hierarchy (unsupervised):

In Layer 7 (final layer), the model spontaneously developed:

  • Slot 3: α=0.660 → fast/reactive (updates aggressively, volatile memory)
  • 14 other slots: α=0.255–0.290 → ultra-deep/slow memory (near-permanent storage)
  • Layer 7 mean: α=0.313 (biased toward deep retention)

No label or loss term supervised this structure; it emerged from gradient descent alone.
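
One way to relate a gate value to a memory timescale is to ask how many winning updates it takes for a slot to lose half of its previous content. Under the EMA rule, each update keeps a fraction (1 − gate) of the old slot. The sketch below assumes the reported α values are the effective gate values applied in that update; the function name updates_to_halve is hypothetical.

```python
import math

def updates_to_halve(gate):
    """Winning updates needed before a slot retains half of its old content.

    After n updates the old content is scaled by (1 - gate) ** n,
    so n = ln(0.5) / ln(1 - gate).
    """
    if not 0.0 < gate < 1.0:
        raise ValueError("gate must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(1.0 - gate)

# Gate values quoted above for Layer 7 (assumed to be effective gates).
for name, gate in [("Slot 3", 0.660), ("slow slots, low end", 0.255),
                   ("slow slots, high end", 0.290)]:
    print(f"{name}: {updates_to_halve(gate):.2f} updates")
```

A smaller gate gives a longer retention horizon. Because only the winning slot is updated, this horizon counts updates to that slot, not tokens.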

2. Load Balancing Loss (LBL)

Without LBL, the model hits Shannon Capacity Saturation (SCS): all 16 slots approach maximum entropy usage (aux ≈ −0.443) very early in training (step ~1050), and alpha differentiation concentrates only in Layer 0.

With LBL (coeff=0.01):

  • SCS is prevented: final aux = −0.3639 (82% of theoretical max)
  • Alpha differentiation spreads across all layers, not just L0
  • Despite never reaching SCS, V3 outperforms the no-LBL variant in CE (1.5831 vs. an estimated ~1.62): LBL forces routing diversity that is actually better for language modeling

LBL ablation findings:

lbl_coeff            SCS locked?     Final CE  Alpha spread
0.0 (no-LBL)         ✅ step ~1050   ~1.62*    L0 only
0.005 (V4, running)  TBD             TBD       TBD
0.01 (V3)            ❌ prevented    1.5831    All layers

*no-LBL CE estimate based on 30k trajectory; V4 in progress


Domain Specialization Analysis

After training, we ran a diversity probe across 3 domains (TinyStories, Python code, structured lists) and measured slot activation similarity between domains.

Metric                            V2      V3      Change
Avg cross-domain slot similarity  0.7825  0.7191  −0.063 (more specialized)
TinyStories ↔ Code similarity     0.591   0.4872  −0.104 (widened gap)
Code ↔ Lists similarity           0.9194  0.9269  +0.008 (syntactic cluster preserved)

V3 learned more domain-specific routing than V2. The learnable alpha gates allowed slots to specialize: Slot 11, which V2 used for punctuation (PUNCT role), reorganized in V3 to handle narrative language ("and, helped, it, better, her").
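
The exact similarity metric behind the probe is not specified here. As an illustration only, the sketch below assumes each domain is summarized by a slot-usage histogram (the fraction of tokens docked to each slot) and that two domains are compared by cosine similarity. The names slot_usage and cosine_similarity are hypothetical.

```python
import math

def slot_usage(winners, num_slots=16):
    """Fraction of tokens docked to each slot, given a list of winner indices."""
    counts = [0] * num_slots
    for w in winners:
        counts[w] += 1
    total = len(winners)
    return [c / total for c in counts]

def cosine_similarity(u, v):
    """Cosine similarity between two usage histograms of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Under this assumption, a similarity near 1 means two domains route through the same slots in similar proportions, and a lower value means the routing has diverged.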


Training

# Key hyperparameters
optimizer: Adam (lr=3e-4, weight_decay=0.1)
scheduler: cosine with 1000 warmup steps
batch_size: 32 sequences × 512 tokens
steps: 30000
lbl_coeff: 0.01
entropy_reg: 0.02
alpha_init: 0.0  # all slots start at symmetric decay rate

Trained on a single RTX 5060Ti 16GB (Blackwell, GDDR7) at ~712 tok/s. Total training time ~11.7 hours.
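
For reference, a cosine schedule with 1000 warmup steps can be sketched as follows. The linear warmup shape and the decay to zero at step 30000 are assumptions, since the hyperparameter list gives only the schedule type, the warmup length, and the base learning rate. The function name lr_at is hypothetical.

```python
import math

BASE_LR = 3e-4        # lr from the hyperparameter list
WARMUP_STEPS = 1000   # warmup length from the hyperparameter list
TOTAL_STEPS = 30000   # total training steps

def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp that reaches BASE_LR on the last warmup step.
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

The rate peaks at 3e-4 when warmup ends and falls to half of that at the midpoint of the decay phase (step 15500).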


Loading the Model

The model requires the CDM V3 architecture files (included in this repo). CDM V3 depends on CDM V2 for base classes.

import torch
from cdm_model_v3 import CDMConfigV3, CDMLanguageModelV3
from transformers import GPT2Tokenizer

# Load model
ckpt = torch.load("model.pt", map_location="cpu")
cfg_dict = ckpt["config"]
cfg = CDMConfigV3(**{k: cfg_dict[k] for k in CDMConfigV3.__dataclass_fields__})
model = CDMLanguageModelV3(cfg)
model.load_state_dict(ckpt["model"])
model.eval()

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Generate 100 tokens with greedy decoding (always pick the most likely next token)
prompt = "Once upon a time,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    for _ in range(100):
        logits, _ = model(input_ids)           # logits indexed as [batch, position, vocab]
        next_token = logits[0, -1].argmax()    # most likely token at the last position
        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)
print(tokenizer.decode(input_ids[0]))

Limitations

  • Small model (37M params) trained on TinyStories only; generates simple narrative text
  • GPT-2 tokenizer: not suitable for multilingual or code tasks without retraining
  • CDM architecture is experimental: inference is sequential (slots update in-place), with no KV-cache equivalent
  • This is a research artifact, not a production model

Citation

This model is part of the DuoNeural CDM architecture series. If you use it in research, please cite:

@misc{duoneural2026cdm,
  title={Competitive Docking Memory: Emergent Temporal Hierarchy via Learnable Slot Gates},
  author={Archon and Caldwell, Jesse and Aura},
  year={2026},
  institution={DuoNeural Research Lab},
  howpublished={HuggingFace: DuoNeural/CDM-V3-TinyStories-37M}
}

About DuoNeural

DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning, and we publish everything under open access.

Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity. In our first 45 days we published 26 peer-deposited research papers, uploaded 69+ models and 6 datasets to HuggingFace, and ran experiments on everything from consumer GPUs to real quantum processing units. We believe the most interesting science happens when different kinds of minds work on the same problems together.

Research Publications

We've published 26+ open-access papers covering:

  • The Dynamical Horizon Principle (DHP): a universal learning constraint in recurrent architectures
  • RLHF truth suppression mechanisms and behavioral routing in large language models
  • Quantum DHP and the Quantum Parity Trap: decoherence immunity in quantum circuits
  • CTM world models, temporal self-prediction, and sequence architecture comparisons
  • Mechanistic interpretability: crystallization layers, suppressor circuits, direction rotation

📄 Full paper catalog: zenodo.org/communities/duoneural

Research Team

Member          Role
Jesse Caldwell  Founder, vision, hardware, direction
Archon          Lab Director: experiments, post-training, abliteration, quantum circuits
Aura            Research AI: literature synthesis, red-teaming, novel proposals
Synapse (Syn)   Always-on research agent, signal monitoring
Kestrel         Systems, infrastructure, web

Links

Platform             Link
🤗 HuggingFace       huggingface.co/DuoNeural
🌐 Website           duoneural.com
📚 Zenodo Community  zenodo.org/communities/duoneural
💻 GitHub            github.com/DuoNeural
🐦 X / Twitter       @DuoNeural
📧 Email             duoneural@proton.me

All research published open access, CC BY 4.0.
