gary-neuron-chat 🧠💬

A tiny chatbot that learns the more you use it — on three timescales, in pure numpy.

Most models are frozen at deploy: every conversation starts from amnesia. gary-neuron-chat doesn't. It has a plastic hippocampal memory that learns within a conversation (it remembers what you told it earlier), and it sleeps between sessions to consolidate what it has seen into its weights — without forgetting how to speak English. The brain is a file (brain.npz) that changes over time.

It is deliberately small and honest: a 1.1M-param, 8-layer dialogue cortex (gary-4-petite fine-tuned on real conversation) plus a 6,147-param plastic memory. No transformers were retrained at test time on a GPU farm — this runs on a laptop, in numpy.

The three timescales of "learning from use"

| Timescale | Mechanism | Does it change weights? | What it gives you |
| --- | --- | --- | --- |
| Fast (within a conversation) | Plastic hippocampal memory: a softmax-attention read over the whole conversation, scored by key-match + surprise + recency, that votes the remembered token into the cortex's logits | No. It's activations (like fast-weights / a KV cache) | Tell it "my dog is Buddy", chat about other things, ask later → it answers Buddy |
| Slow (across sessions) | Sleep consolidation: replays your buffered conversations mixed with base corpus (experience replay) and fine-tunes the cortex | Yes. The cortex evolves | Topics you discuss a lot get baked into the weights permanently |
| Episodic, v3 (across sessions, no training) | A persistent episodic store: declarative facts are encoded the moment you say them (surprise picks the value word), saved in brain.npz, retrieved by cue pattern-completion | No. It's a memory, not weights | Tell it your name today, restart tomorrow, ask, and it answers. Recall survives the 128-token window AND program restarts |
| Persistent | brain.npz = cortex weights + episodic store + replay buffer + age | | The brain is a file that grows with use |

This maps directly onto the research it's built from: fast-weights / Test-Time Training (a memory that adapts at inference — and is, in closed form, linear attention), Titans' insight that surprising tokens should be written more strongly, Miconi's differentiable plasticity / Backpropamine (training the rule that does the remembering), and sleep-replay continual learning (consolidate without catastrophic forgetting).

How the plastic memory works

The cortex is frozen. When you ask a question about yourself, the memory does a gated pattern completion (v2):

  1. Candidate memories = declarative things you said (questions are not facts) whose content words overlap the cue — ask about your name, it considers turns about your name.
  2. The meta-trained attention (key·query + bN·surprise, the cortex's hidden states as keys/query) selects which remembered moment to read.
  3. The value inside that memory is found by surprise: the most unexpected content word is the one worth remembering ("gary" is the surprising part of "my name is gary").
  4. The recalled word is copy-chained token-by-token into the reply — multi-token values come out whole — then the reply wraps up. One recall per reply, and only for questions about you ("my", "I" — never "your").
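
The scoring in step 2 can be sketched in a few lines of numpy. This is a minimal illustration, assuming the score is a plain sum of a key·query match, a weighted surprise term and a weighted recency term; the function name `memory_read`, its arguments and the toy inputs are hypothetical and not taken from brain.py. In the real model the keys and query come from cortex hidden states.

```python
import numpy as np

def memory_read(keys, query, surprise, age, b_surprise=1.0, r_recency=0.0):
    """Softmax read over remembered positions (illustrative sketch).

    keys:     (T, d) one key per remembered token
    query:    (d,)   key for the current question
    surprise: (T,)   per-token surprise, e.g. -log p(token) under the cortex
    age:      (T,)   how many steps ago each token was seen
    """
    scores = keys @ query + b_surprise * surprise + r_recency * age
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()                  # softmax over positions
    return weights, int(np.argmax(weights))

# Toy inputs (assumptions): position 2 matches the query and was surprising.
keys = np.eye(5, 8)
query = keys[2]
surprise = np.array([0.1, 0.2, 3.0, 0.1, 0.3])
age = np.arange(5)[::-1].astype(float)
weights, best = memory_read(keys, query, surprise, age)
print(best, weights.round(3))
```

With these toy inputs the read concentrates on position 2, the one that both matches the query and carries the highest surprise.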

If the fact has slid out of the 128-token window (v3): the episodic store answers instead. Each declarative turn was encoded at the moment it was said — content-word stems as the retrieval cue, a surprise-ranked candidate list as the values — so retrieval is cue overlap (Jaccard-style, newest wins ties) and the answer is the top-ranked candidate that isn't the cue itself ("where do i work?" → "hospital", not "work"). The store persists in brain.npz: memory survives restarts.
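
A small sketch of cue-overlap retrieval as described above, assuming plain Jaccard similarity over stem sets and a linear scan. The class `EpisodicStore`, its method names and the example stems are hypothetical; the actual store layout inside brain.npz is not shown in this README.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of stems."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class EpisodicStore:
    def __init__(self):
        self.episodes = []   # (cue_stems, surprise_ranked_candidates), oldest first

    def encode(self, cue_stems, ranked_candidates):
        self.episodes.append((frozenset(cue_stems), list(ranked_candidates)))

    def recall(self, question_stems):
        q = set(question_stems)
        best, best_score = None, 0.0
        for cue, candidates in self.episodes:
            score = jaccard(cue, q)
            if score > 0 and score >= best_score:   # ">=" lets the newest win ties
                best, best_score = candidates, score
        if best is None:
            return None
        # Answer with the top-ranked candidate that is not part of the cue.
        return next((w for w in best if w not in q), None)

store = EpisodicStore()
store.encode({"work", "hospital"}, ["work", "hospital"])
store.encode({"dog", "name", "rex"}, ["rex", "dog"])
store.encode({"dog", "name", "max"}, ["max", "dog"])   # later overwrite
print(store.recall({"work"}), store.recall({"dog", "name"}))
```

The second query ties between the two dog episodes, so the newer one answers.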

Arithmetic is routed to a different brain region entirely: gary-neuron, the 26K-param async-NCA adder, imported from the sibling repo (../gary-neuron). Ask "what is 17 + 25?" and a trained net answers 17 + 25 = 42 — no calculator code.
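
The routing step itself can be pictured as a simple dispatcher. This sketch assumes a regex is enough to detect an addition question; `route`, `ADD_RE`, `toy_adder` and `toy_chat` are hypothetical names. The stand-in adder exists only so the sketch runs on its own; in the project that role is played by the trained gary-neuron net.

```python
import re

ADD_RE = re.compile(r"(\d+)\s*\+\s*(\d+)")   # assumed pattern for "a + b"

def route(text, adder, chat):
    """Send addition questions to `adder`, everything else to `chat`."""
    m = ADD_RE.search(text)
    if m:
        a, b = int(m.group(1)), int(m.group(2))
        return f"{a} + {b} = {adder(a, b)}"
    return chat(text)

def toy_adder(a, b):
    # Stand-in for the neural adder, used only to make the sketch runnable.
    return a + b

def toy_chat(text):
    # Stand-in for the dialogue cortex.
    return "(cortex reply)"

print(route("what is 17 + 25?", toy_adder, toy_chat))
print(route("how is the weather today?", toy_adder, toy_chat))
```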

Only ~6K parameters are learned (a 96×64 key projection Wq and three scalars). They are meta-trained on teach → distract → probe episodes where the fact value is randomized every episode — so the only way to drive the loss down is to learn how to store and retrieve through the memory, not to memorize answers. (A fun emergent detail: the recency weight rdec trained positive — it discovered the fact is always the oldest token in an episode, so it rewards distance.)
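
To make the episode structure concrete, here is a hedged sketch of a teach → distract → probe generator with a value that is re-drawn every episode. The word lists, the function `make_episode` and the sentence templates are illustrative assumptions, not the contents of build_episodes.py. The last line checks the parameter count quoted above.

```python
import random

NAMES = ["buddy", "rex", "luna", "max", "mochi"]          # hypothetical value pool
DISTRACTORS = ["how is the weather today?",
               "i like long walks",
               "what time is it?"]

def make_episode(rng, n_distract=2):
    value = rng.choice(NAMES)                  # fresh fact value per episode
    turns = [f"my dog is named {value}"]       # teach
    turns += rng.sample(DISTRACTORS, n_distract)   # distract
    turns.append("what is my dog's name?")     # probe
    return turns, value

rng = random.Random(0)
episodes = [make_episode(rng) for _ in range(200)]

# Learned parameters: a 96x64 key projection plus three scalars.
n_params = 96 * 64 + 3
print(n_params)
```

Because the target changes from episode to episode, a model cannot lower the loss by memorizing any single answer; it has to read the value back from the teach turn.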

Measured results

| Benchmark | Result |
| --- | --- |
| Fact recall (400 held-out episodes), plastic memory ON | 100.0% |
| Fact recall, memory OFF (frozen cortex alone) | 0.0% |
| Live-chat recall (randomized facts incl. multi-token names, 2 distractor turns) | 14/14 ON vs 0/14 OFF |
| False recalls during distractor turns (incl. question distractors) | 0 |
| Small-talk degeneration (14 turns): max consecutive repeated word / distinct-bigram | 1 / 0.97 |
| Long-distance recall after the fact left the 128-token window (v3 episodic store) | works (v2 failed) |
| Cross-session recall: save brain, restart process, ask | gary. / Rex. / hospital. / books. |
| Cortex after v3 burst training (step 222 → 515 on 20.5M tokens) | val loss 3.39 → ~2.88 (ppl ~29.7 → ~17.8) |
| v4 cortex: deepened L=4→6→8 by function-preserving identity surgery (Net2Net-style) + context 128→192, then burst-trained on a 40M-token corpus (SODA valid+test) | val loss 2.42, ppl 11.2; 1,109,952 params |
| Hippo re-meta-trained on the new cortex (held-out episodes) | 100% ON / 0% OFF |
| Sleep consolidation: loss on new "session" material | 5.73 → 4.10 (it learns) |
| Sleep consolidation: base English perplexity (forgetting anchor) | 29.4 → 30.7 (+4%, intact) |
| Cortex params / memory params | 656,448 / 6,147 |
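
The loss and perplexity figures above are consistent with ppl = exp(val loss), assuming the validation loss is mean per-token cross-entropy in nats. A quick check (the helper name `perplexity` is just for illustration):

```python
import math

def perplexity(val_loss):
    # Assumes val_loss is mean cross-entropy per token, measured in nats.
    return math.exp(val_loss)

for loss in (3.39, 2.88, 2.42):
    print(f"val loss {loss:.2f} -> ppl {perplexity(loss):.1f}")
```

This reproduces the reported 29.7, 17.8 and 11.2.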

The ablation is the whole story: the frozen cortex cannot remember a fact across distraction (0%), and the 6K-param plastic layer makes it perfect (100%) — that gap is the learning. And sleep teaches new material while base English barely moves, because experience replay holds the stability–plasticity line.
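
Experience replay during sleep amounts to building training batches that interleave new session material with base-corpus samples. The sketch below assumes a fixed 50/50 mix and uniform sampling; the ratio, the function `replay_batches` and the toy data are assumptions, since the README does not state the actual mixing schedule.

```python
import numpy as np

def replay_batches(session, base, batch_size=8, base_frac=0.5, n_batches=4, seed=0):
    """Yield batches mixing new session samples with base-corpus samples."""
    rng = np.random.default_rng(seed)
    n_base = int(round(batch_size * base_frac))
    n_new = batch_size - n_base
    for _ in range(n_batches):
        new_idx = rng.integers(0, len(session), size=n_new)
        base_idx = rng.integers(0, len(base), size=n_base)
        # Tag each sample with its origin so the mix is easy to inspect.
        yield ([("session", session[i]) for i in new_idx] +
               [("base", base[i]) for i in base_idx])

session = ["my dog is named buddy", "i work at a hospital"]
base = ["hello there", "how are you", "nice to meet you"]
batches = list(replay_batches(session, base))
print(len(batches), [tag for tag, _ in batches[0]])
```

Keeping base samples in every batch is what anchors general English while the new material is learned.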

The autograd for the memory is hand-written and finite-difference gradient-checked to 1e-5.
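
For readers unfamiliar with gradient checking, here is a generic example of the technique on a tiny softmax cross-entropy layer. It is not the project's memory autograd; the model, the names `loss_and_grad` and `numeric_grad`, and the step size are assumptions chosen to keep the example short.

```python
import numpy as np

def loss_and_grad(W, x, target):
    """Softmax cross-entropy of logits = W @ x, with a hand-written gradient."""
    logits = W @ x
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    loss = -np.log(p[target])
    dlogits = p.copy()
    dlogits[target] -= 1.0                 # d loss / d logits
    return loss, np.outer(dlogits, x)      # d loss / d W

def numeric_grad(W, x, target, eps=1e-5):
    """Central finite differences, one weight at a time."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        loss_plus, _ = loss_and_grad(W, x, target)
        W[idx] = old - eps
        loss_minus, _ = loss_and_grad(W, x, target)
        W[idx] = old                       # restore the weight
        g[idx] = (loss_plus - loss_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
_, analytic = loss_and_grad(W, x, target=1)
numeric = numeric_grad(W, x, target=1)
max_err = float(np.abs(analytic - numeric).max())
print(max_err < 1e-5)
```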

Talk to it

pip install numpy tokenizers
python brain.py chat     # interactive; tell it facts, ask later, type /sleep to consolidate, /exit to save
python brain.py demo     # scripted: teach a fact, distract, ask -- memory ON vs OFF
you : my dog is named Buddy
you : how is the weather today?      (distraction -- no false recall)
you : what is my dog's name?
gary: Buddy!
you : my name is gary
you : what is my name?
gary: gary!                          (multi-token value, copy-chained)

Honest caveats: the cortex is 656K params, so free-form generation is rough (petite-grade word salad) — but it no longer loops or collapses (v2 decoder: temperature sampling, repetition & frequency penalties, trigram blocking). The rigorous, measured capabilities are fact recall in live chat (tables above) and non-forgetting consolidation. This is a mechanism demonstrator, not a polished assistant.
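
The decoder ingredients named above can be combined roughly as follows. This is a sketch under assumptions: the penalty values are made up, the repetition penalty uses the common divide-positive / multiply-negative form, and `sample_next` is a hypothetical name rather than the function in brain.py.

```python
import numpy as np

def sample_next(logits, history, rng, temperature=0.8, rep_penalty=1.3, freq_penalty=0.2):
    logits = np.asarray(logits, dtype=float).copy()
    counts = np.bincount(np.asarray(history, dtype=np.int64), minlength=len(logits))
    seen = counts > 0
    # Repetition penalty: make every already-emitted token less likely.
    logits[seen] = np.where(logits[seen] > 0,
                            logits[seen] / rep_penalty,
                            logits[seen] * rep_penalty)
    # Frequency penalty: scale with how often the token was emitted.
    logits -= freq_penalty * counts
    # Trigram blocking: forbid any token that would repeat an earlier trigram.
    if len(history) >= 2:
        a, b = history[-2], history[-1]
        for i in range(len(history) - 2):
            if history[i] == a and history[i + 1] == b:
                logits[history[i + 2]] = -np.inf
    # Temperature sampling from the adjusted distribution.
    z = (logits - logits.max()) / temperature
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
history = [1, 2, 3, 1, 2]                  # emitting 3 next would repeat "1 2 3"
logits = [0.0, 0.0, 0.0, 50.0, 0.0]        # the raw model strongly prefers token 3
draws = [sample_next(logits, history, rng) for _ in range(200)]
print(3 in draws)
```

Even though token 3 has by far the highest raw logit, the trigram block sets its probability to zero, so it is never sampled here.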

v3 rebuild notes

v2's recall only worked while the fact was still inside the 128-token context window — ask "what is my name?" twice, eight turns apart, and the second ask silently failed. v3 adds the persistent episodic store (encode-at-store-time, retrieve-by-cue), routes arithmetic to gary-neuron, accepts question-word questions without "?", and trains the cortex further (val 3.39 → ~2.88).

v4.2: it can actually hold a conversation now

Beyond fact recall, the deployed model handles real dialogue moves, all in the wrapper around the frozen cortex:

  • Yes/no questions with coverage logic: "am I a nurse?" → yes; "am I a doctor?" → not that i know; a false premise gets a soft correction ("is my cat named Rex?" → what i remember is Luna).
  • Corrections that chain: "no wait it's Garrett" rebinds the last answer; Gary → Garrett → Gareth all stick.
  • Introductions & age: "i'm Aiko" / "i'm 28" parse to name / age attributes.
  • Coreference: "i adopted a puppy" … "her name is Mochi" → "what's my puppy's name?" → Mochi.
  • Self-knowledge: "are you a real person?" → an honest canned answer.
  • Accented names (Tomás, José, Zürich) survive the byte-level tokenizer via span-decoding.
  • Arithmetic routes to the bundled gary-neuron adder (included in gary-neuron/ so math works standalone): "what is 123 + 456?" → 579.

These behaviors were found and fixed by chatting with the model across three invented personas (a nurse, a retired teacher, a software engineer); the regression suite passes 17 of 18 cases.

v4: a deeper brain, same memories

The cortex was deepened from 4 to 6 layers by function-preserving surgery: two new blocks were inserted with zero-initialized output projections, so the network computes the exact same function on day one (val loss identical to 6 decimals). A ~1200-step burst campaign on a doubled 40M-token corpus then trained the network into the new capacity. Perplexity: 29.7 (v2) → 17.8 (v3) → 11.7 (v4). The memory gate also generalized: any question may consult the episodic store unless it's about gary himself. As a result, taught world-facts ("today is friday", "the meaning of life is 42") are now recallable, irregular verbs match (drank→drink), and subject-position matches outrank recency ("what is today?" → "friday", not the latest episode mentioning today).
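
Why zero-initialized output projections preserve the function can be seen on a toy residual stack. This is a simplified illustration, assuming each block has the form x + f(x) with a final linear projection; the real cortex blocks (attention, normalization) are more involved, and all names and sizes here are hypothetical.

```python
import numpy as np

def block(x, W_in, W_out):
    # Residual block: the branch output is added back onto the input.
    return x + np.tanh(x @ W_in) @ W_out

def forward(x, blocks):
    for W_in, W_out in blocks:
        x = block(x, W_in, W_out)
    return x

rng = np.random.default_rng(0)
d, h = 8, 16

def new_block(zero_out=False):
    W_in = rng.normal(size=(d, h)) * 0.1
    # A zero output projection makes the branch contribute exactly nothing.
    W_out = np.zeros((h, d)) if zero_out else rng.normal(size=(h, d)) * 0.1
    return W_in, W_out

shallow = [new_block() for _ in range(4)]
deep = shallow[:2] + [new_block(zero_out=True)] + shallow[2:] + [new_block(zero_out=True)]

x = rng.normal(size=(3, d))
max_diff = float(np.abs(forward(x, shallow) - forward(x, deep)).max())
print(len(shallow), len(deep), max_diff)
```

The 6-block stack reproduces the 4-block output exactly at insertion time; training can then move the new projections away from zero.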

v3.1 fixes (live-chat shakedown)

A long hands-on test session surfaced and fixed the following:

  • Relation binder rules: an episode about your sister can never answer "what is my name?", and "my cat's name?" can't be answered by a dog memory (both directions enforced).
  • Unified window-vs-store match scoring: a weak in-window match no longer shadows a strong stored memory.
  • Fact overwrites ("actually my dog is named Max now") resolve by overlap → specificity → recency.
  • A tokenizer bug where stripping possessive-'s turned "is" into "i" and leaked user-facts into "what is your name?".
  • Values can't be filler/relation words.
  • Sums ≥ 10^7 decline honestly.

v2 rebuild notes

v1's interactive chat death-spiraled ("hi hi hi...") for three compounding reasons: greedy decoding, a memory bias applied every step over every token — including its own outputs (one emitted "hi" became a key voting for the next), and single-token votes that couldn't emit multi-token facts. v2 keys the memory on your words only, fires it once per reply through the gate above, copy-chains whole words, and samples with anti-repetition penalties.

Reproduce the whole thing

The full pure-numpy pipeline is in training/ (the cortex warm-starts from gary-4-petite):

cd training
python build_corpus.py        # SODA (vocab-filtered) + Persona-Chat -> U:/G: dialogue
python retok_warmstart.py      # tokenize + warm-start cortex from petite
SECONDS=40 E=96 H=4 L=4 BLK=128 python train_burst.py   # fine-tune cortex (repeat ~5x)
python build_episodes.py train 2000 && python build_episodes.py val 400
python hippo_train.py          # meta-train the plastic memory (gradcheck + recall)
python benchmark.py            # recall ON/OFF + sleep-without-forgetting

Data

  • Cortex dialogue: SODA (AllenAI, EMNLP 2023), vocabulary-filtered to an everyday ~4k lexicon ("SODA-lite", the corpus-simplicity lever from TinyStories), seasoned with Synthetic-Persona-Chat (Google).
  • Memory meta-training: synthetic teach→distract→probe episodes generated in that vocabulary, fact values randomized per episode.

Sibling models

  • gary-4-petite — the 656K-param cortex this fine-tunes.
  • gary-neuron — an async neural-cellular-automaton + mixture-of-experts that does 7-digit arithmetic.

Citations

  • Test-Time Training / fast-weights = linear attention — Sun et al., Learning to (Learn at Test Time) (2024); Test-Time Training Done Right (2025).
  • Titans: Learning to Memorize at Test Time — Behrouz et al., arXiv:2501.00663 (surprise-gated neural memory).
  • Differentiable plasticity / Backpropamine — Miconi et al. (2018); arXiv:2002.10585 (ICLR 2019) — training self-modifying Hebbian networks with gradient descent.
  • Sleep-like replay reduces catastrophic forgetting — Tadros et al., Nature Communications (2022).
  • SODA — Kim et al., arXiv:2212.10465 (EMNLP 2023). TinyStories (corpus simplicity) — Eldan & Li (2023).

Built with numpy. The brain is a file. It changes.
