gary-neuron 🧠➕

A mesh of ~100 neurons that fire asynchronously and, between them, do arithmetic.

gary-neuron is an asynchronous Neural Cellular Automaton (NCA) whose per-cell update rule is a top-2 Mixture-of-Experts. It is not a transformer. It adds integers the way a ripple-carry adder in silicon does, by letting a carry ripple across a strip of cells, and it does so in 26,448 parameters of pure numpy (34 KB as int8), with a hand-written autograd engine and zero ML frameworks.

Same numpy-only soul as gary-4-petite. Different question. petite asked "can a tiny model speak?" gary-neuron asks "what if the model isn't one network, but a mesh of tiny neurons firing out of sync — can that compute?"

It can. 99.97% exact-match on held-out 7-digit addition; 100% with a 9-vote ensemble.

The three ideas

gary-neuron is the intersection of three research lines, each contributing one piece:

Idea	What it gives	Source
Neural Cellular Automaton	A strip of identical cells with one shared local update rule. Cell i perceives only `[left, self, right]`.	Mordvintsev et al., Growing NCA (Distill 2020); Self-Organising Textures (Distill 2021)
Asynchrony	Each step, only a random subset of cells fires. This breaks grid symmetry, buys robustness, and, crucially, lets carries settle in any order.	Mesh Neural Cellular Automata (arXiv:2311.02820, ACM TOG 2024)
Mixture-of-Experts rule	The shared rule is a router + K=6 experts, top-2 gating — so each firing cell uses only some of its neurons. A load-balancing loss makes them specialize.	Shazeer et al., Sparsely-Gated MoE (2017); Fedus et al., Switch Transformer (2021)
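
How the three pieces fit together in a single update can be sketched in a few lines of numpy. This is an illustrative sketch, not the released engine: the weights are random stand-ins, and the zero-padding at the strip edges, the ReLU, the residual update, and every name (`async_moe_step`, `W_router`, `W1`, `W2`) are assumptions. Only the sizes (8 cells, state dim 32, 6 experts, top-2, 3d→32→d) and the firing probability 0.5 come from this README.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D, K, H = 8, 32, 6, 32   # cells, state dim, experts, expert hidden width

# Randomly initialised stand-in weights (hypothetical, not the released ones).
W_router = rng.normal(0, 0.1, (3 * D, K))
W1 = rng.normal(0, 0.1, (K, 3 * D, H))
W2 = rng.normal(0, 0.1, (K, H, D))

def async_moe_step(state, p_fire=0.5):
    """One asynchronous step: each cell fires with probability p_fire."""
    # Perception: concatenate [left, self, right]; zero-pad the strip edges (assumption).
    padded = np.pad(state, ((1, 1), (0, 0)))
    percept = np.concatenate([padded[:-2], padded[1:-1], padded[2:]], axis=1)  # (S, 3D)

    # Router: pick the top-2 experts per cell, softmax over just those two logits.
    logits = percept @ W_router                       # (S, K)
    top2 = np.argsort(logits, axis=1)[:, -2:]         # (S, 2)
    top_logits = np.take_along_axis(logits, top2, axis=1)
    gates = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)

    # Apply only the two selected expert MLPs (3D -> H -> D) to each cell.
    delta = np.zeros_like(state)
    for slot in range(2):
        e = top2[:, slot]                             # expert index per cell
        hidden = np.maximum(0.0, np.einsum("sd,sdh->sh", percept, W1[e]))
        out = np.einsum("sh,shd->sd", hidden, W2[e])
        delta += gates[:, slot:slot + 1] * out

    # Asynchrony: only a random subset of cells commits its update.
    fired = rng.random(S) < p_fire
    return state + fired[:, None] * delta, fired, top2

state = rng.normal(0, 1, (S, D))
new_state, fired, top2 = async_moe_step(state)
```

Cells that did not fire keep their previous state unchanged, which is what makes the firing order matter at all.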

And the task itself rides on a fourth:

Reversed-digit format. The answer is emitted least-significant digit first: 12+34 = 46 is written as 64. This is the single change that flips tiny-model addition from "never quite right" to a sharp phase transition to ~100%, because the model predicts the least-significant digit first, the same direction in which carries flow. (Lee et al., "Teaching Arithmetic to Small Transformers", arXiv:2307.03381.)
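
As a small illustration of the reversed-digit layout, the helpers below map an integer to one digit per cell, least-significant first, and back. The function names are hypothetical and are not taken from the repo.

```python
def to_reversed_digits(n, width):
    """Digits of n, least-significant first, zero-padded to `width` cells."""
    return [(n // 10**i) % 10 for i in range(width)]

def from_reversed_digits(digits):
    """Inverse of to_reversed_digits."""
    return sum(d * 10**i for i, d in enumerate(digits))

a, b = 12, 34
print(to_reversed_digits(a, 3))      # [2, 1, 0]
print(to_reversed_digits(a + b, 3))  # [6, 4, 0]  (46 read from the low end)
```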

The beautiful part: addition-with-carry is a cellular automaton. Cell i holds digit i of each operand; it needs its own two digits and the carry from cell i−1. Carry propagation is local message-passing. So the NCA substrate isn't a gimmick bolted onto arithmetic — it's the natural shape of the problem.
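
To make the "addition is a cellular automaton" claim concrete, here is a hand-written (not learned) version of the rule, run asynchronously. It is a sketch under stated assumptions: each cell stores an output digit and a carry-out, reads only the carry of its lower neighbour, and fires with probability 0.5. The function name and state layout are made up for illustration and are not the model's learned representation.

```python
import numpy as np

def carry_ca_add(a, b, width=8, p_fire=0.5, seed=0):
    """Add a and b with a local carry rule, firing cells in random order."""
    rng = np.random.default_rng(seed)
    da = np.array([(a // 10**i) % 10 for i in range(width)])
    db = np.array([(b // 10**i) % 10 for i in range(width)])
    digit = np.zeros(width, dtype=int)
    carry = np.zeros(width, dtype=int)   # carry[i] = carry out of cell i
    steps = 0
    while True:
        # Each cell sees its own two digits and the carry of cell i-1.
        carry_in = np.concatenate([[0], carry[:-1]])
        total = da + db + carry_in
        new_digit, new_carry = total % 10, total // 10
        if np.array_equal(new_digit, digit) and np.array_equal(new_carry, carry):
            break                        # fixed point: no cell would change
        fired = rng.random(width) < p_fire
        digit = np.where(fired, new_digit, digit)
        carry = np.where(fired, new_carry, carry)
        steps += 1
    return int(sum(int(d) * 10**i for i, d in enumerate(digit))), steps

result, steps = carry_ca_add(9999999, 1)
```

The only fixed point of this rule is the true sum (modulo 10^width), so any firing order reaches the same answer; the order only changes how many steps it takes. For 9999999 + 1 the carry has to cross all eight cells one hop at a time, so at least 8 steps are needed.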

Stats


Parameters	26,448
Weights (int8)	34 KB
Full release (model + engine + trainer)	~40 KB
Architecture	async 1-D NCA, 8 cells · state dim 32 · 6 experts (top-2) · 3d→32→d expert MLPs
Substrate	reversed-digit strip, carry ripples low→high
Training	pure-numpy, CPU only, ~9k steps in 35-s bursts, from-scratch autograd
Inference	numpy. that's it. no tokenizer, no torch.
Hardware	anything that runs python

The 6 experts end up evenly used (utilization 0.16–0.18 each) — the mesh genuinely distributes work across specialists rather than collapsing to one.
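
For reference, perfectly even use of 6 experts would be 1/6 ≈ 0.167 each. The sketch below shows one way to measure utilization from router logits, together with a Switch-Transformer-style auxiliary loss adapted to top-2 routing. The exact loss used in train.py is not stated in this README, so treat the formula, the function name, and the random logits as assumptions.

```python
import numpy as np

def routing_stats(router_logits, k=2):
    """router_logits: (n_cells, n_experts). Returns (utilization, aux_loss)."""
    n, n_experts = router_logits.shape
    z = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    topk = np.argsort(router_logits, axis=1)[:, -k:]
    counts = np.bincount(topk.ravel(), minlength=n_experts)
    utilization = counts / counts.sum()      # f_i: share of routing slots per expert
    mean_prob = probs.mean(axis=0)           # P_i: mean router probability per expert
    aux_loss = n_experts * np.sum(utilization * mean_prob)
    return utilization, aux_loss

rng = np.random.default_rng(0)
utilization, aux_loss = routing_stats(rng.normal(size=(1000, 6)))
print(np.round(utilization, 3), round(float(aux_loss), 3))
```

When both the routing shares and the mean probabilities are uniform, the auxiliary term equals 1; it grows as the routing concentrates on experts that also receive high router probability.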

How well it adds (measured, held-out, never-trained pairs)

The test space is ~10¹⁴ operand pairs; random train/test overlap is negligible.

Benchmark	Result
Held-out 10k, ≤7-digit, single async order	99.97% exact-match (mean over 8 random orders, std 0.02%)
Held-out 10k, 9-vote async ensemble	100.000% exact-match
Exact-match by operand length (1→7 digits)	99.9% – 100% across the board
Adversarial maximal-carry ripples (22 hand-picked)	21/22 (the one miss is an 8-digit input — out of range for an 8-cell strip)
Random spot-check, 300 sums, vote(9)	300/300

Robustness to update order is the headline result for an asynchronous CA: across 8 different random firing orders, exact-match has a standard deviation of only 0.02%. The result is almost independent of when each neuron fires.
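
The 9-vote ensemble in the table can be pictured as a plain majority vote over independent async runs. The sketch below assumes that reading; `vote` and `noisy_add` are hypothetical names, and `noisy_add` is only a stand-in for one stochastic run of the mesh.

```python
from collections import Counter
import random

def vote(solve_once, a, b, n_votes=9, seed=0):
    """Run a stochastic solver n_votes times and return (most common answer, count)."""
    rng = random.Random(seed)
    answers = [solve_once(a, b, rng) for _ in range(n_votes)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count

# Hypothetical stand-in for a single async run: right most of the time,
# otherwise off by a power of ten.
def noisy_add(a, b, rng, error_rate=0.1):
    if rng.random() > error_rate:
        return a + b
    return a + b + 10 ** rng.randrange(8)

winner, count = vote(noisy_add, 48591, 9732)
```

Because independent wrong answers rarely agree with each other, the correct sum only needs to be the most frequent answer, not a strict majority.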

Train short, run a little longer

The mesh is trained at 20 async steps but you can run it longer at inference — classic NCA "iterate toward a fixed point":

steps :  8     12     16     20     24     28
exact%: 84.7   98.7   99.9   99.95  99.97  99.94

24 steps is the sweet spot; beyond that accuracy drifts slightly, reaching 99.94% at 28 steps (it's a learned attractor, not a perfect fixed point). The released engine defaults to 24.
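
Why a small step budget hurts can be seen even without the learned model. The sketch below sweeps the step budget for a hand-written carry rule (each cell stores a digit and a carry-out, fires with probability 0.5). It is an assumption-laden stand-in: the numbers it prints are not the model's, and unlike the learned rule it has an exact fixed point, so it never drifts with extra steps.

```python
import numpy as np

def exact_rate(n_steps, n_pairs=200, width=8, p_fire=0.5, seed=0):
    """Fraction of random <=7-digit pairs the hand-written rule solves in n_steps."""
    rng = np.random.default_rng(seed)
    place = 10 ** np.arange(width)
    correct = 0
    for _ in range(n_pairs):
        a, b = (int(v) for v in rng.integers(0, 10**7, size=2))
        da, db = (a // place) % 10, (b // place) % 10
        digit = np.zeros(width, dtype=np.int64)
        carry = np.zeros(width, dtype=np.int64)
        for _ in range(n_steps):
            carry_in = np.concatenate([[0], carry[:-1]])   # carry from cell i-1
            total = da + db + carry_in
            fired = rng.random(width) < p_fire             # async subset
            digit = np.where(fired, total % 10, digit)
            carry = np.where(fired, total // 10, carry)
        correct += int(digit @ place) == a + b
    return correct / n_pairs

for steps in (4, 8, 16, 32, 64):
    print(steps, exact_rate(steps))
```

With too few steps some cells never fire, or fire before their incoming carry has settled, which is the same reason the mesh needs more than 8 steps for 8 cells.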

Running fully synchronously (every cell fires every step) is worse, not better: the model learned to rely on asynchrony, the same symmetry-breaking effect described in the asynchronous NCA literature.

Watch the mesh think

python solve.py 9999999 1 --show runs the hardest case — a single +1 that must ripple a carry through all 8 cells — and prints every step. · = a cell that didn't fire that step; the number on the right is the live readout.

  9999999 + 1   (mesh = 8 cells, 6 experts, top-2, async p=0.5, 24 steps)
  digit place (10^):   7  6  5  4  3  2  1  0
  ----------------------------------------------------
  step  0 digits: 1  1  1  1  9  9  9  1   |  fired(expert#): ·  ·  ·  ·  4  4  4  ·   = 11119991
  step  4 digits: 0  1  9  9  9  9  9  0   |  fired(expert#): 4  ·  ·  5  5  ·  ·  2   = 1999990
  step  8 digits: 1  9  9  9  9  0  0  0   |  fired(expert#): ·  ·  5  5  ·  4  ·  4   = 19999000
  step 12 digits: 0  9  0  0  0  0  0  0   |  fired(expert#): ·  ·  3  4  ·  ·  ·  4   = 9000000
  step 16 digits: 1  0  0  0  0  0  0  0   |  fired(expert#): 4  2  ·  ·  ·  ·  0  0   = 10000000
  step 23 digits: 1  0  0  0  0  0  0  0   |  fired(expert#): ·  2  ·  ·  ·  1  1  ·   = 10000000
  ----------------------------------------------------
  => 9999999 + 1 = 10000000   OK

You can see the carry climb from cell 0 to cell 7 and the readout lock onto 10000000 by step ~16, then hold steady: a stable attractor. Different experts (4, 5, 2, 3, 0, 1) fire at different cells: only some cells fire at each step, and which experts they use depends on the local situation.

Run it

pip install numpy
python solve.py 1234567 + 7654321      # -> 8888888
python solve.py 9999999 1 --show       # watch the carry ripple, step by step
python solve.py --vote 9 48591 + 9732  # robust ensemble over 9 async orders
python solve.py                        # interactive

No tokenizer, no weights download step beyond this repo, no GPU.

Reproduce / keep training it

The full pure-numpy pipeline is in training/ — including the from-scratch reverse-mode autograd (garyneuron.py), the finite-difference gradient check (test_grad.py), the carry-heavy hard-case miner (data.py), and the benchmark harness.

cd training
python test_grad.py                                  # verify the autograd (analytic vs numeric)
SEC=40 MAXDIG=7 HARD=0.35 python train.py            # one 40-s training burst (resumes from ckpt)
python benchmark.py main                             # held-out + adversarial + by-length
python export_int8.py                                # re-quantize -> release
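
The idea behind a finite-difference gradient check such as test_grad.py can be sketched as follows. This is a generic illustration on a toy loss, not the repo's test: `numeric_grad`, the toy function, and the tolerance are assumptions.

```python
import numpy as np

def numeric_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx for a scalar-valued f (x is restored)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Toy loss: sum(tanh(x @ W) ** 2), with a hand-derived gradient with respect to W.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

def loss(W):
    return np.sum(np.tanh(x @ W) ** 2)

h = np.tanh(x @ W)
analytic = x.T @ (2 * h * (1 - h ** 2))   # chain rule through tanh and the matmul
numeric = numeric_grad(loss, W)
rel_err = np.abs(analytic - numeric).max() / (np.abs(analytic).max() + 1e-12)
print(f"max relative error: {rel_err:.2e}")
```

A from-scratch autograd passes this kind of check when the relative error stays many orders of magnitude below the gradient's own scale.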

Trained and served entirely in numpy. The autograd, the MoE, the async CA, the int8 packing — all of it, ~700 lines, no frameworks.
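
The README does not describe the int8 packing scheme, so the following is only a sketch of one common choice, symmetric per-tensor quantization with a single float scale. The function names and the example tensor shape are hypothetical, and export_int8.py may work differently.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float array -> (int8 array, scale)."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float32 tensor from its int8 form."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(96, 32)).astype(np.float32)
q, scale = quantize_int8(w)
max_err = np.abs(dequantize_int8(q, scale) - w).max()
```

Each weight then costs one byte, and the rounding error per weight is bounded by half the scale.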

What it can't do (yet)

8 cells = 8 output digits. Sums ≥ 10⁸ don't fit; widen the strip (S, the number of cells) and retrain.
The single hardest full-length ripple is right at the edge of the 24-step dynamics; the 9-vote ensemble cleans it up, but a maximally adversarial carry chain longer than the strip will defeat a fixed step budget. (This is the known hard case for any fixed-iteration local model.)
It adds. That's the whole job. Subtraction/multiplication are future meshes.
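
To put a number on how hard a given pair is for a fixed step budget, one can measure its longest carry chain. The helper below is a hypothetical illustration; whether the hard-case miner in data.py uses the same measure is not stated here.

```python
def longest_carry_chain(a, b):
    """Longest run of consecutive digit positions that each produce a carry."""
    carry, run, best = 0, 0, 0
    while a or b or carry:
        total = a % 10 + b % 10 + carry
        carry = total // 10
        run = run + 1 if carry else 0
        best = max(best, run)
        a, b = a // 10, b // 10
    return best

print(longest_carry_chain(9999999, 1))        # 7
print(longest_carry_chain(1234567, 7654321))  # 0
```

A chain of length n needs at least n sequential cell firings to resolve, so the longer the chain, the closer a pair sits to the edge of the step budget.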

Why this exists

To show that "intelligence" at tiny scale doesn't have to be one monolithic network. gary-neuron is a hundred-odd neurons, firing out of sync, passing notes to their neighbors, and the collective computes something exact. It's a toy — but it's a toy that makes the mesh-of-specialists idea concrete, measurable, and 34 KB.

Citations

N. Lee, K. Sreenivasan, J. D. Lee, K. Lee, D. Papailiopoulos. Teaching Arithmetic to Small Transformers. arXiv:2307.03381 (2023).
A. Mordvintsev, E. Niklasson, et al. Growing Neural Cellular Automata. Distill (2020). · E. Niklasson et al. Self-Organising Textures. Distill (2021).
E. Pajouheshgar, Y. Xu, A. Mordvintsev, E. Niklasson, T. Zhang, S. Süsstrunk. Mesh Neural Cellular Automata. arXiv:2311.02820, ACM TOG (2024).
N. Shazeer et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. (2017). · W. Fedus, B. Zoph, N. Shazeer. Switch Transformers. (2021).

Built with numpy. That's it.
