NCA 3D Brain: Neural Cellular Automata for Language
A radically different neural architecture: a 3D grid of "mini-neurons" that learns language through local wave propagation, inspired by protein folding and biological neural communication.
Information propagating as waves through the 3D grid, from input (z=0) to output (z=15)
What is this?
A cube of 4,096 "mini-neurons" (16×16×16 grid) where each cell only communicates with its immediate neighbors. Information travels as waves through the cube. There is no attention mechanism (unlike Transformers) and no stack of sequential layers: it's a mathematical organism where language understanding emerges from local signal propagation.
35.4M parameters. Generates coherent English phrases of 6+ words.
This is not a Transformer. This is not an RNN. This is a 3D cellular automaton that learned to speak.
Key Results
| Metric | Value |
|---|---|
| Eval word accuracy | 10.7% (over 30K vocabulary) |
| Train word accuracy | 16.9% |
| Train/eval gap | 1.6x (best generalization of all versions) |
| Longest coherent output | "she started to play together again" (6 words) |
| Parameters | 35.4M |
| Model size | 68 MB |
| Grid | 16×16×16 = 4,096 cells |
| Cell dimension | 256 |
| Training | 16 epochs, ~10.7h on NVIDIA B200 |
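As a quick sanity check on the table, the train/eval gap and a uniform-guess baseline can be derived from the reported accuracies. This is only a sketch: the vocabulary size of 30,006 is taken from the `word_embed` shape listed under Model Architecture Details, and uniform guessing is a deliberately weak baseline (always predicting a very frequent word would score higher).

```python
train_acc = 0.169
eval_acc = 0.107
vocab_size = 30_006  # assumption: taken from the word_embed shape in the architecture table

gap = train_acc / eval_acc   # about 1.58, reported as 1.6x
chance = 1.0 / vocab_size    # accuracy of guessing uniformly at random
lift = eval_acc / chance     # how many times better than uniform guessing

print(f"gap={gap:.2f}x  chance={chance:.4%}  lift={lift:.0f}x")
```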
Grammar emergence
The model learned grammatical categories without explicit supervision:
"the" + _ → nouns/adjectives (correct category)
"the big" + _ → nouns ("dog", "house", "girl")
"the dog" + _ → verbs ("was", "ran", "had")
"she wanted" + _ → "to" (infinitive structure)
Best generations
"she started to play together again"
"the little girl wanted to play with her parents"
"he said that he was very happy"
"in the morning she went to the garden"
Architecture
INPUT (face z=0)        THINKING (interior)     OUTPUT (face z=15)
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│ Tokens are  │   ──→   │ 3D waves    │   ──→   │ Prediction  │
│ injected    │  waves  │ propagate   │  waves  │ is read     │
│ into skin   │         │ N steps     │         │ from pole   │
└─────────────┘         └─────────────┘         └─────────────┘
How it works
- Injection: Input tokens are embedded and injected into the z=0 face of the cube
- Propagation: For N steps, each cell updates based on its 26 neighbors via Conv3d
- Reading: The opposite face (z=15) is average-pooled and projected to vocabulary logits
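The three steps above can be illustrated with a toy, pure-Python stand-in. This is not the model's code: a fixed neighborhood-averaging rule replaces the learned Conv3d, each cell holds one scalar instead of 256 channels, and the names `local_step` and `run_toy_nca` are hypothetical. It only shows how a signal injected on one face travels to the opposite face through local updates.

```python
from itertools import product

def local_step(grid, size):
    """One update: each cell becomes the mean of its 3x3x3 neighborhood (zero padding)."""
    new_grid = {}
    for x, y, z in product(range(size), repeat=3):
        total = 0.0
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            total += grid.get((x + dx, y + dy, z + dz), 0.0)
        new_grid[(x, y, z)] = total / 27.0
    return new_grid

def run_toy_nca(signal, size, n_steps):
    """Inject on z=0, propagate n_steps, and read the z=size-1 face after every step."""
    grid = {cell: 0.0 for cell in product(range(size), repeat=3)}
    # Injection: write the input signal onto the z=0 face.
    for x, y in product(range(size), repeat=2):
        grid[(x, y, 0)] = signal
    readouts = []
    for _ in range(n_steps):
        # Propagation: a purely local update, no global communication.
        grid = local_step(grid, size)
        # Reading: average-pool the opposite face.
        face = [grid[(x, y, size - 1)] for x, y in product(range(size), repeat=2)]
        readouts.append(sum(face) / len(face))
    return readouts

readouts = run_toy_nca(signal=1.0, size=6, n_steps=6)
first_arrival = next(i + 1 for i, r in enumerate(readouts) if r > 0.0)
print(first_arrival)  # 5: the wave needs size - 1 steps to cross the cube
```

With this undilated toy rule, a 16-deep grid would need 15 steps before anything reaches the output face, which is the same number as the `n_steps=15` used in Quick Start.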
Key innovations
- Dilated convolutions with cycle [1, 2, 4, 8]: in 4 steps each cell "sees" the entire grid
- Synaptic fatigue: 2 dedicated channels that inhibit over-firing cells, preventing repetition
- Dual chemical pathways: Standard Conv3d + depthwise-separable Conv3d (diversity in transition rules)
- Gated residual updates: Each cell decides how much to change per step via sigmoid gate
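Two of the items above lend themselves to a small worked check. The sketch below first enumerates which 1D offsets a cell can reach after one pass through the dilation cycle with kernel size 3, then shows one plausible scalar form of a gated residual update. The function names are hypothetical, and the exact gating formula used by the model is an assumption here, not something the description confirms.

```python
import math
from itertools import product

def reachable_offsets(dilations):
    """1D offsets reachable by stacking kernel-size-3 convolutions with these dilations."""
    return {
        sum(tap * d for tap, d in zip(taps, dilations))
        for taps in product((-1, 0, 1), repeat=len(dilations))
    }

offsets = reachable_offsets([1, 2, 4, 8])
radius = max(offsets)                         # 1 + 2 + 4 + 8 = 15
covers_grid = offsets == set(range(-15, 16))  # every offset up to 15 cells away

def gated_residual_update(state, delta, gate_logit):
    """Assumed form: apply only a sigmoid-gated fraction of the proposed change."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return state + gate * delta

print(radius, covers_grid)  # 15 True
```

Since the largest distance between two cells along one axis of a 16-wide grid is 15, a radius of 15 with no gaps is what makes the "entire grid in 4 steps" claim work out along each axis.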
Emergent phenomena
3D brain map showing where "good" sentences activate more (red) vs "bad" sentences (blue)
- Functional hemispheres: x=12 region produces better language than x=6
- 3 thinking phases: chaos (steps 1-5) → eureka (6-7) → decision (8-15)
- Grammar in center, semantics in periphery of the grid
- Semantic clustering: animals, family, nature, objects form distinct spatial clusters
- Emotion highway: emotional content activates a specific depth layer (z=12)
Channel specialization: grammar channels (red) vs semantic channels (blue). The model spontaneously separated syntax from meaning.
Quick Start
import json
import torch
from model import NCA3D_Fatigue

# Load dictionary (word -> id) and build the reverse mapping (id -> word)
with open("word_dictionary_30k.json") as f:
    word2num = {k: int(v) for k, v in json.load(f).items()}
num2word = {v: k for k, v in word2num.items()}

# Load model weights on CPU and switch to inference mode
model = NCA3D_Fatigue()
model.load_state_dict(torch.load("model_phase4c_v5_fatigue_best.pt", map_location="cpu"))
model.eval()

# Predict next word
context = ["the", "little", "girl"]
ids = [word2num[w] for w in context]
with torch.no_grad():
    logits = model(torch.tensor([ids]), n_steps=15)
pred_id = logits.argmax(-1).item()
print(f"'{' '.join(context)}' → '{num2word.get(pred_id, '?')}'")

# Generate a sequence
from inference import generate
print(generate(model, word2num, num2word, ["she", "wanted", "to"], max_words=8))
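The implementation of `generate` in `inference` is not shown here. A minimal sketch of what a greedy autoregressive loop of this kind usually looks like is given below; `greedy_generate` and `toy_predictor` are hypothetical names, and the toy predictor stands in for a real call to the model.

```python
from typing import Callable, List

def greedy_generate(predict_next: Callable[[List[str]], str],
                    prompt: List[str], max_words: int = 8) -> str:
    """Append one predicted word at a time, feeding each prediction back as context."""
    words = list(prompt)
    for _ in range(max_words):
        words.append(predict_next(words))
    return " ".join(words)

# Stand-in predictor for illustration only; a real one would run the model
# on the context and map the argmax of the logits back to a word.
def toy_predictor(context: List[str]) -> str:
    table = {"to": "play", "play": "together", "together": "again"}
    return table.get(context[-1], "again")

print(greedy_generate(toy_predictor, ["she", "wanted", "to"], max_words=3))
# she wanted to play together again
```

Because every predicted word becomes part of the next context, one wrong word affects everything after it, which is consistent with the quality drop noted under Limitations.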
Model Architecture Details
Component                  Shape                            Params
──────────────────────────────────────────────────────────────────
word_embed                 Embedding(30006, 384)            11.5M
embed_proj                 Linear(384, 256)                 98K
pos_embed                  Embedding(52, 256)               13K
init_state                 (1, 256, 16, 16, 16)             1.05M
trans1.conv1 (dilated)     Conv3d(256→512, k=3³)            3.5M
trans1.conv2 (dilated)     Conv3d(512→256, k=3³)            3.5M
trans2.dw_conv (dilated)   Conv3d(256→256, k=3³, groups)    6.9K
trans2.pw_conv             Conv3d(256→256, k=1)             65K
gate_conv                  Conv3d(256→256, k=1)             65K
norm (GroupNorm)           32 groups, 256 ch                512
out_proj                   256→512→30006                    15.6M
──────────────────────────────────────────────────────────────────
TOTAL                                                       ~35.4M
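The parameter counts in the table can be roughly reproduced from the listed shapes. The sketch below makes several assumptions that the table does not state: every Conv3d and Linear layer has a bias, the depthwise convolution uses `groups=256`, and `out_proj` is two Linear layers (256→512 and 512→30006). The helper names are hypothetical.

```python
def conv3d_params(c_in, c_out, k, groups=1, bias=True):
    """Weights plus optional bias for a Conv3d with a k x k x k kernel."""
    weights = (c_in // groups) * c_out * k ** 3
    return weights + (c_out if bias else 0)

def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias for a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

counts = {
    "word_embed": 30006 * 384,
    "embed_proj": linear_params(384, 256),
    "pos_embed": 52 * 256,
    "init_state": 256 * 16 ** 3,
    "trans1.conv1": conv3d_params(256, 512, 3),
    "trans1.conv2": conv3d_params(512, 256, 3),
    "trans2.dw_conv": conv3d_params(256, 256, 3, groups=256),  # assumed depthwise
    "trans2.pw_conv": conv3d_params(256, 256, 1),
    "gate_conv": conv3d_params(256, 256, 1),
    "norm": 2 * 256,  # GroupNorm scale and shift per channel
    "out_proj": linear_params(256, 512) + linear_params(512, 30006),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:16s}{n:>12,d}")
print(f"{'TOTAL':16s}{total:>12,d}")
```

Under these assumptions the total rounds to 35.4M, matching the table. The `out_proj` estimate comes out near 15.5M rather than the listed 15.6M, so the real layer may differ slightly from the two-Linear assumption.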
Training Details
- Base model: Continued from v4 (dilated Conv3d + 30K vocab)
- Datasets: 10 in total, including WikiText-103, TinyStories, BookCorpus, IMDB, ROCStories, CNN/DailyMail, TriviaQA, Natural Questions, and ELI5
- Schedule: 3 phases: 8 epochs × 750K (aggressive) + 5 epochs × 500K (consolidation) + 3 epochs × 350K (refinement)
- Learning rates: 5e-4→3e-4 | 2e-4→1e-4 | 8e-5→3e-5
- Hardware: NVIDIA B200, 178GB VRAM peak
- Training time: ~10.7 hours
- Loss: Cross-entropy on 30K word vocabulary, multi-step loss (steps 7-16)
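The three-phase schedule can be written down as a small table plus a lookup function. This is a sketch under two assumptions that the list above does not confirm: the "750K / 500K / 350K" figures are read as samples per epoch, and the learning rate is assumed to decay linearly within each phase. `PHASES` and `lr_for_epoch` are hypothetical names.

```python
PHASES = [
    # (epochs, samples_per_epoch, lr_start, lr_end)
    (8, 750_000, 5e-4, 3e-4),  # aggressive
    (5, 500_000, 2e-4, 1e-4),  # consolidation
    (3, 350_000, 8e-5, 3e-5),  # refinement
]

def lr_for_epoch(epoch):
    """Learning rate for a 0-based global epoch, assuming linear decay inside each phase."""
    phase_start = 0
    for n_epochs, _, lr_start, lr_end in PHASES:
        if epoch < phase_start + n_epochs:
            frac = (epoch - phase_start) / max(n_epochs - 1, 1)
            return lr_start + frac * (lr_end - lr_start)
        phase_start += n_epochs
    raise ValueError("epoch is beyond the end of the schedule")

total_epochs = sum(p[0] for p in PHASES)          # 16
total_samples = sum(p[0] * p[1] for p in PHASES)  # 9,550,000 under the assumption above

for epoch in range(total_epochs):
    print(epoch, f"{lr_for_epoch(epoch):.2e}")
```

The epoch total of 16 agrees with the training row in Key Results.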
Project History
This model is the result of an extensive research journey:
| Phase | What | Result |
|---|---|---|
| 1 | Arithmetic (8³ grid, 499K params) | 98.4% on unseen data |
| 2A | 15 semantic relations | 98.2% test, 87.5% generalization |
| 2B | 100 semantic relations | 73.4% test (85.5% without "similar") |
| 2B-v3 | 184 relations (grammar + semantics) | 93.5% overall |
| 3B | Q&A from relations | 85% direct, 75% novel |
| 3C | Transitive reasoning | 52.5% holdout, 83.3% novel chains |
| 4 | Language as arithmetic | 50.2% char accuracy, grammar emerges |
| 4B | Multi-step loss | 55.4% char accuracy |
| 4C-v1 to v4 | Word embeddings, dilated conv, 30K vocab | Incremental improvements |
| 4C-v5 | Synaptic fatigue + intensive training | 10.7% eval, 6+ word coherence |
113 documented discoveries across all phases.
Why this matters
Transformers dominate NLP, but they have fundamental constraints:
- O(n²) attention complexity
- Fixed depth (always N layers regardless of problem difficulty)
- No spatial locality between neurons
- Large parameter counts (billions in current state-of-the-art models)
NCA 3D Brain shows that local communication + iterative propagation can produce language-like behavior with:
- O(n) cost per step in the number of cells (each cell only sees neighbors)
- Variable thinking depth (more steps = more reasoning)
- Spatial structure with emergent functional zones
- Orders of magnitude fewer parameters
This is early-stage research. The model doesn't compete with Transformers on quality, but it demonstrates that a fundamentally different computational paradigm can learn language structure.
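The linear versus quadratic contrast above can be made concrete by counting cell-to-cell reads. This is a rough sketch with hypothetical helper names: it counts one read per neighbor and ignores channel mixing, so it is not a FLOP estimate, and in a Transformer the n in O(n²) is the sequence length rather than a cell count.

```python
def local_interactions(n_cells, n_steps, taps=27):
    """Cell-to-cell reads when every cell looks at a 3x3x3 neighborhood each step."""
    return n_cells * taps * n_steps

def all_pairs_interactions(n_units):
    """Reads when every unit attends to every unit once."""
    return n_units * n_units

cells = 16 ** 3
print(local_interactions(cells, n_steps=1))   # 110592
print(local_interactions(cells, n_steps=15))  # 1658880
print(all_pairs_interactions(cells))          # 16777216
# Doubling the number of cells doubles the local count but quadruples the all-pairs count.
```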
Limitations
- Accuracy is low compared to any Transformer (10.7% next-word prediction on 30K vocab)
- Autoregressive generation accumulates errors: quality degrades after 6-8 words
- Embeddings are partially disorganized (Zipf's law: rare words get few updates)
- No extrapolation to longer contexts than trained on
- CPU inference only (no optimized CUDA kernels)
Citation
@misc{quintela2026nca3d,
title={NCA 3D Brain: Neural Cellular Automata for Language Processing},
author={Cristian Quintela},
year={2026},
url={https://huggingface.co/killking69/nca3d-brain-v5}
}
Author
Cristian Quintela
License
MIT
Evaluation results
- Eval word accuracy (30K vocab) on multi-dataset (TinyStories, WikiText, BookCorpus, IMDB, Q&A): 10.7% (self-reported)
- Train word accuracy (30K vocab) on multi-dataset (TinyStories, WikiText, BookCorpus, IMDB, Q&A): 16.9% (self-reported)