NCA 3D Brain — Neural Cellular Automata for Language

A radically different neural architecture: a 3D grid of "mini-neurons" that learns language through local wave propagation, inspired by protein folding and biological neural communication.

[Figure: wave propagation through the 3D brain. Information propagates as waves through the 3D grid, from input (z=0) to output (z=15).]

What is this?

A cube of 4,096 "mini-neurons" (16×16×16 grid) where each cell only communicates with its immediate neighbors. Information travels as waves through the cube. No attention mechanism (unlike Transformers), no sequential layers — it's a mathematical organism where language understanding emerges from local signal propagation.

35.4M parameters. Generates coherent English phrases of 6+ words.

This is not a Transformer. This is not an RNN. This is a 3D cellular automaton that learned to speak.

Key Results

Metric	Value
Eval word accuracy	10.7% (over 30K vocabulary)
Train word accuracy	16.9%
Train/eval gap	1.6x (best generalization of all versions)
Longest coherent output	"she started to play together again" (6 words)
Parameters	35.4M
Model size	68 MB
Grid	16×16×16 = 4,096 cells
Cell dimension	256
Training	16 epochs, ~10.7h on NVIDIA B200

Grammar emergence

The model learned grammatical categories without explicit supervision:

"the" + _        → nouns/adjectives (correct category)
"the big" + _    → nouns ("dog", "house", "girl")
"the dog" + _    → verbs ("was", "ran", "had")
"she wanted" + _ → "to" (infinitive structure)

Best generations

"she started to play together again"
"the little girl wanted to play with her parents"
"he said that he was very happy"
"in the morning she went to the garden"

Architecture

INPUT (face z=0)          THINKING (interior)         OUTPUT (face z=15)
┌─────────────┐          ┌─────────────┐             ┌─────────────┐
│ Tokens are  │   →→→    │ 3D waves    │    →→→      │ Prediction  │
│ injected    │  waves   │ propagate   │   waves     │ is read     │
│ into skin   │          │ N steps     │             │ from pole   │
└─────────────┘          └─────────────┘             └─────────────┘

How it works

Injection: Input tokens are embedded and injected into the z=0 face of the cube
Propagation: For N steps, each cell updates based on its 26 neighbors via Conv3d
Reading: The opposite face (z=15) is average-pooled and projected to vocabulary logits
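
The three stages above can be illustrated with a toy sketch in plain Python. This is not the model's code: cells hold a single scalar instead of a 256-dimensional state, the update rule is a fixed neighbourhood average rather than a learned Conv3d, and the names `make_grid`, `step`, and `run` are made up for this illustration. It only shows how a signal injected at z=0 needs a minimum number of local steps before it can be read at the opposite face.

```python
from itertools import product

def make_grid(size, value=0.0):
    """Create a size x size x size grid of scalar cell states, indexed grid[x][y][z]."""
    return [[[value] * size for _ in range(size)] for _ in range(size)]

def step(grid):
    """One local update: each cell becomes the mean of its 3x3x3 neighbourhood.

    Out-of-range neighbours count as zero, like zero padding in a convolution.
    """
    size = len(grid)
    new = make_grid(size)
    offsets = list(product((-1, 0, 1), repeat=3))  # the cell itself + 26 neighbours
    for x, y, z in product(range(size), repeat=3):
        total = 0.0
        for dx, dy, dz in offsets:
            nx, ny, nz = x + dx, y + dy, z + dz
            if 0 <= nx < size and 0 <= ny < size and 0 <= nz < size:
                total += grid[nx][ny][nz]
        new[x][y][z] = total / len(offsets)
    return new

def run(size, n_steps, signal=1.0):
    """Inject a signal on the z=0 face, propagate n_steps, read the opposite face."""
    grid = make_grid(size)
    for x, y in product(range(size), repeat=2):
        grid[x][y][0] = signal                       # 1. injection
    for _ in range(n_steps):
        grid = step(grid)                            # 2. propagation
    face = [grid[x][y][size - 1] for x, y in product(range(size), repeat=2)]
    return sum(face) / len(face)                     # 3. reading (average pool)

print(run(size=5, n_steps=3))  # 0.0: the wave has not reached the far face yet
print(run(size=5, n_steps=4))  # small positive value: the wave has arrived
```

With undilated 3-wide neighbourhoods the wave advances one cell per step, so a grid of depth 16 would need at least 15 steps before the output face sees anything.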

Key innovations

Dilated convolutions with cycle [1, 2, 4, 8] — in 4 steps each cell "sees" the entire grid
Synaptic fatigue: 2 dedicated channels that inhibit over-firing cells, preventing repetition
Dual chemical pathways: Standard Conv3d + depthwise-separable Conv3d (diversity in transition rules)
Gated residual updates: Each cell decides how much to change per step via sigmoid gate
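
The claim that the dilation cycle [1, 2, 4, 8] lets each cell "see" the entire grid in 4 steps can be checked with a little arithmetic. The sketch below assumes one 3-wide dilated kernel per step, so it is a lower bound on the reach (the architecture table lists two dilated convolutions in `trans1`, which would extend it further). The function names are illustrative only.

```python
def receptive_radius(dilations):
    """Per-axis reach, in cells, after applying one 3-wide kernel per dilation.

    A 3-wide kernel with dilation d looks d cells to each side, and the reaches
    of stacked layers add up.
    """
    return sum(dilations)

def covers_grid(dilations, grid_size):
    """True if a cell at one edge can be influenced by a cell at the opposite edge."""
    return receptive_radius(dilations) >= grid_size - 1

cycle = [1, 2, 4, 8]
print(receptive_radius(cycle))        # 15
print(covers_grid(cycle, 16))         # True
print(covers_grid([1, 1, 1, 1], 16))  # False
```

A radius of 15 is exactly the distance between opposite faces of a 16-cell axis, which is why four steps suffice here while four undilated steps would only reach 4 cells.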

Emergent phenomena

[Figure: 3D brain activation map, showing where "good" sentences activate more (red) vs "bad" sentences (blue).]

Functional hemispheres: x=12 region produces better language than x=6
3 thinking phases: chaos (steps 1-5) → eureka (6-7) → decision (8-15)
Grammar in center, semantics in periphery of the grid
Semantic clustering: animals, family, nature, objects form distinct spatial clusters
Emotion highway: emotional content activates a specific depth layer (z=12)

[Figure: grammar vs semantic channels. Channel specialization: grammar channels (red) vs semantic channels (blue); the model spontaneously separated syntax from meaning.]

Quick Start

import json
import torch
from model import NCA3D_Fatigue

# Load dictionary (word to integer id) and build the reverse mapping
with open("word_dictionary_30k.json", encoding="utf-8") as f:
    word2num = {k: int(v) for k, v in json.load(f).items()}
num2word = {v: k for k, v in word2num.items()}

# Load model
model = NCA3D_Fatigue()
model.load_state_dict(torch.load("model_phase4c_v5_fatigue_best.pt", map_location="cpu"))
model.eval()

# Predict next word
context = ["the", "little", "girl"]
# Map words to ids; a word missing from the dictionary raises KeyError
ids = [word2num[w] for w in context]
with torch.no_grad():
    logits = model(torch.tensor([ids]), n_steps=15)
    pred_id = logits.argmax(-1).item()
    print(f"'{' '.join(context)}' → '{num2word.get(pred_id, '?')}'")

# Generate a sequence
from inference import generate
print(generate(model, word2num, num2word, ["she", "wanted", "to"], max_words=8))
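
The `generate` helper from `inference` is not shown in this card. As a rough idea of what greedy autoregressive generation looks like, here is a minimal sketch. `greedy_generate` and `predict_next` are hypothetical names, not the repository's API, and treating `max_words` as the number of words appended is an assumption.

```python
from typing import Callable, List

def greedy_generate(predict_next: Callable[[List[str]], str],
                    context: List[str], max_words: int = 8) -> str:
    """Repeatedly append the single most likely next word to the context.

    predict_next is any callable that maps the words so far to one next word,
    for example a wrapper around the argmax prediction shown above.
    """
    words = list(context)              # copy so the caller's list is untouched
    for _ in range(max_words):
        words.append(predict_next(words))
    return " ".join(words)

# Stand-in predictor for demonstration only; a real one would call the model.
def toy_predictor(words: List[str]) -> str:
    table = {"to": "play", "play": "again"}
    return table.get(words[-1], "the")

print(greedy_generate(toy_predictor, ["she", "wanted", "to"], max_words=2))
# she wanted to play again
```

Because every predicted word is fed back in as context, one bad prediction affects all later ones, which matches the error accumulation noted under Limitations.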

Model Architecture Details

Component                    Shape                          Params
────────────────────────────────────────────────────────────────────
word_embed                   Embedding(30006, 384)          11.5M
embed_proj                   Linear(384, 256)               98K
pos_embed                    Embedding(52, 256)             13K
init_state                   (1, 256, 16, 16, 16)           1.05M
trans1.conv1 (dilated)       Conv3d(256→512, k=3³)          3.5M
trans1.conv2 (dilated)       Conv3d(512→256, k=3³)          3.5M
trans2.dw_conv (dilated)     Conv3d(256→256, k=3³, groups)  6.9K
trans2.pw_conv               Conv3d(256→256, k=1)           65K
gate_conv                    Conv3d(256→256, k=1)           65K
norm (GroupNorm)              32 groups, 256 ch              512
out_proj                     256→512→30006                  15.6M
────────────────────────────────────────────────────────────────────
TOTAL                                                       ~35.4M
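
The parameter counts in the table can be approximately reproduced from the listed shapes. The sketch below assumes that every convolution and linear layer has a bias term and that `out_proj` is two plain linear layers (256→512→30006). Neither detail is stated in the table, so individual rows may differ slightly, but the total lands at about 35.4M.

```python
def conv3d_params(c_in, c_out, k, groups=1, bias=True):
    """Weights (+ optional bias) of a Conv3d layer with a cubic k x k x k kernel."""
    return (c_in // groups) * c_out * k ** 3 + (c_out if bias else 0)

def linear_params(n_in, n_out, bias=True):
    """Weights (+ optional bias) of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

VOCAB, EMBED, DIM, HIDDEN, GRID = 30006, 384, 256, 512, 16

counts = {
    "word_embed": VOCAB * EMBED,
    "embed_proj": linear_params(EMBED, DIM),
    "pos_embed": 52 * DIM,
    "init_state": DIM * GRID ** 3,
    "trans1.conv1": conv3d_params(DIM, HIDDEN, 3),
    "trans1.conv2": conv3d_params(HIDDEN, DIM, 3),
    "trans2.dw_conv": conv3d_params(DIM, DIM, 3, groups=DIM),  # depthwise
    "trans2.pw_conv": conv3d_params(DIM, DIM, 1),              # pointwise
    "gate_conv": conv3d_params(DIM, DIM, 1),
    "norm": 2 * DIM,                                           # scale + shift
    "out_proj": linear_params(DIM, HIDDEN) + linear_params(HIDDEN, VOCAB),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:<16}{n:>12,}")
print(f"{'TOTAL':<16}{total:>12,}")  # 35,425,334 under the assumptions above
```

Roughly three quarters of the parameters sit in the word embedding and the output projection; the cellular-automaton core itself (the convolutions, gate, and norm) accounts for about 7.2M.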

Training Details

Base model: Continued from v4 (dilated Conv3d + 30K vocab)
Datasets: 10 in total, including WikiText-103, TinyStories, BookCorpus, IMDB, ROCStories, CNN/DailyMail, TriviaQA, Natural Questions, and ELI5
Schedule: 3 phases — 8ep×750K (aggressive) + 5ep×500K (consolidation) + 3ep×350K (refinement)
Learning rates: 5e-4→3e-4 | 2e-4→1e-4 | 8e-5→3e-5
Hardware: NVIDIA B200, 178GB VRAM peak
Training time: ~10.7 hours
Loss: Cross-entropy on 30K word vocabulary, multi-step loss (steps 7-16)
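
A multi-step loss of this kind can be sketched as the cross-entropy averaged over the readouts taken at several propagation steps. The version below is plain Python and makes two assumptions not stated in the card: the selected steps are weighted equally, and steps are numbered from 1. The function names are illustrative.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]

def multi_step_loss(logits_per_step, target, first_step=7, last_step=16):
    """Mean cross-entropy over readouts taken at steps first_step..last_step.

    logits_per_step[i] holds the logits read after propagation step i + 1.
    """
    chosen = logits_per_step[first_step - 1:last_step]
    return sum(cross_entropy(step_logits, target) for step_logits in chosen) / len(chosen)

# Example: 16 readouts over a 4-word vocabulary, all uniform.
readouts = [[0.0, 0.0, 0.0, 0.0] for _ in range(16)]
print(multi_step_loss(readouts, target=2))  # ln(4), about 1.386
```

Supervising a range of steps rather than only the last one pushes the model to produce a usable prediction at any of those depths, so the answer does not depend on stopping at one exact step count.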

Project History

This model is the result of an extensive research journey:

Phase	What	Result
1	Arithmetic (8³ grid, 499K params)	98.4% on unseen data
2A	15 semantic relations	98.2% test, 87.5% generalization
2B	100 semantic relations	73.4% test (85.5% without "similar")
2B-v3	184 relations (grammar + semantics)	93.5% overall
3B	Q&A from relations	85% direct, 75% novel
3C	Transitive reasoning	52.5% holdout, 83.3% novel chains
4	Language as arithmetic	50.2% char accuracy, grammar emerges
4B	Multi-step loss	55.4% char accuracy
4C-v1→v4	Word embeddings, dilated conv, 30K vocab	Incremental improvements
4C-v5	Synaptic fatigue + intensive training	10.7% eval, 6+ word coherence

113 documented discoveries across all phases.

Why this matters

Transformers dominate NLP, but they have fundamental constraints:

O(n²) attention complexity in the sequence length n
Fixed depth (always N layers regardless of problem difficulty)
No spatial locality between neurons
Billions of parameters in state-of-the-art models

NCA 3D Brain shows that local communication + iterative propagation can produce language-like behavior with:

O(n) cost per step in the number of cells n (each cell only sees neighbors)
Variable thinking depth (more steps = more reasoning)
Spatial structure with emergent functional zones
Orders of magnitude fewer parameters

This is early-stage research. The model doesn't compete with Transformers on quality — but it demonstrates that a fundamentally different computational paradigm can learn language structure.

Limitations

Accuracy is low compared to any Transformer (10.7% next-word prediction on 30K vocab)
Autoregressive generation accumulates errors — quality degrades after 6-8 words
Embeddings are partially disorganized (Zipf's law — rare words get few updates)
No extrapolation to longer contexts than trained on
CPU inference only (no optimized CUDA kernels)

Citation

@misc{quintela2026nca3d,
  title={NCA 3D Brain: Neural Cellular Automata for Language Processing},
  author={Cristian Quintela},
  year={2026},
  url={https://huggingface.co/killking69/nca3d-brain-v5}
}

Author

Cristian Quintela

License

MIT


Evaluation results

Eval word accuracy (30K vocab) on multi-dataset (TinyStories, WikiText, BookCorpus, IMDB, Q&A): 10.7% (self-reported)
Train word accuracy (30K vocab) on multi-dataset (TinyStories, WikiText, BookCorpus, IMDB, Q&A): 16.9% (self-reported)