TemporalMesh Transformer: 29.4 PPL at 48% compute — dynamic graph attention + adaptive exit gates (open-source, 226 tests)

#39

by vigneshwar234 - opened 2 days ago

TemporalMesh Transformer (TMT) — New efficient transformer architecture

Sharing TMT, an open-source PyTorch transformer that jointly tackles three problems which, to my knowledge, no other single architecture addresses together.

Three core problems → five innovations:

🕸 Mesh Attention: kNN graph rebuilt per layer from cosine similarity → O(S·k) attention vs O(S²), where S is the sequence length and k the number of neighbors per token
⏱ Temporal Decay: learned multiplicative attenuation applied after the softmax (not an additive pre-softmax bias like ALiBi)
⚡ Adaptive Depth Routing: per-token exit gate; e.g. punctuation exits at layer 2, rare words at layer 12
🔀 Dual-Stream FFN: syntax + semantic parallel streams, sigmoid fusion
🧠 EMA Memory Anchors: 16 persistent fast-weight vectors, cross-sequence recall
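
To make the first three items concrete, here is a minimal single-head sketch of kNN mesh attention with post-softmax decay, plus a per-token exit gate. This is not the code from the TMT repo: the function names, the exponential form of the decay, the renormalization after decay, and the 0.5 gate threshold are all my assumptions for illustration, and causal masking is left out.

```python
import torch
import torch.nn.functional as F


def knn_mesh_attention(q, k, v, x, num_neighbors, decay_rate):
    """Attend only to each token's nearest neighbors, then decay by distance.

    q, k, v, x: (S, D) tensors for one sequence and one head.
    num_neighbors: the k in O(S*k).
    decay_rate: non-negative scalar tensor (assumed parameterization).
    """
    S, D = x.shape

    # Build the graph from cosine similarity of the layer input.
    # Note: this dense (S, S) matrix is itself O(S^2); only the attention
    # step below is O(S*k). A real implementation would need a cheaper search.
    x_norm = F.normalize(x, dim=-1)
    sim = x_norm @ x_norm.T
    idx = sim.topk(num_neighbors, dim=-1).indices        # (S, k)

    # Gather neighbor keys/values and score only those k entries per token.
    k_nb, v_nb = k[idx], v[idx]                          # (S, k, D) each
    scores = torch.einsum("sd,skd->sk", q, k_nb) / D ** 0.5
    weights = scores.softmax(dim=-1)

    # Temporal decay: multiply the softmax output by exp(-rate * |i - j|).
    # The exponential form is an assumption; the post only says
    # "multiplicative" and "post-softmax".
    pos = torch.arange(S, device=x.device)
    dist = (pos[:, None] - idx).abs().float()            # (S, k)
    weights = weights * torch.exp(-decay_rate * dist)

    # Renormalizing so each row sums to 1 again is also an assumption.
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return torch.einsum("sk,skd->sd", weights, v_nb), idx


def exit_gate(h, gate_weight, gate_bias, threshold=0.5):
    """Return a boolean mask, True where a token would stop at this layer."""
    p_exit = torch.sigmoid(h @ gate_weight + gate_bias)  # (S,)
    return p_exit > threshold


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(8, 16)
    out, idx = knn_mesh_attention(x, x, x, x, num_neighbors=3,
                                  decay_rate=torch.tensor(0.1))
    done = exit_gate(out, torch.zeros(16), torch.tensor(0.0))
    print(out.shape, idx.shape, done.shape)
```

Tokens flagged by `exit_gate` would skip the remaining layers, which is where the compute saving in adaptive depth routing would come from.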

Results (120M params, WikiText-2):

Superadditive effect: the combined model gains 12.7 PPL, versus 8.6 PPL from summing the gains of the individual components.
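
For anyone who wants the size of the interaction spelled out, the arithmetic on the two numbers above looks like this. The variable names are mine; the 12.7 and 8.6 figures come straight from the post.

```python
combined_gain = 12.7  # PPL gain with all components together (from the post)
summed_gain = 8.6     # PPL gain from adding up the individual ablations (from the post)

excess = combined_gain - summed_gain   # gain not explained by any single component
ratio = combined_gain / summed_gain    # combined gain relative to the additive sum

print(f"excess: {excess:.1f} PPL, ratio: {ratio:.2f}x")
# prints "excess: 4.1 PPL, ratio: 1.48x"
```

So about 4.1 PPL of the improvement only shows up when the components are used together.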
