TemporalMesh Transformer: 29.4 PPL at 48% compute — dynamic graph attention + adaptive exit gates (open-source, 226 tests)

#39
by vigneshwar234 - opened

TemporalMesh Transformer (TMT) — New efficient transformer architecture

Sharing TMT, an open-source PyTorch transformer that jointly addresses three problems which, as far as I know, no other single architecture tackles together.

Three core problems → five innovations:

  • 🕸 Mesh Attention: a kNN graph rebuilt at every layer from cosine similarity, so attention costs O(S·k) instead of O(S²) (S = sequence length, k = neighbors per token)
  • Temporal Decay: learned multiplicative attenuation applied after the softmax (not an additive pre-softmax bias like ALiBi)
  • Adaptive Depth Routing: a per-token exit gate; for example, punctuation exits at layer 2 and rare words at layer 12
  • 🔀 Dual-Stream FFN: parallel syntax and semantic streams, combined by sigmoid fusion
  • 🧠 EMA Memory Anchors: 16 persistent fast-weight vectors for cross-sequence recall
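
To make the first two bullets concrete, here is a minimal, dependency-free sketch of attention restricted to a kNN graph, with a multiplicative decay applied after the softmax. It is not the repository's PyTorch code. The function names (`knn_graph`, `mesh_attention`), the fixed scalar `decay_rate` standing in for a learned parameter, the renormalization after decay, and the use of the raw token vectors as queries, keys, and values are all assumptions made for illustration. Causal masking is left out for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_graph(x, k):
    """For each token, list the indices of its k most cosine-similar tokens.

    A token is always maximally similar to itself, so it normally appears
    among its own neighbors. This brute-force search is itself O(S^2); it
    only illustrates which edges the sparse attention step would use.
    """
    neighbors = []
    for xi in x:
        ranked = sorted(range(len(x)), key=lambda j: cosine(xi, x[j]), reverse=True)
        neighbors.append(ranked[:k])
    return neighbors

def mesh_attention(x, k, decay_rate):
    """Sparse attention over the kNN graph with post-softmax multiplicative decay.

    Returns (outputs, weights), where weights[i] maps neighbor index -> weight.
    """
    d = len(x[0])
    outputs, weights = [], []
    for i, nbrs in enumerate(knn_graph(x, k)):
        # Scaled dot-product scores for the k neighbors only: S*k scores overall.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Decay multiplies the probabilities after the softmax, in contrast to
        # an additive bias on the scores. Renormalizing is an assumption here.
        decayed = [p * math.exp(-decay_rate * abs(i - j)) for p, j in zip(probs, nbrs)]
        z = sum(decayed)
        w = [v / z for v in decayed]
        outputs.append([sum(wj * x[j][c] for wj, j in zip(w, nbrs)) for c in range(d)])
        weights.append(dict(zip(nbrs, w)))
    return outputs, weights

# Toy input: tokens 0 and 1 point one way, tokens 2 and 3 another.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
outputs, weights = mesh_attention(tokens, k=2, decay_rate=0.5)
```

With `decay_rate=0.0` the weights reduce to a plain softmax over the neighbors; raising it shifts weight toward positionally closer neighbors.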

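The per-token exit gate can be sketched in the same spirit. The version below is a simplified illustration, not the repository's implementation: `route_depth`, the linear-plus-sigmoid gate, the `threshold` value, and the treatment of each layer as a per-token function (ignoring attention between tokens) are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def route_depth(hidden, layers, gate_weights, gate_bias, threshold=0.5):
    """Run each token through the layers until its exit gate fires.

    Returns (final_states, exit_layers). A token that never passes the gate
    exits at the last layer.
    """
    states = [list(h) for h in hidden]
    exit_layers = [None] * len(states)
    for depth, layer in enumerate(layers, start=1):
        for t, h in enumerate(states):
            if exit_layers[t] is not None:
                continue  # this token already exited, so no compute is spent on it
            h = layer(h)
            states[t] = h
            confidence = sigmoid(sum(w * v for w, v in zip(gate_weights, h)) + gate_bias)
            if confidence >= threshold or depth == len(layers):
                exit_layers[t] = depth
    return states, exit_layers

# Hypothetical toy "layer" that doubles every feature.
double = lambda h: [2.0 * v for v in h]
states, exits = route_depth([[3.0], [0.25]], [double] * 3, gate_weights=[1.0], gate_bias=-4.0)

# Fraction of layer applications actually performed, relative to full depth.
compute_fraction = sum(exits) / (len(exits) * 3)
```

In this toy run the first token exits after one layer and the second runs all three, so the fraction of layer applications performed falls below 1. That is the kind of saving early exit is meant to deliver.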
Results (120M params, WikiText-2):

| Model               | PPL ↓ | Compute |
|---------------------|-------|---------|
| Vanilla Transformer | 42.1  | 100%    |
| Longformer          | 39.6  | 62%     |
| RWKV                | 33.1  | 50%     |
| Mamba               | 31.8  | 55%     |
| Full TMT            | 29.4  | 48%     |

Superadditive effect: the full model improves on the vanilla baseline by 12.7 PPL (42.1 → 29.4), versus 8.6 PPL from summing the gains of each component added individually.
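
The arithmetic behind that claim can be checked directly from the numbers in the post. The variable names below are illustrative only; the per-component gains are not listed here, so only their reported sum (8.6) is used.

```python
baseline_ppl = 42.1          # Vanilla Transformer, from the results table
full_tmt_ppl = 29.4          # Full TMT, from the results table
sum_individual_gains = 8.6   # reported sum of per-component gains

# Gain of the full model over the baseline.
combined_gain = round(baseline_ppl - full_tmt_ppl, 1)

# Extra improvement beyond what the components contribute separately.
surplus = round(combined_gain - sum_individual_gains, 1)

print(f"combined gain: {combined_gain} PPL, surplus over additive: {surplus} PPL")
```

A positive surplus is what "superadditive" means here: the components together help more than the sum of their separate contributions.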

📄 Paper: https://zenodo.org/records/20287390
💻 Code + 226 tests: https://github.com/vignesh2027/TemporalMesh-Transformer
🎮 Live demo: https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
🤗 Model: https://huggingface.co/vigneshwar234/TemporalMesh-Transformer
