TemporalMesh Transformer: 29.4 PPL at 48% compute - dynamic graph attention + adaptive exit gates (open-source, 226 tests)

#44
by vigneshwar234 - opened

TemporalMesh Transformer (TMT): a new efficient transformer architecture

Sharing TMT, an open-source PyTorch transformer that jointly addresses three problems that no other architecture tackles together.

Three core problems → five innovations:

  • 🕸 Mesh Attention: kNN graph rebuilt per layer from cosine similarity → O(S·k) vs O(S²)
  • ⏱ Temporal Decay: learned multiplicative attenuation applied post-softmax (not additive like ALiBi)
  • ⚡ Adaptive Depth Routing: per-token exit gate; punctuation exits at layer 2, rare words at layer 12
  • 🔀 Dual-Stream FFN: parallel syntax and semantic streams, fused with a sigmoid gate
  • 🧠 EMA Memory Anchors: 16 persistent fast-weight vectors for cross-sequence recall
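
To make the first two bullets concrete, here is a minimal sketch of kNN attention with a post-softmax multiplicative decay. It is not the repository's code. It is written in plain Python so it runs without dependencies, and it makes several assumptions that the post does not confirm: the function names are hypothetical, token vectors double as queries, keys, and values, the decay is `exp(-decay_rate * |i - j|)` with a fixed scalar standing in for the learned parameter, and the weights are not renormalized after decay. The neighbour search here is brute force, so only the attention scores themselves are O(S·k).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def knn_neighbors(x, k):
    """For each token, the indices of its k most cosine-similar tokens (self included)."""
    S = len(x)
    neighbors = []
    for i in range(S):
        # Brute-force ranking: this step alone costs O(S) similarities per token.
        ranked = sorted(range(S), key=lambda j: cosine(x[i], x[j]), reverse=True)
        neighbors.append(ranked[:k])
    return neighbors

def mesh_attention(x, k, decay_rate):
    """Attention restricted to kNN neighbours, attenuated after the softmax.

    x: list of S token vectors (used as queries, keys, and values for brevity).
    decay_rate: hypothetical scalar standing in for a learned decay parameter.
    """
    d = len(x[0])
    out = []
    for i, nbrs in enumerate(knn_neighbors(x, k)):
        # Scaled dot-product scores over k neighbours only: S*k scores in total.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in nbrs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Multiplicative attenuation after softmax; ALiBi instead adds a bias before it.
        weights = [w * math.exp(-decay_rate * abs(i - j)) for w, j in zip(weights, nbrs)]
        out.append([sum(w * x[j][c] for w, j in zip(weights, nbrs)) for c in range(d)])
    return out

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_neighbors(tokens, k=2)
mixed = mesh_attention(tokens, k=2, decay_rate=0.5)
```

With `decay_rate=0.0` each output row is an ordinary convex combination of its neighbours. Any positive rate shrinks the contribution of distant neighbours without renormalizing, which is where a multiplicative scheme differs from an additive pre-softmax bias.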

Results (120M params, WikiText-2):

| Model | PPL ↓ | Compute |
|---|---|---|
| Vanilla Transformer | 42.1 | 100% |
| Longformer | 39.6 | 62% |
| RWKV | 33.1 | 50% |
| Mamba | 31.8 | 55% |
| Full TMT | 29.4 | 48% |

Superadditive effect: the full model improves on the vanilla baseline by 12.7 PPL (42.1 → 29.4), versus 8.6 PPL from summing the gains of the individual components.
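
The arithmetic behind that claim can be checked with the figures quoted in this post. The snippet below is only a sanity check: the variable names are illustrative, and the per-component gains are not listed here, so their sum (8.6) is taken as stated.

```python
baseline_ppl = 42.1            # Vanilla Transformer, from the results above
full_tmt_ppl = 29.4            # Full TMT, from the results above
summed_component_gain = 8.6    # sum of individual component gains, as stated in the post

# Gain of the full model over the baseline.
combined_gain = round(baseline_ppl - full_tmt_ppl, 1)

# Portion of the gain not explained by adding up the components one by one.
synergy = round(combined_gain - summed_component_gain, 1)

print(f"combined gain: {combined_gain} PPL, synergy: {synergy} PPL")
```

Under these numbers, 4.1 PPL of the improvement comes from the components interacting rather than from any single one of them.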

📄 Paper: https://zenodo.org/records/20287390
💻 Code + 226 tests: https://github.com/vignesh2027/TemporalMesh-Transformer
🎮 Live demo: https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
🤗 Model: https://huggingface.co/vigneshwar234/TemporalMesh-Transformer
