RRT-355M — softmax-free attention at GPT-2 Medium scale
Headline result: a GPT-2 Medium–shaped checkpoint (~354 M parameters) trained from scratch without softmax and evaluated on a standardized 22-task in-context learning benchmark. The kernels are open, and sparse inference is bit-identical to dense inference on this checkpoint.
This Hugging Face repo ships weights, config, and substrate constants only. Inference requires the RRT engine on GitHub (RRT-LLM-FOUNDATION, AGPL-3.0). Stock transformers GPT-2 will produce incorrect outputs.
Training is complete. No additional checkpoints are planned from this repository.
Capability evaluation (22-task CORE)
| Model | CORE | Notes |
|---|---|---|
| GPT-2 124M | 0.1211 | floor reference, same harness |
| GPT-2 medium | 0.1770 | dense softmax foil, matched scale |
| RRT-355M | 0.1558 | softmax-free, this checkpoint |
| Pythia 410M | 0.1895 | modern baseline, same harness |
CORE = mean centered accuracy across 22 in-context learning tasks (DCLM protocol, Karpathy nanochat eval_bundle). RRT-355M is 0.021 below the GPT-2 medium foil and 0.035 above the GPT-2 124M floor — a measurable tradeoff, not a capability collapse.
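To make the metric concrete, the sketch below assumes the usual DCLM-style centering, in which each task's raw accuracy is rescaled so that random guessing maps to 0 and a perfect score maps to 1, and the results are then averaged. The function names and the toy task numbers are illustrative and are not taken from the eval harness; only the three CORE values come from the table above.

```python
def centered_accuracy(accuracy: float, random_baseline: float) -> float:
    """Rescale raw accuracy so chance level maps to 0 and perfect maps to 1.

    Assumed form of DCLM-style centering; both arguments are fractions in [0, 1].
    """
    return (accuracy - random_baseline) / (1.0 - random_baseline)


def core_score(per_task: list[tuple[float, float]]) -> float:
    """Mean centered accuracy over (accuracy, random_baseline) pairs."""
    return sum(centered_accuracy(a, b) for a, b in per_task) / len(per_task)


# Toy illustration (made-up numbers): a 4-way multiple-choice task at 40 %
# accuracy and an open-ended task (chance level 0) at 10 % accuracy.
toy = core_score([(0.40, 0.25), (0.10, 0.0)])

# Gaps quoted in the text, recomputed from the table above.
core = {"gpt2-124m": 0.1211, "gpt2-medium": 0.1770, "rrt-355m": 0.1558}
gap_to_foil = core["gpt2-medium"] - core["rrt-355m"]   # about 0.021
gap_to_floor = core["rrt-355m"] - core["gpt2-124m"]    # about 0.035
print(f"toy CORE={toy:.3f}  below foil={gap_to_foil:.3f}  above floor={gap_to_floor:.3f}")
```

Because every task is centered before averaging, a multiple-choice task near chance contributes roughly zero instead of inflating the mean.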
Task asymmetry (RRT − GPT-2 medium, centered score): gains on multiple-choice reasoning (arc_easy +0.12, agi_eval_lsat_ar +0.09, openbook_qa +0.07); largest regressions on continuation tasks (lambada_openai −0.16, coqa −0.13, squad −0.07).
Not evaluated: MMLU, GSM8K, HumanEval, chat/instruction benchmarks, or fine-tuned downstream tasks. Details: eval/eval_summary.json in this repo; full write-up in docs/EVALUATION.md on GitHub.
Mechanism and training
| Metric | Value | Notes |
|---|---|---|
| Structural edge sparsity | 99.66 % | fidelity gate; training measurement |
| Training data | FineWeb-Edu | 11.534 B tokens, 4× H100, 22k iters |
| Best val loss (ckpt) | 2.8001 | iteration 21 000 |
| Weight file | ~1011 MB bf16 | model.safetensors |
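As a sanity check on the training-data row, the token count and iteration count are consistent with a global batch of 2^19 tokens per iteration. The batch size is not stated on this card, so treat it as an inference from the arithmetic and not as a documented setting.

```python
TOTAL_TOKENS = 11.534e9   # from the table above
ITERATIONS = 22_000       # "22k iters"

tokens_per_iter = TOTAL_TOKENS / ITERATIONS
assumed_batch = 2 ** 19   # 524,288 tokens; an assumption, not a documented value

print(f"tokens per iteration ~ {tokens_per_iter:,.0f}")
print(f"22k x 2^19 = {ITERATIONS * assumed_batch:,} tokens")
```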
Three metrics — do not conflate: (1) structural sparsity during training, (2) coarse-tile skip at inference (34–55%, long context), (3) CORE behavioral score above.
Each attention edge applies friction ln(max(i−j, 1)) and gate μ = η / (1 + η^n)^(1/n) with n = 1.25. INT8 pre-pass skips tiles with no active edges; bit-identical to dense on this checkpoint. v2 kernel: 21/22 CORE tasks identical to v1 (Δ CORE −0.0016).
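The two formulas above can be evaluated directly. The following is a minimal sketch of just those expressions; it is not the engine's kernel. This card does not say how η is derived from the attention scores or how friction enters it, so the sketch treats η as a given non-negative input.

```python
import math


def friction(i: int, j: int) -> float:
    """Distance friction ln(max(i - j, 1)) for query position i and key position j."""
    return math.log(max(i - j, 1))


def gate(eta: float, n: float = 1.25) -> float:
    """Saturating gate mu = eta / (1 + eta**n) ** (1 / n), for eta >= 0."""
    return eta / (1.0 + eta ** n) ** (1.0 / n)


# Friction is zero for the diagonal and the immediately preceding token,
# then grows logarithmically with distance.
for dist in (0, 1, 2, 16, 1024):
    print(f"i-j={dist:5d}  friction={friction(dist, 0):.3f}")

# The gate is roughly linear for small eta and saturates toward 1 for large eta.
for eta in (0.0, 0.1, 1.0, 10.0, 1000.0):
    print(f"eta={eta:8.1f}  mu={gate(eta):.4f}")
```

Since μ is exactly 0 when η is 0, an edge with no drive contributes nothing, which is consistent with the high structural sparsity and with skipping tiles that have no active edges.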
Systems notes (secondary)
| Metric | Value | Caveat |
|---|---|---|
| INT8 tile skip @ T=2048 / 8192 | 34% / 55% | layer-12 micro-bench, H100 |
| Kernel vs SDPA @ T=2048 | 11.5× | not end-to-end generation |
| Peak attention VRAM @ T=16384 | 5.5 GB | GPT-2 XL reference forward, RTX 3070 |
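The tile-skip percentages count attention tiles that contain no active edge. The sketch below illustrates that bookkeeping on a toy boolean mask in plain Python. It is an assumption-laden illustration: the real pre-pass runs in INT8 inside the Triton engine, and its tile size, layout, and activity criterion are not described on this card. `tile_skip_fraction` and the sliding-window mask are hypothetical names and data.

```python
def tile_skip_fraction(active: list[list[bool]], tile: int) -> float:
    """Fraction of causal tiles that contain no active edge and can be skipped.

    `active[i][j]` marks edge (query i, key j) as active. Only tiles that
    intersect the causal lower triangle (some j <= i) are counted.
    """
    T = len(active)
    total = skipped = 0
    for r0 in range(0, T, tile):
        for c0 in range(0, T, tile):
            rows = range(r0, min(r0 + tile, T))
            cols = range(c0, min(c0 + tile, T))
            if c0 > rows[-1]:
                continue  # tile lies entirely above the diagonal
            total += 1
            if not any(active[i][j] for i in rows for j in cols if j <= i):
                skipped += 1
    return skipped / total


# Toy mask: each query attends only to itself and the previous 3 keys.
T, TILE = 16, 4
mask = [[0 <= i - j <= 3 for j in range(T)] for i in range(T)]
print(f"skippable tiles: {tile_skip_fraction(mask, TILE):.0%}")
```

On this toy mask, 3 of the 10 causal tiles are empty. Coarse-tile skip is therefore much lower than edge-level sparsity, which is why the two figures should not be conflated.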
Files in this repo
| File | Purpose |
|---|---|
| model.safetensors | bf16 weights |
| config.json | architecture metadata |
| rrt_substrate_constants.json | inference requires n_backbone, C_max only |
| eval/ | CORE summary JSON, comparison CSV, parity notes |
| figures/ | key charts from benchmark report |
| tokenizer_pointer.txt | openai-community/gpt2 BPE |
Reproduce
git clone https://github.com/tripstoph/RRT-LLM-FOUNDATION.git
cd RRT-LLM-FOUNDATION
pip install -e .
python eval/run_core_eval.py --model rrt:_state/ckpt.pt --snapshot-dir engine --seed 1337
# Quick smoke (~minutes): python eval/smoke_core.py --model rrt:_state/ckpt.pt --snapshot-dir engine
Expected full CORE: 0.1558. Claims ↔ evidence: GitHub docs/CLAIMS.md.
Scope
RRT-355M validates the attention mechanism in isolation. Broader pipeline work is explored separately under Relational Autopoietic Substrate (RAS); no timeline or additional model releases are committed from this repository.
Limitations
- Custom Triton engine (Hopper sm_90); not AutoModelForCausalLM
- CORE below dense GPT-2 medium at matched scale
- Single checkpoint; no scale-up from this repo
- Speed/memory figures are kernel benchmarks with stated context
Citation
@misc{rrt-355m-2026,
author = {Tripstoph},
title = {RRT-355M: Softmax-free attention at GPT-2 Medium scale},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/Tripstoph/RRT-Foundation}},
note = {Proof-of-mechanism weights; engine at GitHub under AGPL-3.0.},
}
Last updated: 2026-06-21

