Adaptive-Recurrent (ACT) 110M

A 110M-parameter recurrent transformer: instead of stacking N distinct layers, one shared block is applied up to 8 times per token, with a per-token halting head that decides how many steps each token gets (Universal Transformer, Dehghani et al. 2018; Adaptive Computation Time, Graves 2016). Transformer-XL segment memory provides long context; the model was trained out to 4096 tokens.
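The halting loop can be illustrated with a minimal sketch in the style of Graves' ACT. This is not the code in `act_model_v2.py`: `shared_block`, `halting_prob`, and the `EPSILON` value are hypothetical stand-ins, and the real model operates on tensors rather than Python lists.

```python
import math

MAX_STEPS = 8     # recurrence cap stated in this README
EPSILON = 0.01    # assumed halting margin (hypothetical value)

def shared_block(state):
    """Toy stand-in for the shared transformer block (hypothetical update)."""
    return [math.tanh(0.9 * x + 0.1) for x in state]

def halting_prob(state, bias):
    """Toy stand-in for the halting head: sigmoid of mean activation plus a bias."""
    z = sum(state) / len(state) + bias
    return 1.0 / (1.0 + math.exp(-z))

def act_recur(state, bias=0.0):
    """Recur until cumulative halting probability reaches 1 - EPSILON
    or MAX_STEPS is hit. Returns (weighted output, steps used)."""
    cumulative = 0.0
    output = [0.0] * len(state)
    for step in range(1, MAX_STEPS + 1):
        state = shared_block(state)
        p = halting_prob(state, bias)
        last = (step == MAX_STEPS) or (cumulative + p >= 1.0 - EPSILON)
        # The final step receives the remainder so the weights sum to 1.
        weight = 1.0 - cumulative if last else p
        output = [o + weight * s for o, s in zip(output, state)]
        cumulative += weight
        if last:
            return output, step

_, easy_steps = act_recur([0.5, -0.2], bias=10.0)    # confident head halts early
_, hard_steps = act_recur([0.5, -0.2], bias=-10.0)   # unsure head uses the full budget
print(easy_steps, hard_steps)
```

In this toy, a token whose halting head is confident stops after one step, while one that never accumulates enough halting probability runs all 8 steps.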

The recurrence reuses ~46M compute-active parameters up to 8×, so the model does the compute of a ~0.4B-parameter network while storing only 110M parameters.
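The figures above can be roughly reproduced from the dimensions listed under Architecture. The sketch below is back-of-the-envelope arithmetic under assumptions not confirmed by this README: four bias-free attention projections, three SwiGLU matrices of width 10240, and no accounting for norms, the halting head, or relative-position parameters.

```python
D_MODEL = 1280    # hidden dim (from the Architecture section)
D_FFN = 10240     # SwiGLU width (from the Architecture section)
VOCAB = 50257     # GPT-2 vocabulary (from the Architecture section)
MAX_STEPS = 8     # maximum recurrence depth

# Assumption: Q, K, V and output projections, each d x d, no biases.
attn = 4 * D_MODEL * D_MODEL
# Assumption: SwiGLU uses gate, up and down matrices, each d x ffn.
ffn = 3 * D_MODEL * D_FFN
block = attn + ffn

# Tied embedding: the same matrix serves as input embedding and output head.
embedding = VOCAB * D_MODEL

stored = block + embedding
effective = block * MAX_STEPS + embedding   # block applied 8x, plus the head

print(f"shared block : {block / 1e6:.1f}M")       # 45.9M
print(f"embedding    : {embedding / 1e6:.1f}M")   # 64.3M
print(f"stored       : {stored / 1e6:.1f}M")      # 110.2M
print(f"effective    : {effective / 1e6:.1f}M")   # 431.3M
```

Under these assumptions the shared block comes to about 46M parameters and the stored total to about 110M, matching the stated figures; the effective count at full depth lands near 0.43B.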

Files

| file | what |
|---|---|
| `base.safetensors` | pretrained base (text continuation, long context) |
| `sft.safetensors` | instruction-tuned (use this for chat) |
| `inference.py` | load + generate / interactive chat |
| `act_model_v2.py`, `train.py` | model definition |

Usage

pip install -r requirements.txt
python inference.py "What is the capital of France?"   # short sampled answer (output varies)
python inference.py                                     # interactive chat
MODEL=base python inference.py "Once upon a time"       # base model continuation

Runs on CPU or GPU (auto-detected). The GPT-2 tokenizer is bundled in tokenizer/ — nothing is downloaded. Generation is sampled (temperature 0.8), so answers vary between runs; at 110M the model is coherent but frequently wrong on facts (see Limitations).
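For readers unfamiliar with why sampled output varies, here is a minimal sketch of temperature sampling. The function name is hypothetical and this is not taken from `inference.py`; whether that script also applies top-k or nucleus filtering is not stated here.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract the max for stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)                          # seeded only to make the demo repeatable
toy_logits = [2.0, 1.0, 0.0]                    # hypothetical next-token scores
draws = [sample_with_temperature(toy_logits, 0.8, rng) for _ in range(10)]
print(draws)
```

Temperatures below 1 sharpen the distribution toward the highest-scoring token, and values near 0 approach greedy decoding, which would make runs repeatable at the cost of diversity.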

Architecture


| | |
|---|---|
| hidden dim | 1280 |
| heads / head-dim | 20 / 64 |
| layers | 1 shared block, recurred up to 8× (adaptive halting) |
| FFN | SwiGLU, 10240 |
| context | Transformer-XL memory, trained 512 → 4096 |
| embedding | tied GPT-2 (vocab 50257) |
| params | 110M stored (~46M compute-active block) · ~0.4B effective compute (46M × 8 steps + head) |
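The Transformer-XL memory listed under context can be pictured as a rolling cache: each segment attends to itself plus the most recent cached states from earlier segments. The sketch below is illustrative only. `run_with_memory` and the segment length are hypothetical, and plain token ids stand in for the cached hidden states a real implementation would keep.

```python
def run_with_memory(tokens, segment_len, mem_len):
    """Yield (segment, visible_context) pairs, carrying a rolling memory
    of at most mem_len entries from one segment to the next."""
    memory = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        context = memory + segment      # what this segment can attend to
        yield segment, context
        # Keep only the newest mem_len entries for the next segment.
        memory = context[-mem_len:] if mem_len > 0 else []

for segment, context in run_with_memory(list(range(10)), segment_len=4, mem_len=3):
    print(segment, "sees", context)
# [0, 1, 2, 3] sees [0, 1, 2, 3]
# [4, 5, 6, 7] sees [1, 2, 3, 4, 5, 6, 7]
# [8, 9] sees [5, 6, 7, 8, 9]
```

The visible context never exceeds `mem_len + segment_len` entries, which is why the reachable context grows only when the memory length does.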

Training

Pretrain: ~8.7B tokens, 13-corpus mix (FineWeb / code / books / arxiv / wiki / dialogue / pile), native FP8 on a single RTX PRO 6000 Blackwell. Long-context curriculum: trained at MEM 512 → 1024 → 2048 → 4096 (the usable context limit tracks the memory length the model was trained at).
SFT: 707M tokens (Dolly, Alpaca, OpenOrca, OpenHermes), prompt-masked.
DPO: preference optimization on UltraFeedback.
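Prompt masking in the SFT stage means the loss is computed only on response tokens. A minimal sketch of the idea follows; the helper names are hypothetical, the `-100` ignore label is the default `ignore_index` of PyTorch's cross-entropy loss and is assumed here rather than confirmed for `train.py`, and the usual one-position label shift for next-token prediction is omitted for brevity.

```python
IGNORE_INDEX = -100   # assumed label value meaning "no loss at this position"

def mask_prompt(prompt_ids, response_ids):
    """Build (input_ids, labels) where prompt positions carry no loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_mean_loss(token_losses, labels):
    """Average per-token losses over unmasked (response) positions only."""
    kept = [loss for loss, y in zip(token_losses, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

input_ids, labels = mask_prompt([11, 12, 13], [21, 22])   # hypothetical token ids
print(labels)                                             # [-100, -100, -100, 21, 22]
print(masked_mean_loss([9.0, 9.0, 9.0, 1.0, 3.0], labels))  # 2.0
```

The prompt tokens still appear in the input, so the model conditions on them, but their high toy losses do not contribute to the average.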

Results

FineWeb held-out PPL 25.4 for the uploaded long-context base, vs 35.8 for a 6-layer baseline (the best stage-1 / short-context checkpoint reached 24.76)
Per-domain macro PPL 15.79 (code 3.2 → long-form web 27)
Zero-shot macro-acc 0.41 (LAMBADA / HellaSwag / ARC-Easy / Winogrande)
The recurrence is adaptive: depth tracks token difficulty (code ~6.3 steps, hard prose ~8).
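As a reminder of how the perplexity figures above are defined, here is a small sketch. Perplexity is the exponential of the mean per-token negative log-likelihood. Treating the macro figure as an unweighted mean of per-domain perplexities is an assumption, since the exact averaging is not specified here, and the numbers in the example are toy values, not results from this model.

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (in nats) per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def macro_perplexity(nlls_by_domain):
    """Unweighted mean of per-domain perplexities (assumed definition)."""
    ppls = [perplexity(nlls) for nlls in nlls_by_domain.values()]
    return sum(ppls) / len(ppls)

# Toy data: each domain counts equally regardless of how many tokens it has.
toy = {
    "domain_a": [math.log(2.0)] * 100,
    "domain_b": [math.log(8.0)] * 5,
}
print(macro_perplexity(toy))   # approximately 5.0
```

A macro average weights every domain equally, so a small but difficult domain moves the number as much as a large, easy one.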

Limitations

This is a 110M model (GPT-2-small class): it gives coherent, formatted answers to simple prompts but is weak on multi-step reasoning, precise facts (hallucinates), and long structured output. It's a research / demonstration model for adaptive-recurrence + single-card frontier-style training, not a production assistant. The SFT model is the better chatbot (the DPO pass increased verbosity at this scale).

Citation

If you use this, please cite the repo. Architecture: Universal Transformer (Dehghani et al. 2018) + Adaptive Computation Time (Graves 2016) + Transformer-XL memory (Dai et al. 2019).
