Adaptive-Recurrent (ACT) 110M

A 110M-parameter recurrent transformer. Instead of stacking N distinct layers, it applies one shared block up to 8 times per token, and a per-token halting head decides how many steps each token gets (Universal Transformer, Dehghani et al. 2018, with Adaptive Computation Time, Graves 2016). Transformer-XL segment memory provides long context, trained out to 4096 tokens.

The recurrence reuses ~46M compute-active parameters up to 8×, so a token that uses all 8 steps gets roughly the compute of a ~0.4B-parameter network while the model stores only 110M parameters.
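The per-token halting described above can be sketched in the style of Graves (2016). This is a minimal illustration, not the code in act_model_v2.py: `block` and `halt_logit` are hypothetical stand-ins for the shared block and the halting head, the slack `EPSILON` is an assumed value, and the state is a plain list of floats rather than a tensor.

```python
import math

MAX_STEPS = 8    # from this model card: up to 8 recurrent steps
EPSILON = 0.01   # assumed halting slack, following Graves (2016)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def act_token(state, block, halt_logit, max_steps=MAX_STEPS, eps=EPSILON):
    """Run one token through a shared block with ACT-style halting.

    `block` and `halt_logit` are hypothetical callables, not the repo's API.
    Returns (output_state, steps_used).
    """
    cumulative = 0.0                  # halting probability spent so far
    output = [0.0] * len(state)
    for step in range(1, max_steps + 1):
        state = block(state)          # same weights at every step
        h = sigmoid(halt_logit(state))
        # Stop once the halting budget is used up or the step cap is hit.
        last = (cumulative + h >= 1.0 - eps) or (step == max_steps)
        # The final step receives the remainder so the weights sum to 1.
        weight = (1.0 - cumulative) if last else h
        output = [o + weight * s for o, s in zip(output, state)]
        cumulative += h
        if last:
            return output, step

def toy_block(state):
    # Toy nonlinearity standing in for the transformer block.
    return [math.tanh(1.5 * s + 0.1) for s in state]

_, easy_steps = act_token([0.5, -0.2], toy_block, lambda s: 3.0)   # confident head
_, hard_steps = act_token([0.5, -0.2], toy_block, lambda s: -3.0)  # hesitant head
print(easy_steps, hard_steps)
```

The output is a weighted average of the intermediate states, so a token that halts early skips the remaining applications of the block entirely.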

Files

file                       what
base.safetensors           pretrained base (text continuation, long context)
sft.safetensors            instruction-tuned (use this for chat)
inference.py               load + generate / interactive chat
act_model_v2.py, train.py  model definition

Usage

pip install -r requirements.txt
python inference.py "What is the capital of France?"   # short sampled answer (output varies)
python inference.py                                     # interactive chat
MODEL=base python inference.py "Once upon a time"       # base model continuation

Runs on CPU or GPU (auto-detected). The GPT-2 tokenizer is bundled in tokenizer/, so nothing is downloaded. Generation is sampled (temperature 0.8), so answers vary between runs; at 110M parameters the model is coherent but frequently wrong on facts (see Limitations).
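To show why answers vary between runs, here is a minimal sketch of temperature sampling over next-token logits. It assumes plain softmax sampling; whether inference.py also applies top-k, top-p, or other filters is not stated here, and `sample_with_temperature` is a hypothetical name.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Sample a token index from logits scaled by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                             # subtract max for stability
    weights = [math.exp(x - top) for x in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Illustrative logits for a 3-token vocabulary (made-up values).
rng = random.Random(0)
print([sample_with_temperature([1.0, 5.0, 2.0], rng=rng) for _ in range(10)])
```

Temperatures below 1 sharpen the distribution toward the highest-scoring token; as the temperature approaches 0 the sampler behaves like greedy decoding.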

Architecture

hidden dim         1280
heads / head-dim   20 / 64
layers             1 shared block, recurred up to 8× (adaptive halting)
FFN                SwiGLU, 10240
context            Transformer-XL memory, trained 512 → 4096
embedding          tied GPT-2 (vocab 50257)
params             110M stored (~46M compute-active block) · ~0.4B effective compute (46M × 8 steps + head)
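The parameter figures can be roughly reproduced from the dimensions listed above. This is a back-of-envelope sketch that assumes bias-free projections and ignores norms, the halting head, and any Transformer-XL positional parameters, so it only approximates the real counts.

```python
D_MODEL = 1280     # hidden dim
D_FFN = 10240      # SwiGLU intermediate width
VOCAB = 50257      # GPT-2 vocabulary
MAX_STEPS = 8      # maximum recurrent steps

attention = 4 * D_MODEL * D_MODEL   # Q, K, V and output projections
ffn = 3 * D_MODEL * D_FFN           # SwiGLU: gate, up and down projections
block = attention + ffn             # the shared, compute-active block
embedding = VOCAB * D_MODEL         # tied: input embedding doubles as output head

stored = block + embedding
effective = block * MAX_STEPS + embedding

print(f"block     {block / 1e6:.1f}M")      # block     45.9M
print(f"stored    {stored / 1e6:.1f}M")     # stored    110.2M
print(f"effective {effective / 1e6:.1f}M")  # effective 431.3M
```

Under these assumptions the estimate lands close to the stated 46M block, 110M stored, and ~0.4B effective figures.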

Training

  • Pretrain: ~8.7B tokens, 13-corpus mix (FineWeb / code / books / arxiv / wiki / dialogue / pile), native FP8 on a single RTX PRO 6000 Blackwell. Long-context curriculum: trained at MEM 512 → 1024 → 2048 → 4096 (the usable context limit tracks the memory length the model was trained with).
  • SFT: 707M tokens (Dolly, Alpaca, OpenOrca, OpenHermes), prompt-masked.
  • DPO: preference optimization on UltraFeedback.
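"Prompt-masked" SFT means the loss is computed only on response tokens. A minimal sketch of one common way to do this follows; it assumes the PyTorch convention of marking ignored positions with -100, and `build_labels` and `masked_mean_nll` are hypothetical helpers, not functions from train.py.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_mean_nll(token_nlls, labels):
    """Average per-token loss over the unmasked (response) positions only."""
    kept = [nll for nll, lab in zip(token_nlls, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Made-up token ids and per-token losses, for illustration only.
input_ids, labels = build_labels([10, 11, 12], [20, 21])
print(labels)
print(masked_mean_nll([9.0, 9.0, 9.0, 1.0, 3.0], labels))
```

The usual one-position shift for next-token prediction is left to the training loop and is not shown.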

Results

  • FineWeb held-out PPL 25.4 (the uploaded long-context base; the best stage-1 / short-context checkpoint reached 24.76), vs 35.8 for a 6-layer baseline
  • Per-domain macro PPL 15.79 (ranging from 3.2 on code to 27 on long-form web)
  • Zero-shot macro-acc 0.41 (LAMBADA / HellaSwag / ARC-Easy / Winogrande)
  • The recurrence is adaptive: depth tracks token difficulty (code ~6.3 steps, hard prose ~8).
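For readers unfamiliar with the metrics above, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below also shows one plausible reading of "macro PPL" as an unweighted mean of per-domain perplexities; that averaging rule is an assumption, and the inputs are made-up values, not this model's results.

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def macro_perplexity(nlls_by_domain):
    """Unweighted mean of per-domain perplexities (assumed definition)."""
    per_domain = {d: perplexity(n) for d, n in nlls_by_domain.items()}
    return sum(per_domain.values()) / len(per_domain), per_domain

# Illustrative per-token losses for two hypothetical domains.
macro, per_domain = macro_perplexity({
    "code": [math.log(2.0)] * 4,
    "web": [math.log(6.0)] * 4,
})
print(per_domain, macro)
```

A macro average weights every domain equally regardless of its token count, which is why it can differ substantially from the perplexity on any single corpus.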

Limitations

This is a 110M model (GPT-2-small class): it gives coherent, formatted answers to simple prompts but is weak on multi-step reasoning, precise facts (hallucinates), and long structured output. It's a research / demonstration model for adaptive-recurrence + single-card frontier-style training, not a production assistant. The SFT model is the better chatbot (the DPO pass increased verbosity at this scale).

Citation

If you use this, please cite the repo. Architecture: Universal Transformer (Dehghani et al. 2018) + Adaptive Computation Time (Graves 2016) + Transformer-XL memory (Dai et al. 2019).
