Adaptive-Recurrent (ACT) 110M
A 110M-parameter recurrent transformer: instead of stacking N distinct layers, one shared block is applied recurrently up to 8 times per token, with a per-token halting head that decides how many times to recur (Universal Transformer, Dehghani et al. 2018; Adaptive Computation Time, Graves 2016). Transformer-XL segment memory provides long context; the model was trained out to 4096 tokens.
The recurrence reuses ~46M compute-active parameters up to 8×, so the model does the compute of a ~0.4B-parameter network while storing only 110M parameters.
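The halting loop can be sketched in plain Python. This is a minimal illustration of Graves-style ACT under stated assumptions, not the code in act_model_v2.py: `shared_block` and `halting_prob` are toy stand-ins, `EPSILON` is an assumed threshold, and the model card does not say whether the real model mixes intermediate states this way.

```python
import math

MAX_STEPS = 8    # from the model card: the shared block recurs up to 8 times
EPSILON = 0.01   # assumed slack on the halting threshold (not stated in the card)

def shared_block(state):
    """Toy stand-in for the shared transformer block (hypothetical)."""
    return [math.tanh(0.9 * x + 0.1) for x in state]

def halting_prob(state):
    """Toy stand-in for the halting head: sigmoid of the mean activation."""
    mean = sum(state) / len(state)
    return 1.0 / (1.0 + math.exp(-4.0 * mean))

def act_recur(state, max_steps=MAX_STEPS, eps=EPSILON):
    """Recur until the cumulative halting probability reaches 1 - eps.

    Returns the halting-weighted mix of intermediate states and the step count.
    """
    cumulative = 0.0
    output = [0.0] * len(state)
    for step in range(1, max_steps + 1):
        state = shared_block(state)
        p = halting_prob(state)
        last = step == max_steps or cumulative + p >= 1.0 - eps
        # The final step receives the remaining probability mass.
        weight = 1.0 - cumulative if last else p
        output = [o + weight * s for o, s in zip(output, state)]
        cumulative += weight
        if last:
            return output, step

mixed, steps_used = act_recur([0.5, -0.2, 0.1])
print(steps_used)  # number of recurrence steps this toy token consumed
```

Averaging `steps_used` over the tokens of a corpus gives a per-domain depth figure of the kind quoted under Results.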
Files
| file | what |
|---|---|
| base.safetensors | pretrained base (text continuation, long context) |
| sft.safetensors | instruction-tuned (use this for chat) |
| inference.py | load + generate / interactive chat |
| act_model_v2.py, train.py | model definition |
Usage
pip install -r requirements.txt
python inference.py "What is the capital of France?" # short sampled answer (output varies)
python inference.py # interactive chat
MODEL=base python inference.py "Once upon a time" # base model continuation
Runs on CPU or GPU (auto-detected). The GPT-2 tokenizer is bundled in tokenizer/, so nothing is downloaded. Generation is sampled (temperature 0.8), so answers vary between runs; at 110M the model is coherent but frequently wrong on facts (see Limitations).
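Temperature sampling explains why answers vary between runs. Below is a minimal sketch, assuming plain temperature scaling over the logits; `sample_with_temperature` is a hypothetical helper, and whether inference.py also applies top-k or top-p filtering is not stated here.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Sample one token id from raw logits after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# A seeded generator makes a run repeatable; the default is unseeded.
rng = random.Random(0)
token_id = sample_with_temperature([2.0, 1.0, 0.1], rng=rng)
```

Temperatures below 1.0 sharpen the distribution toward the highest-scoring token, so 0.8 is slightly more conservative than sampling from the raw softmax.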
Architecture
| | |
|---|---|
| hidden dim | 1280 |
| heads / head-dim | 20 / 64 |
| layers | 1 shared block, recurred up to 8× (adaptive halting) |
| FFN | SwiGLU, 10240 |
| context | Transformer-XL memory, trained 512 → 4096 |
| embedding | tied GPT-2 (vocab 50257) |
| params | 110M stored (~46M compute-active block) · ~0.4B effective compute (46M × 8 steps + head) |
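The parameter figures can be roughly reproduced from the table. The sketch below assumes four square attention projections, three SwiGLU matrices, and no biases; it ignores norms, the halting head, and any Transformer-XL position parameters, so it is an estimate rather than an exact count.

```python
D_MODEL = 1280     # hidden dim
FFN_DIM = 10240    # SwiGLU inner dim
VOCAB = 50257      # GPT-2 vocabulary
MAX_STEPS = 8      # maximum recurrence depth

attention = 4 * D_MODEL * D_MODEL   # Q, K, V and output projections
ffn = 3 * D_MODEL * FFN_DIM         # SwiGLU: gate, up and down projections
block = attention + ffn             # the shared, compute-active block
embedding = VOCAB * D_MODEL         # tied input/output embedding

stored = block + embedding
effective = block * MAX_STEPS + embedding

print(f"block     ~{block / 1e6:.1f}M")      # ~45.9M
print(f"stored    ~{stored / 1e6:.1f}M")     # ~110.2M
print(f"effective ~{effective / 1e6:.1f}M")  # ~431.3M
```

Under these assumptions the shared block accounts for about 46M of the stored parameters and the tied embedding for the remaining ~64M.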
Training
- Pretrain: ~8.7B tokens, 13-corpus mix (FineWeb / code / books / arxiv / wiki / dialogue / pile), native FP8 on a single RTX PRO 6000 Blackwell. Long-context curriculum: trained at MEM 512 → 1024 → 2048 → 4096 (the usable context wall moves with the trained memory).
- SFT: 707M tokens (Dolly, Alpaca, OpenOrca, OpenHermes), prompt-masked.
- DPO: preference optimization on UltraFeedback.
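The two fine-tuning objectives above can be sketched as follows. This is an illustration under assumptions, not the code in train.py: `mask_prompt_labels` and `dpo_loss` are hypothetical helpers, `IGNORE_INDEX = -100` follows the PyTorch cross-entropy default, and `beta=0.1` is a placeholder because the value used here is not stated.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel; PyTorch cross-entropy skips this label by default

def mask_prompt_labels(token_ids, prompt_len):
    """Copy token ids into labels, hiding the prompt so only the response is scored."""
    return [IGNORE_INDEX] * prompt_len + list(token_ids[prompt_len:])

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) computed stably as softplus(-beta * margin)
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

labels = mask_prompt_labels([5, 6, 7, 8], prompt_len=2)
print(labels)  # [-100, -100, 7, 8]
```

The DPO loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does, and rises above it in the opposite case.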
Results
- FineWeb held-out PPL 25.4 (the uploaded long-context base; the best stage-1 / short-context checkpoint reached 24.76), vs. a 6-layer baseline of 35.8
- Per-domain macro PPL 15.79 (ranging from code 3.2 to long-form web 27)
- Zero-shot macro-acc 0.41 (LAMBADA / HellaSwag / ARC-Easy / Winogrande)
- The recurrence is adaptive: depth tracks token difficulty (code ~6.3 steps, hard prose ~8).
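For readers less familiar with the metrics, here is a minimal sketch of how perplexity and a macro average are usually computed. The helper names are hypothetical and the inputs are toy values, not this model's results; the card does not say whether its macro PPL averages per-domain perplexities (as here) or exponentiates an averaged loss.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), with NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def macro_average(per_domain):
    """Unweighted mean over domains, so each domain counts equally."""
    return sum(per_domain.values()) / len(per_domain)

# Toy inputs for illustration only.
toy_ppl = perplexity([math.log(4.0)] * 3)          # a uniform guess over 4 tokens
toy_macro = macro_average({"a": 2.0, "b": 4.0})    # 3.0
```

A macro average weights a small domain the same as a large one, which is why it can differ noticeably from the perplexity on a single pooled held-out set.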
Limitations
This is a 110M model (GPT-2-small class): it gives coherent, formatted answers to simple prompts but is weak on multi-step reasoning, precise facts (hallucinates), and long structured output. It's a research / demonstration model for adaptive-recurrence + single-card frontier-style training, not a production assistant. The SFT model is the better chatbot (the DPO pass increased verbosity at this scale).
Citation
If you use this, please cite the repo. Architecture: Universal Transformer (Dehghani et al. 2018) + Adaptive Computation Time (Graves 2016) + Transformer-XL memory (Dai et al. 2019).