Recurrent Transformer GPT-2
A small Transformer-XL-style recurrent language model trained with the GPT-2 tokenizer.
Model
- Architecture: decoder-only recurrent transformer with detached KV memory
- Tokenizer: GPT-2 tokenizer (vocab_size=50257)
- Hidden size: 512
- Layers: 6
- Heads: 8
- Segment length: 128
- Memory length: 256
- Effective attention window: 384 tokens
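The window figure follows from the two lengths above: each 128-token segment attends to itself plus up to 256 cached positions, giving 128 + 256 = 384. The sketch below illustrates that bookkeeping only. `RecurrentConfig`, `roll_memory`, and `visible_positions` are hypothetical names, not taken from this repository, and plain position indices stand in for the detached key/value tensors a real model would cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecurrentConfig:
    # Values copied from the model list above; the class itself is illustrative.
    hidden_size: int = 512
    num_layers: int = 6
    num_heads: int = 8
    segment_len: int = 128
    mem_len: int = 256

    @property
    def head_dim(self) -> int:
        # Per-head width, assuming the hidden size is split evenly across heads.
        return self.hidden_size // self.num_heads

    @property
    def attention_window(self) -> int:
        # Current segment plus cached memory.
        return self.segment_len + self.mem_len


def roll_memory(memory, segment, mem_len):
    """Append the new segment to memory and keep only the newest mem_len entries."""
    if mem_len <= 0:
        return []
    return (memory + segment)[-mem_len:]


def visible_positions(num_tokens, cfg):
    """Yield (first, last, earliest_visible, context_size) for each segment."""
    memory = []  # stands in for detached KV states from earlier segments
    for start in range(0, num_tokens, cfg.segment_len):
        end = min(start + cfg.segment_len, num_tokens)
        segment = list(range(start, end))
        context = memory + segment
        yield segment[0], segment[-1], context[0], len(context)
        memory = roll_memory(memory, segment, cfg.mem_len)


if __name__ == "__main__":
    cfg = RecurrentConfig()
    print("head_dim:", cfg.head_dim, "window:", cfg.attention_window)
    for row in visible_positions(512, cfg):
        print(row)
```

In this sketch the context reaches the full 384 positions from the third segment onward. The first two segments see less because the memory is still filling.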
Training
Pretraining used local Parquet text shards copied from the neighboring project and tokenized with the GPT-2 tokenizer.
The preserved pretrained weights are:
- base.pt: best pretraining checkpoint
- latest_pretrain.pt: latest pretraining checkpoint
The full local checkpoint history is intentionally not included in this model upload because it is large. It is backed up locally under:
backups/pretrain_2026-06-17/
Fine-Tuning Data
The selected SFT dataset is databricks/databricks-dolly-15k, downloaded locally as:
data/sft/raw/databricks-dolly-15k.jsonl
It is suitable for this small model because it is compact, human-written, and single-turn instruction/QA oriented.
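As a rough illustration of how single-turn records like these could be turned into SFT examples, the sketch below reads a JSONL file and builds a prompt/target pair per record. It assumes the public Dolly-15k field names (`instruction`, `context`, `response`). The prompt template and the helper names `iter_jsonl` and `format_dolly_record` are hypothetical and are not the format used by this repository.

```python
import json


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def format_dolly_record(record):
    """Return (prompt, target) for one record.

    The "### ..." template is an assumption made for illustration only.
    """
    instruction = record["instruction"].strip()
    context = (record.get("context") or "").strip()
    response = record["response"].strip()

    parts = [f"### Instruction:\n{instruction}"]
    if context:
        # Only some records carry a supporting context passage.
        parts.append(f"### Context:\n{context}")
    parts.append("### Response:\n")
    return "\n\n".join(parts), response


if __name__ == "__main__":
    # Path taken from the section above.
    for record in iter_jsonl("data/sft/raw/databricks-dolly-15k.jsonl"):
        prompt, target = format_dolly_record(record)
        print(prompt + target)
        break
```

With a 128-token segment and 256-token memory, long context passages may not fit in a single attention window, so a real pipeline would likely need to truncate or filter such records.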
Intended Use
Research and experimentation with small recurrent transformer pretraining and SFT.