Recurrent Transformer GPT-2

A small Transformer-XL-style recurrent language model trained with the GPT-2 tokenizer.

Model

  • Architecture: decoder-only recurrent transformer with detached KV memory
  • Tokenizer: GPT-2 tokenizer (vocab_size=50257)
  • Hidden size: 512
  • Layers: 6
  • Heads: 8
  • Segment length: 128
  • Memory length: 256
  • Effective attention window: 384 tokens (128-token segment + 256-token memory)
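
The relationship between segment length, memory length, and the 384-token window can be illustrated with a minimal sketch. It is not the model's actual code: plain token positions stand in for the cached, detached key/value states, and the helper names `update_memory` and `attention_spans` are hypothetical. It assumes the memory is a rolling buffer that keeps the most recent `MEMORY_LEN` positions, as in Transformer-XL.

```python
SEGMENT_LEN = 128  # segment length from the model card
MEMORY_LEN = 256   # memory length from the model card


def update_memory(memory, segment, memory_len=MEMORY_LEN):
    """Append the newest segment and keep only the last `memory_len` entries.

    In a real model the entries would be detached hidden states (no gradient
    flows into them); here they are just token positions.
    """
    combined = memory + segment
    return combined[-memory_len:] if memory_len > 0 else []


def attention_spans(num_tokens, segment_len=SEGMENT_LEN, memory_len=MEMORY_LEN):
    """Yield (memory_size, segment_size) for each segment of a token stream."""
    memory = []
    for start in range(0, num_tokens, segment_len):
        segment = list(range(start, min(start + segment_len, num_tokens)))
        yield len(memory), len(segment)
        memory = update_memory(memory, segment, memory_len)


spans = list(attention_spans(512))
max_window = max(mem + seg for mem, seg in spans)
print(spans)       # memory fills up over the first two segments
print(max_window)  # 384
```

The window only reaches 384 tokens from the third segment onward, once the memory has been filled by two earlier segments.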

Training

Pretraining used local Parquet text shards copied from the neighboring project and tokenized with the GPT-2 tokenizer.

The preserved pretrained weights are:

  • base.pt: best pretraining checkpoint
  • latest_pretrain.pt: latest pretraining checkpoint

The full local checkpoint history is intentionally not included in this model upload because it is large. It is backed up locally under:

backups/pretrain_2026-06-17/

Fine-Tuning Data

The selected SFT dataset is databricks/databricks-dolly-15k, downloaded locally as:

data/sft/raw/databricks-dolly-15k.jsonl

It suits this small model because it is compact, human-written, and oriented toward single-turn instruction following and QA.
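
As an illustration of how single-turn records like these might be turned into SFT text, here is a minimal sketch. The field names `instruction`, `context`, and `response` are those of the Dolly dataset; the prompt template and the helper names `load_jsonl` and `format_example` are assumptions, not necessarily what this model was trained with.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def format_example(record):
    """Return (prompt, response) for one Dolly-style record.

    The "### ..." template is a hypothetical choice for illustration.
    """
    instruction = record["instruction"].strip()
    context = (record.get("context") or "").strip()
    response = record["response"].strip()

    parts = [f"### Instruction:\n{instruction}"]
    if context:  # many Dolly records have an empty context
        parts.append(f"### Context:\n{context}")
    parts.append("### Response:\n")
    return "\n\n".join(parts), response


# Made-up record, used only to show the output shape.
sample = {
    "instruction": "Name the capital of France.",
    "context": "",
    "response": "Paris.",
    "category": "open_qa",
}
prompt, target = format_example(sample)
print(prompt + target)
```

Keeping the prompt and response separate makes it possible to compute the loss on response tokens only, if that is desired.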

Intended Use

Research and experimentation with small recurrent transformer pretraining and SFT.
