LoopLM-135M-naive

A 135M-parameter dense looped transformer trained from scratch on FineWeb. Built as part of an exploration of looped LLM architectures inspired by Parcae.

This is the naive looped variant: a clean baseline without Parcae's LTI stability mechanisms, none of which beat it at this scale across 5 ablations.

📂 Code: github.com/harims95/LoopLM
📄 Parcae paper: arXiv:2604.12946

Quick Start

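from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("harims95/LoopLM-135M-naive")
# trust_remote_code=True lets transformers load the custom model code shipped with the repo.
model = AutoModelForCausalLM.from_pretrained(
    "harims95/LoopLM-135M-naive",
    trust_remote_code=True,
)
model.eval()  # switch off training-only behavior such as dropout

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():  # no gradients needed for inference
    out = model(**inputs)

# Greedy pick: the highest-scoring token at the last position of the first sequence.
next_token_id = out.logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))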

For generation with sampling (top-k + temperature), use scripts/generate.py.
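For readers who want to see what top-k + temperature sampling does before opening the script, here is a minimal framework-free sketch. It is not the code in scripts/generate.py; the function name `sample_top_k` and its signature are made up for illustration, and the defaults simply mirror the settings used for the example outputs in this card.

```python
import math
import random
from typing import Optional, Sequence


def sample_top_k(
    logits: Sequence[float],
    temperature: float = 0.8,
    top_k: int = 50,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token id from `logits` using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    k = max(1, min(top_k, len(logits)))

    # Keep the k highest-scoring token ids; everything else gets zero probability.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature-scaled softmax over the survivors (subtract the max for stability).
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]
```

In a generation loop, one way to use this would be to pass `out.logits[0, -1].tolist()` from the Quick Start snippet, append the sampled id to the input ids, and run the model again.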

Architecture

Input tokens
    ↓
[Embedding]
    ↓
[Prelude: 4 transformer blocks]
    ↓
e (input injection)
    ↓
[Loop block × T loops]  ← T ~ Poisson(μ=6) per-sequence
    ↓                      Update: h_{t+1} = block(h + e)
h_final
    ↓
[Coda: 2 transformer blocks]
    ↓
[Tied lm_head] → logits
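
The control flow in the diagram can be sketched in a few lines. This is a toy illustration rather than the model code: the "blocks" are plain functions on a list of floats, the name `looped_forward` is hypothetical, and the zero initial state is an assumption (this card does not state how h is initialized).

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def looped_forward(
    x: Vector,
    prelude: Sequence[Block],
    loop_block: Block,
    coda: Sequence[Block],
    num_loops: int,
) -> Vector:
    """Run prelude -> (shared block x num_loops) -> coda on one hidden vector."""
    # The prelude output is the injected input e.
    e = x
    for block in prelude:
        e = block(e)

    # ASSUMPTION: the recurrent state starts at zero; the card does not specify h_0.
    h = [0.0] * len(e)

    # The same block (same weights) is reused every iteration: h <- block(h + e).
    for _ in range(num_loops):
        h = loop_block([hi + ei for hi, ei in zip(h, e)])

    for block in coda:
        h = block(h)
    return h


# Toy stand-ins for transformer blocks, only to exercise the control flow.
add_one: Block = lambda v: [a + 1.0 for a in v]
halve: Block = lambda v: [0.5 * a for a in v]

out = looped_forward([0.0], prelude=[add_one] * 4, loop_block=halve,
                     coda=[add_one] * 2, num_loops=3)
print(out)  # [5.5]
```

The point of the sketch is that extra loop iterations add compute but no parameters, since `loop_block` is a single shared function.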

Specs:

  • Type: Dense looped transformer (recurrent reuse of one transformer block)
  • Total params: 135M (134.1M unique trainable, tied input/output embeddings)
  • d_model: 1024
  • Attention: GQA with 16 query heads / 8 KV heads, head_dim=64
  • Position encoding: RoPE (θ=10000)
  • Normalization: RMSNorm pre-norm, QK-norm
  • FFN: SwiGLU, dense_ffn=2816
  • Vocab: 50304 (GPT-2 BPE + padding), tied embeddings
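
As a sanity check, the 134.1M figure can be recovered from the specs above. The sketch below counts weight matrices only: it assumes 7 distinct blocks (4 prelude + 1 shared loop block + 2 coda) and ignores norm weights and any bias terms, which are small by comparison.

```python
d_model = 1024
n_q_heads, n_kv_heads, head_dim = 16, 8, 64
d_ffn = 2816
vocab_size = 50304
n_blocks = 4 + 1 + 2  # prelude + one shared loop block + coda

# Attention projections: Q and O are d_model x (n_q_heads * head_dim);
# K and V are d_model x (n_kv_heads * head_dim) under GQA.
attn = 2 * d_model * n_q_heads * head_dim + 2 * d_model * n_kv_heads * head_dim

# SwiGLU uses three matrices (gate, up, down), each d_model x d_ffn.
ffn = 3 * d_model * d_ffn

# Tied input/output embeddings are counted once.
embedding = vocab_size * d_model

total = n_blocks * (attn + ffn) + embedding
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 134,086,656 parameters (~134.1M)
```

Roughly 51.5M of these parameters sit in the embedding table, so the transformer blocks themselves account for about 82.6M.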

Training

  • Dataset: FineWeb (raw, kjj0/fineweb10B-gpt2)
  • Tokens consumed: ~4.6B
  • Steps: 17,500
  • Hardware: 2× H100 on Modal
  • Wall clock: ~3 hours
  • Total cost: ~$22
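
The token count is consistent with the step count and the per-step batch size listed under Hyperparameters. A quick check using only numbers from this card (the throughput line is a rough estimate, since the wall clock is approximate):

```python
micro_batch, seq_len, n_gpus, grad_accum = 32, 1024, 2, 4
steps = 17_500
wall_clock_hours = 3  # approximate, as reported in this section

tokens_per_step = micro_batch * seq_len * n_gpus * grad_accum
total_tokens = steps * tokens_per_step
tokens_per_second = total_tokens / (wall_clock_hours * 3600)

print(f"{tokens_per_step:,} tokens/step")            # 262,144 tokens/step
print(f"{total_tokens / 1e9:.2f}B tokens consumed")  # 4.59B tokens consumed
print(f"~{tokens_per_second:,.0f} tokens/s across both GPUs")
```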

Hyperparameters:

  • Batch: 262,144 tokens/step (micro=32 × seq=1024 × 2 GPUs × accum=4)
  • Optimizer: Muon (matrices) + AdamW (norms, biases, embeddings)
  • LR: Muon 0.02, AdamW 3e-4
  • Schedule: 100-step warmup, 60% constant LR, 40% cosine decay to 0.1× peak
  • Precision: bf16 with fp32 logits
  • μ_rec=6 Poisson per-sequence loop depth
  • μ_bwd=3 truncated BPTT (gradients only through last 3 loops)
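
The last two settings interact: each sequence gets its own loop depth, and only the final few loops are backpropagated through. The sketch below illustrates that bookkeeping in plain Python. It is not the training code; `sample_poisson` and `loop_plan` are hypothetical names, and clamping the depth to at least 1 is an assumption, since a raw Poisson draw can be 0.

```python
import math
import random
from typing import Tuple


def sample_poisson(mu: float, rng: random.Random) -> int:
    """Draw from Poisson(mu) with Knuth's multiplication method (fine for small mu)."""
    threshold = math.exp(-mu)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def loop_plan(mu_rec: float, mu_bwd: int, rng: random.Random) -> Tuple[int, int]:
    """Return (loops run without gradient, loops run with gradient) for one sequence."""
    # ASSUMPTION: depth is clamped to at least 1 so the loop block always runs.
    total = max(1, sample_poisson(mu_rec, rng))
    with_grad = min(total, mu_bwd)
    return total - with_grad, with_grad


rng = random.Random(0)
plans = [loop_plan(mu_rec=6, mu_bwd=3, rng=rng) for _ in range(10_000)]
mean_depth = sum(no_grad + grad for no_grad, grad in plans) / len(plans)
print(f"mean loop depth ~ {mean_depth:.2f}")
```

In a PyTorch training loop, the first group of iterations would typically run under `torch.no_grad()` (or have their output detached) so that only the last `mu_bwd` iterations contribute to the backward pass.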

Results

| Model | Architecture | Tokens | Val Loss (FineWeb) |
|---|---|---|---|
| HobbyLM-30M (prior) | Dense (8 layers) | 1B | 3.91 |
| LoopLM-135M-naive (this) | Dense looped | 4.6B | 3.95 |
| HobbyLM-130M MoE (sibling) | MoE (140M total / 62M active) | 10B | 3.30 |

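At this scale, the sparse MoE sibling reaches a much lower validation loss than the dense looped model (3.30 vs 3.95) at a similar total parameter count (140M vs 135M). It also saw roughly twice as many tokens (10B vs 4.6B), so the comparison is not token-matched. Against the 30M dense baseline, the looped model's validation loss is about the same (3.95 vs 3.91), so on this metric looping did not clearly separate it from the smaller dense model.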
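
To make the loss gaps easier to read, the values in the table can be converted to perplexity. This assumes the reported losses are mean cross-entropy per token in nats, which is the usual convention but is not stated explicitly in this card.

```python
import math

val_loss = {
    "HobbyLM-30M": 3.91,
    "LoopLM-135M-naive": 3.95,
    "HobbyLM-130M MoE": 3.30,
}

# ASSUMPTION: losses are mean cross-entropy per token, measured in nats.
perplexity = {name: math.exp(loss) for name, loss in val_loss.items()}

for name, ppl in perplexity.items():
    print(f"{name}: loss {val_loss[name]:.2f} -> perplexity {ppl:.1f}")
```

Under that assumption, the 0.65-nat gap between the looped model and the MoE sibling corresponds to a perplexity ratio of roughly 1.9×.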

The Parcae Investigation (Honest Findings)

This project began as an attempt to reproduce Parcae's LTI stability mechanisms for looped LMs. Across 5 ablations, none of the Parcae variants beat the naive baseline:

| # | Variant | Description | Final Val |
|---|---|---|---|
| 1 | Naive | h_{t+1} = block(h + e) | 3.84 (FineWeb-Edu) |
| 2 | A matrix | + LTI step in parallel | 3.84 (tied) |
| 3 | + input norm v1 | Wrong arch flow | diverged |
| 4 | + LTI before block | Fixed arch + B identity init | worse |
| 5 | + B → AdamW (wd=0) | Match official optimizer routing | dramatically worse |

Each "fix" that brought the implementation closer to the official Parcae code failed to help, and from variant 3 onward made performance worse. After checking the paper's Appendix Q and the official repo, and after multiple debugging passes, the conclusion is:

Parcae's stability mechanisms appear to require larger scale (1B+ params, 100B+ tokens) to demonstrate benefit. At 135M params / 4.6B tokens, naive looped reuse is competitive enough.

The Parcae paper itself reports its stability tricks help most when training runs into "late-stage loss spikes after 170k steps." Our runs were at 17.5k steps. We never reached the regime where these mechanisms pay off.

Example Outputs (Cherry-picked)

Generated at temperature=0.8, top_k=50:

Prompt: "The advantages of solar energy include"

The advantages of solar energy include the advantages of solar energy. At the same time, solar energy is used for generating electricity, and solar energy is the first choice for solar power generation. Solar energy is generally renewable, and is considered a renewable energy.

Prompt: "Once upon a time, in a small village,"

Once upon a time, in a small village, where you could be greeted with a gentle, friendly face. This beautiful, charming village is situated on a calm, peaceful setting; with its peaceful nature and calm nature, this charming village does not exist.

Honest assessment: Locally fluent, syntactically valid English. Prone to repetition and invented facts. Expected behavior for a 135M model trained on ~4.6B tokens — not competitive with modern instruction-tuned models, but a clean from-scratch baseline.

Reproducibility

Full training code: github.com/harims95/LoopLM

spec.json in this repo contains the exact training configuration (CLI args, model config, train config, git commit hash, GPU type, PyTorch version).
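
A small helper for inspecting the file is sketched below. It makes no assumption about the key names inside spec.json and only lists whatever top-level keys are present; `load_spec` and `summarize` are illustrative names, not part of the repo.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_spec(path: str) -> Dict[str, Any]:
    """Read a spec.json file and return its contents as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(spec: Dict[str, Any]) -> None:
    """Print each top-level key with the type of its value."""
    for key in sorted(spec):
        print(f"{key}: {type(spec[key]).__name__}")
```

One way to get a local copy is `huggingface_hub.hf_hub_download(repo_id="harims95/LoopLM-135M-naive", filename="spec.json")`, which returns a path that can be passed to `load_spec`.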

To reproduce the training run:

git clone https://github.com/harims95/LoopLM
cd LoopLM
pip install -r requirements.txt
pip install modal && modal token new

# Download FineWeb shards to Modal volume
python -m modal run --detach training/modal_train.py \
    --action download --dataset fineweb --shards 50

# Train
python -m modal run --detach training/modal_train.py \
    --action train --preset 135M --steps 20000 \
    --run-name looplm_naive --gpus 2 --micro 32 \
    --seq-len 1024 --batch-tokens 262144 \
    --dataset fineweb \
    --overrides "use_a_matrix=false,use_input_norm=false"

Limitations

  • Not instruction-tuned. This is a base model only.
  • Small. 135M parameters; expect hallucination and limited factual recall.
  • Repetition. The model tends to repeat phrases, and nothing in training discourages it; generation benefits from top_k sampling.
  • No .generate() polish. The HF wrapper returns logits; a vanilla sampling loop is in scripts/generate.py.
  • English only. Tokenizer is GPT-2 BPE; training data is FineWeb English.

Acknowledgments

  • @harishsg993010 — training infrastructure (Muon, data loader, Modal harness, optimizer setup)
  • Sandy Research — official Parcae implementation that helped me debug
  • The Parcae authors — paper and honest scaling analysis
  • kjj0 — FineWeb GPT-2 tokenized shards
  • Modal Labs — accessible H100 training

License

Apache 2.0

Citation

If you use this model or the Parcae findings, please cite:

@misc{looplm-135m-naive,
  author = {Hari},
  title = {LoopLM-135M-naive: A dense looped transformer with honest Parcae ablations},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/harims95/LoopLM-135M-naive}
}