nanoprot-gpt2-M

A protein language model (gpt2 architecture) — 135.4M (135,390,414) parameters, trained on 1.62B (1,624,684,968) UniRef50 residues.

nanoprot-gpt2-M is part of the nanoprot suite: a Pythia-style matrix of protein language models spanning three architectures (gpt2, esm2, mamba) and four scales (XS/S/M/L), each trained from scratch on UniRef50 under a matched, Chinchilla-style data budget. The suite is built for controlled comparison — same data, same tokenizer, one variable at a time.

Headline result

Validation bits-per-residue (lower is better): 3.6838 ± 0.0013 (n=3 seeds)

Directly comparable to other autoregressive nanoprot models (gpt2, mamba) — same 33-token vocabulary, same AR objective, same data budget. Not comparable to the esm2 (masked-LM) models, whose metric is a different quantity.
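To give the headline number some scale, bits-per-residue can be converted to a per-residue perplexity and compared with uniform-guess baselines. The sketch below assumes the metric is the mean next-residue cross-entropy in base-2 logarithm; the function name and the choice of baselines (20 standard amino acids, the full 33-token vocabulary) are illustrative and not taken from nanoprot.

```python
import math


def perplexity_from_bits(bits_per_residue: float) -> float:
    """Per-residue perplexity implied by a cross-entropy given in bits."""
    return 2.0 ** bits_per_residue


val_bpr = 3.6838  # reported validation mean over 3 seeds

ppl = perplexity_from_bits(val_bpr)
uniform_20 = math.log2(20)  # uniform guess over the 20 standard amino acids
uniform_33 = math.log2(33)  # uniform guess over the full 33-token vocabulary

print(f"perplexity:               {ppl:.2f}")
print(f"uniform over 20 residues: {uniform_20:.3f} bits")
print(f"uniform over 33 tokens:   {uniform_33:.3f} bits")
```

Under this reading, the model is on average about as uncertain as a uniform choice among roughly 13 residues, compared with 20 for a guesser that knows only the standard amino-acid alphabet.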

Model details


Architecture	`gpt2`
Objective	Autoregressive (causal language modeling)
Scale rung	M
Parameters	135.4M (135,390,414)
Layers (depth)	14
Hidden size (d_model)	896
Attention heads	14
Max sequence length	512
Vocabulary	33-token residue alphabet (ESM-2)
MLP activation	relu_squared
Logit softcap	15.0
Window pattern	L
Precision	bf16
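
The listed dimensions can be cross-checked with the usual back-of-the-envelope transformer estimate of about 12·L·d² non-embedding parameters (4·d² for attention plus 8·d² for a 4x-wide MLP per layer). This is a sketch under that assumption: the card does not state the MLP width or whether embeddings are tied, so the estimate is not expected to match the reported count exactly.

```python
n_layers = 14
d_model = 896
n_heads = 14
vocab_size = 33
reported_params = 135_390_414

# Each head gets an equal slice of the hidden size.
head_dim = d_model // n_heads

# Rule of thumb: 4*d^2 (Q, K, V, output projections) + 8*d^2 (4x MLP) per layer.
block_params = 12 * n_layers * d_model**2

# One embedding matrix; an untied output head would add another of the same size.
embedding_params = vocab_size * d_model

# Fraction of the reported count that the rule of thumb leaves unexplained.
rel_gap = (reported_params - block_params) / reported_params

print(f"head_dim          = {head_dim}")
print(f"12*L*d^2 estimate = {block_params:,}")
print(f"embedding matrix  = {embedding_params:,}")
print(f"gap to reported   = {rel_gap:.2%}")
```

The estimate lands within half a percent of the reported count. With a 33-token vocabulary the embedding matrix is negligible, so almost all parameters sit in the transformer blocks.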

Training


Data	UniRef50 release 2026_01 (28-Jan-2026), 60,251,814 sequences; held-out final shard for validation
Tokenizer	`esm2` — 33-token residue alphabet (shared across the whole suite)
Optimizer	Muon (matrices) + AdamW (embeddings/scalars), weight_decay=0.1
Batch size	524,288 residues/step
Optimizer steps	3,098
Residues seen	1.62B (1,624,684,968)
Data/param ratio	12.0 residues per parameter (Chinchilla-style)
Total FLOPs	1.441e+18
Wall-clock	0.69 h (41 min) on 4 GPUs
Seed	0 (siblings: see below)
nanoprot version	0.5.0
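
The budget figures in the table above can be sanity-checked with a few lines of arithmetic. The 6·N·D FLOP rule used here is a common approximation and an assumption on my part; the card does not say how nanoprot counts FLOPs.

```python
params = 135_390_414
residues_seen = 1_624_684_968
batch_residues = 524_288
optimizer_steps = 3_098
reported_flops = 1.441e18
wall_clock_hours = 0.69

# Residues per parameter: the ratio the data budget is matched on.
residues_per_param = residues_seen / params

# Residues implied by whole optimizer steps at the listed batch size.
residues_from_steps = optimizer_steps * batch_residues

# Common approximation: about 6 FLOPs per parameter per training token.
# It ignores attention-score FLOPs, so it tends to sit below a full count.
flops_6nd = 6 * params * residues_seen

# Aggregate throughput across all 4 GPUs.
residues_per_second = residues_seen / (wall_clock_hours * 3600)

print(f"residues per parameter: {residues_per_param:.1f}")
print(f"steps x batch:          {residues_from_steps:,}")
print(f"6*N*D estimate:         {flops_6nd:.3e}")
print(f"throughput:             {residues_per_second:,.0f} residues/s")
```

Steps times batch size comes out about 0.03% below the reported residues seen, and the 6·N·D estimate (about 1.32e18) is below the reported 1.441e18. That gap would be consistent with a FLOP count that includes terms the approximation drops, though this is a guess.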

Evaluation

Evaluated on a held-out UniRef50 shard. Validation bits-per-residue (lower is better): 3.6838 ± 0.0013 (n=3 seeds).
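
For readers who want to compute a comparable number themselves, here is a minimal pure-Python sketch of bits-per-residue from next-residue logits. It assumes the metric is the mean cross-entropy over scored positions, converted from nats to bits. The function name and toy inputs are hypothetical, and the card does not specify how padding or special tokens are treated, so this is not necessarily nanoprot's exact evaluation.

```python
import math
from typing import Sequence


def bits_per_residue(
    logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Mean next-residue cross-entropy in bits.

    logits[i] holds the unnormalized scores assigned to each vocabulary
    token at position i; targets[i] is the true token id at that position.
    """
    if len(logits) != len(targets) or not targets:
        raise ValueError("logits and targets must be non-empty and equal length")
    total_nats = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total_nats += log_z - row[target]
    # Convert the mean from nats to bits.
    return total_nats / len(targets) / math.log(2)


# Toy check: uniform scores over a 33-token vocabulary give log2(33) bits.
uniform_logits = [[0.0] * 33 for _ in range(4)]
print(bits_per_residue(uniform_logits, [5, 9, 12, 20]))
```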

Intended use & limitations

Research use: learning protein representations, extracting residual-stream features, mechanistic-interpretability probing, and architecture comparison. Trained only on UniRef50 sequences — not for clinical or diagnostic use, and not aligned to any downstream task out of the box.

How to load

Install nanoprot (pip install nanoprot), download this repo, and point the arch-aware loader at the folder — it works for any nanoprot architecture (gpt2 / esm2 / mamba), reading the embedded config and selecting the right tokenizer automatically.

from nanoprot.training.checkpoint import load_pretrained

model, cfg, meta, tokenizer = load_pretrained(
    "path/to/this/repo", device="cpu", return_tokenizer=True,
)
model.eval()
# meta carries the trained-artifact facts (params, FLOPs, val metric, ...)

The nanoprot suite

Hub: yagizdevre/nanoprot-gpt2-M (seed 0 is the default; the sibling seeds live on branches seed1 and seed2). Final validation bits-per-residue for all three seeds of this model:

nanoprot-gpt2-M-s0 — final bits-per-residue 3.6835
nanoprot-gpt2-M-s1 — final bits-per-residue 3.6852
nanoprot-gpt2-M-s2 — final bits-per-residue 3.6827
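
The three per-seed values above reproduce the headline figure. The sketch below uses the sample standard deviation (n - 1 denominator). The card does not say which estimator was used, so that choice is an assumption.

```python
import statistics

seed_bpr = {"s0": 3.6835, "s1": 3.6852, "s2": 3.6827}
values = list(seed_bpr.values())

mean_bpr = statistics.mean(values)
# Sample standard deviation (n - 1 in the denominator).
std_bpr = statistics.stdev(values)

print(f"{mean_bpr:.4f} ± {std_bpr:.4f} (n={len(values)} seeds)")
```

This gives 3.6838 ± 0.0013, matching the reported value. The population standard deviation would round to 0.0010, which suggests, but does not confirm, that the sample estimator was used.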

The full suite spans {gpt2, esm2, mamba} x {XS, S, M, L} x {seed 0,1,2}. See the nanoprot repository for the complete grid and the scaling-curve comparisons.
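
To fetch a specific seed, one option is `snapshot_download` from `huggingface_hub` with a `revision`. The sketch below assumes that seed 0 sits on the Hub's default `main` branch and that the sibling branches are named exactly `seed1` and `seed2`, as described above. Both helper names are hypothetical and not part of nanoprot.

```python
def revision_for_seed(seed: int) -> str:
    """Map a seed index to a Hub revision name.

    Assumption: seed 0 lives on the default branch ("main") and the
    siblings on branches "seed1" and "seed2".
    """
    if seed not in (0, 1, 2):
        raise ValueError("this suite documents seeds 0, 1 and 2 only")
    return "main" if seed == 0 else f"seed{seed}"


def download_seed(seed: int, repo_id: str = "yagizdevre/nanoprot-gpt2-M") -> str:
    """Download one seed's files and return the local folder path."""
    # Imported lazily so revision_for_seed works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, revision=revision_for_seed(seed))
```

The returned folder path can then be passed to `load_pretrained` in place of "path/to/this/repo".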

Citation

@software{nanoprot,
  author  = {Devre, H. Yagiz},
  title   = {nanoprot: a minimal training framework for protein language models},
  year    = {2026},
  url     = {https://github.com/ygzdvr/nanoprot}
}

Reproducibility

Trained with nanoprot v0.5.0 (corpus prepared 2026-06-01T12:51:52+00:00). The exact, complete training config is in config.yaml (also embedded in meta_003098.json). Re-train with:

python -m scripts.train --config config.yaml
