nanoprot-gpt2-M

A protein language model (gpt2 architecture) with 135.4M (135,390,414) parameters, trained on 1.62B (1,624,684,968) UniRef50 residues.

nanoprot-gpt2-M is part of the nanoprot suite: a Pythia-style matrix of protein language models spanning three architectures (gpt2, esm2, mamba) and four scales (XS/S/M/L), each trained from scratch on UniRef50 under a matched, Chinchilla-style data budget. The suite is built for controlled comparison: same data, same tokenizer, one variable at a time.

Headline result

Validation bits-per-residue (lower is better): 3.6838 ± 0.0013 (n=3 seeds)

Directly comparable to the other autoregressive nanoprot models (gpt2, mamba), which share the same 33-token vocabulary, AR objective, and data budget. Not comparable to the esm2 (masked-LM) models, whose metric is a different quantity.
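
The card does not spell out how bits-per-residue is computed. A minimal sketch, assuming it is the mean next-residue cross-entropy expressed in base 2 (the usual convention for this kind of metric), shows how the headline number relates to a loss in nats and to a per-residue perplexity. The function names are illustrative and not part of nanoprot.

```python
import math

VOCAB_SIZE = 33        # vocabulary size from the model card
HEADLINE_BPR = 3.6838  # validation bits-per-residue from the model card


def nats_to_bits(loss_nats: float) -> float:
    """Convert a mean cross-entropy in nats to bits."""
    return loss_nats / math.log(2)


def bits_to_perplexity(bits: float) -> float:
    """Per-residue perplexity implied by a bits-per-residue value."""
    return 2.0 ** bits


# Reference point: a model that guesses uniformly over the whole vocabulary.
uniform_bpr = math.log2(VOCAB_SIZE)

# The same headline loss expressed in nats, and the perplexity it implies.
loss_nats = HEADLINE_BPR * math.log(2)
perplexity = bits_to_perplexity(HEADLINE_BPR)

print(f"uniform baseline: {uniform_bpr:.3f} bits/residue")
print(f"headline loss:    {loss_nats:.4f} nats/residue")
print(f"perplexity:       {perplexity:.2f}")
```

Under this assumption, a training loop that logs cross-entropy in nats only needs the division by ln 2 to report numbers on the same scale as this card.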

Model details

Architecture: gpt2
Objective: AR (causal language modeling)
Scale rung: M
Parameters: 135.4M (135,390,414)
Layers (depth): 14
Hidden size (d_model): 896
Attention heads: 14
Max sequence length: 512
Vocabulary: 33-token residue alphabet (ESM-2)
MLP activation: relu_squared
Logit softcap: 15.0
Window pattern: L
Precision: bf16
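
The MLP activation and logit softcap entries above are terse. The sketch below assumes they follow the common definitions: relu_squared as max(x, 0) squared, and a softcap c applied as c * tanh(x / c), which bounds every logit to the open interval (-c, c). Whether nanoprot implements them exactly this way is not stated in the card, so treat both functions as illustrative.

```python
import math

LOGIT_SOFTCAP = 15.0  # value from the model card


def relu_squared(x: float) -> float:
    """Assumed MLP activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2


def softcap(logit: float, cap: float = LOGIT_SOFTCAP) -> float:
    """Assumed soft cap: squashes a logit smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)


# Small logits pass through almost unchanged; large ones saturate near the cap.
for raw in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(f"raw={raw:7.1f}  relu^2={relu_squared(raw):9.1f}  capped={softcap(raw):8.4f}")
```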

Training

Data: UniRef50 release 2026_01 (28-Jan-2026), 60,251,814 sequences; final shard held out for validation
Tokenizer: esm2, 33-token residue alphabet (shared across the whole suite)
Optimizer: Muon (matrices) + AdamW (embeddings/scalars), weight_decay=0.1
Batch size: 524,288 residues/step
Optimizer steps: 3,098
Residues seen: 1.62B (1,624,684,968)
Param/data ratio: 12.0 residues per parameter (Chinchilla-style)
Total FLOPs: 1.441e+18
Wall-clock: 0.69 h (41 min) on 4 GPUs
Seed: 0 (siblings: see below)
nanoprot version: 0.5.0
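
The training figures above are related by simple arithmetic. The sketch below uses only numbers from this card to derive the residues-per-parameter ratio and the implied average throughput. The throughput values are rough: the wall-clock figure is rounded to two decimals, and the card does not say what that time includes.

```python
PARAMS = 135_390_414           # parameters, from the card
RESIDUES_SEEN = 1_624_684_968  # residues seen, from the card
TOTAL_FLOPS = 1.441e18         # total FLOPs, from the card
WALL_CLOCK_HOURS = 0.69        # wall-clock, from the card (rounded)
NUM_GPUS = 4                   # from the card

# Data budget expressed as residues per parameter.
residues_per_param = RESIDUES_SEEN / PARAMS

# Average throughput implied by the reported wall-clock time.
seconds = WALL_CLOCK_HOURS * 3600
residues_per_second = RESIDUES_SEEN / seconds
flops_per_gpu_per_second = TOTAL_FLOPS / seconds / NUM_GPUS

print(f"residues per parameter: {residues_per_param:.1f}")
print(f"residues per second:    {residues_per_second:,.0f}")
print(f"FLOP/s per GPU:         {flops_per_gpu_per_second:.3e}")
```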

Evaluation

Evaluated on a held-out UniRef50 shard. Validation bits-per-residue (lower is better): 3.6838 ± 0.0013 (n=3 seeds).

Directly comparable to the other autoregressive nanoprot models (gpt2, mamba), which share the same 33-token vocabulary, AR objective, and data budget. Not comparable to the esm2 (masked-LM) models, whose metric is a different quantity.

Intended use & limitations

Research use: learning protein representations, extracting residual-stream features, mechanistic-interpretability probing, and architecture comparison. Trained only on UniRef50 sequences; not for clinical or diagnostic use, and not aligned to any downstream task out of the box.

How to load

Install nanoprot (pip install nanoprot), download this repo, and point the arch-aware loader at the folder. The loader works for any nanoprot architecture (gpt2 / esm2 / mamba): it reads the embedded config and selects the right tokenizer automatically.

from nanoprot.training.checkpoint import load_pretrained

model, cfg, meta, tokenizer = load_pretrained(
    "path/to/this/repo", device="cpu", return_tokenizer=True,
)
model.eval()
# meta carries the trained-artifact facts (params, FLOPs, val metric, ...)
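
The "download this repo" step can be scripted. The sketch below assumes the huggingface_hub package is installed and uses its snapshot_download function. The Hub id and the branch names seed1 and seed2 come from this card; the helper names revision_for_seed, download_checkpoint, and load_seed are hypothetical.

```python
from typing import Optional

REPO_ID = "yagizdevre/nanoprot-gpt2-M"  # Hub id from this card


def revision_for_seed(seed: int) -> Optional[str]:
    """Map a seed to a Hub revision: None (default branch) for seed 0, else 'seed<N>'."""
    if seed not in (0, 1, 2):
        raise ValueError("this card lists seeds 0, 1 and 2 only")
    return None if seed == 0 else f"seed{seed}"


def download_checkpoint(seed: int = 0) -> str:
    """Download the repo snapshot for one seed and return its local folder."""
    # Imported lazily so revision_for_seed works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, revision=revision_for_seed(seed))


def load_seed(seed: int = 0):
    """Download one seed and load it with the loader shown in this card."""
    from nanoprot.training.checkpoint import load_pretrained

    folder = download_checkpoint(seed)
    model, cfg, meta, tokenizer = load_pretrained(
        folder, device="cpu", return_tokenizer=True,
    )
    model.eval()
    return model, cfg, meta, tokenizer
```

Calling load_seed(1) would fetch the seed1 branch and load it on CPU; nothing is downloaded until one of the functions is called.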

The nanoprot suite

Hub: yagizdevre/nanoprot-gpt2-M (seed 0 is the default; siblings on branches seed1, seed2). This model's sibling seeds:

  • nanoprot-gpt2-M-s0: final bits-per-residue 3.6835
  • nanoprot-gpt2-M-s1: final bits-per-residue 3.6852
  • nanoprot-gpt2-M-s2: final bits-per-residue 3.6827
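
The headline figure can be checked against the three per-seed values above. The card does not say which spread estimator it uses; this sketch computes both the sample (n - 1) and the population standard deviation, and the sample version is the one that rounds to the reported 0.0013.

```python
import statistics

# Final bits-per-residue for each sibling seed, copied from the list above.
final_bpr = {"s0": 3.6835, "s1": 3.6852, "s2": 3.6827}
values = list(final_bpr.values())

mean = statistics.mean(values)
sample_std = statistics.stdev(values)       # divides by n - 1
population_std = statistics.pstdev(values)  # divides by n

print(f"{mean:.4f} ± {sample_std:.4f} (n={len(values)} seeds)")
# prints: 3.6838 ± 0.0013 (n=3 seeds)
print(f"population std: {population_std:.4f}")
```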

The full suite spans {gpt2, esm2, mamba} x {XS, S, M, L} x {seed 0,1,2}. See the nanoprot repository for the complete grid and the scaling-curve comparisons.

Citation

@software{nanoprot,
  author  = {Devre, H. Yagiz},
  title   = {nanoprot: a minimal training framework for protein language models},
  year    = {2026},
  url     = {https://github.com/ygzdvr/nanoprot}
}

Reproducibility

Trained with nanoprot v0.5.0 (corpus prepared 2026-06-01T12:51:52+00:00). The exact, complete training config is in config.yaml (also embedded in meta_003098.json). Re-train with:

python -m scripts.train --config config.yaml