nanoprot-gpt2-M
A protein language model (gpt2 architecture) with 135.4M (135,390,414) parameters, trained on 1.62B (1,624,684,968) UniRef50 residues.
nanoprot-gpt2-M is part of the nanoprot suite: a Pythia-style matrix of protein
language models spanning three architectures (gpt2, esm2, mamba) and four
scales (XS/S/M/L), each trained from scratch on UniRef50 under a matched,
Chinchilla-style data budget. The suite is built for controlled comparison:
same data, same tokenizer, one variable at a time.
Headline result
Validation bits-per-residue (lower is better): 3.6838 ± 0.0013 (n=3 seeds)
Directly comparable to other autoregressive nanoprot models (gpt2, mamba): same 33-token vocabulary, same AR objective, same data budget. Not comparable to the esm2 (masked-LM) models, whose metric is a different quantity.
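For intuition about the scale of this number, the sketch below converts bits-per-residue into a per-residue perplexity and compares it with uniform-guess baselines. It assumes the metric is the mean per-residue cross-entropy expressed in base 2, which this card does not spell out; the helper names are illustrative and are not part of nanoprot.

```python
import math


def nats_to_bits(loss_nats: float) -> float:
    """Convert a cross-entropy measured in nats to bits."""
    return loss_nats / math.log(2)


def perplexity_from_bits(bits: float) -> float:
    """Effective number of equally likely choices per residue."""
    return 2.0 ** bits


reported_bpr = 3.6838  # headline value from this card

# Baselines: a model that guesses uniformly at random.
uniform_20 = math.log2(20)  # over the 20 standard amino acids
uniform_33 = math.log2(33)  # over the full 33-token vocabulary

ppl = perplexity_from_bits(reported_bpr)
print(f"perplexity per residue : {ppl:.2f}")
print(f"uniform over 20 tokens : {uniform_20:.4f} bits")
print(f"uniform over 33 tokens : {uniform_33:.4f} bits")
```

Under that assumption, the reported value sits below both uniform baselines, which is the minimum one would expect from a trained model.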
Model details
| Architecture | gpt2 |
| Objective | AR (causal-language-modeling) |
| Scale rung | M |
| Parameters | 135.4M (135,390,414) |
| Layers (depth) | 14 |
| Hidden size (d_model) | 896 |
| Attention heads | 14 |
| Max sequence length | 512 |
| Vocabulary | 33-token residue alphabet (ESM-2) |
| MLP activation | relu_squared |
| Logit softcap | 15.0 |
| Window pattern | L |
| Precision | bf16 |
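Two rows in the table name operations that are easy to state in code. The sketch below is scalar and illustrative only: `relu_squared` follows directly from its name, while the tanh form of the soft cap is an assumption based on the common convention, not something this card confirms about nanoprot's implementation.

```python
import math

LOGIT_SOFTCAP = 15.0  # value taken from the table above


def relu_squared(x: float) -> float:
    """ReLU followed by squaring: max(0, x) ** 2."""
    return max(0.0, x) ** 2


def softcap(logit: float, cap: float = LOGIT_SOFTCAP) -> float:
    """Assumed tanh soft cap: smoothly bounds a logit to [-cap, cap]."""
    return cap * math.tanh(logit / cap)


for z in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(f"z={z:7.1f}  relu_squared={relu_squared(z):9.1f}  softcap={softcap(z):8.4f}")
```

Small logits pass through almost unchanged, while large ones saturate near the cap, which keeps the output distribution from becoming arbitrarily sharp.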
Training
| Data | UniRef50 release 2026_01 (28-Jan-2026), 60,251,814 sequences; held-out final shard for validation |
| Tokenizer | esm2, 33-token residue alphabet (shared across the whole suite) |
| Optimizer | Muon (matrices) + AdamW (embeddings/scalars), weight_decay=0.1 |
| Batch size | 524,288 residues/step |
| Optimizer steps | 3,098 |
| Residues seen | 1.62B (1,624,684,968) |
| Data/param ratio | 12.0 residues per parameter (Chinchilla-style) |
| Total FLOPs | 1.441e+18 |
| Wall-clock | 0.69 h (41 min) on 4 GPUs |
| Seed | 0 (siblings: see below) |
| nanoprot version | 0.5.0 |
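The budget figures in this table can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from this card. Two things in it are assumptions: that the step count is the residue budget floor-divided by the batch size, and the use of the standard 6·N·D rule of thumb for training FLOPs, which is not claimed to be how the card's FLOPs figure was computed.

```python
N_PARAMS = 135_390_414      # parameters, from this card
RESIDUES_PER_PARAM = 12     # data budget ratio, from this card
BATCH_RESIDUES = 524_288    # residues per optimizer step, from this card
REPORTED_STEPS = 3_098
REPORTED_FLOPS = 1.441e18

# Residue budget implied by the ratio.
target_residues = N_PARAMS * RESIDUES_PER_PARAM

# Assumption: only whole batches are run, so the step count is a floor.
full_steps = target_residues // BATCH_RESIDUES
residues_in_full_steps = full_steps * BATCH_RESIDUES

# Rough rule-of-thumb estimate: 6 FLOPs per parameter per residue.
six_nd = 6 * N_PARAMS * target_residues

print(f"target residues        : {target_residues:,}")
print(f"whole steps            : {full_steps:,} (card reports {REPORTED_STEPS:,})")
print(f"residues in those steps: {residues_in_full_steps:,}")
print(f"6*N*D estimate         : {six_nd:.3e} (card reports {REPORTED_FLOPS:.3e})")
```

The residue budget and step count reproduce the table exactly under these assumptions. The 6·N·D estimate comes out somewhat below the reported total, and the sketch does not try to account for the gap.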
Evaluation
Evaluated on a held-out UniRef50 shard. Validation bits-per-residue (lower is better): 3.6838 ± 0.0013 (n=3 seeds).
Directly comparable to other autoregressive nanoprot models (gpt2, mamba): same 33-token vocabulary, same AR objective, same data budget. Not comparable to the esm2 (masked-LM) models, whose metric is a different quantity.
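The aggregate above can be reproduced from the per-seed values listed under "The nanoprot suite". This is a minimal sketch, assuming the ± term is the sample standard deviation (n-1 in the denominator); the population form would round to 0.0010 instead, so the sample convention is the one that matches the reported figure.

```python
import statistics

# Final bits-per-residue for the three sibling seeds, from this card.
final_bpr = {"s0": 3.6835, "s1": 3.6852, "s2": 3.6827}

values = list(final_bpr.values())
mean = statistics.mean(values)
spread = statistics.stdev(values)  # sample standard deviation (assumed convention)

print(f"{mean:.4f} ± {spread:.4f} (n={len(values)} seeds)")
# prints: 3.6838 ± 0.0013 (n=3 seeds)
```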
Intended use & limitations
Research use: learning protein representations, extracting residual-stream features, mechanistic-interpretability probing, and architecture comparison. Trained only on UniRef50 sequences; not for clinical or diagnostic use, and not aligned to any downstream task out of the box.
How to load
Install nanoprot (pip install nanoprot), download this repo, and point the
arch-aware loader at the folder. It works for any nanoprot architecture
(gpt2 / esm2 / mamba), reading the embedded config and selecting the right
tokenizer automatically.
from nanoprot.training.checkpoint import load_pretrained
model, cfg, meta, tokenizer = load_pretrained(
    "path/to/this/repo", device="cpu", return_tokenizer=True,
)
model.eval()
# meta carries the trained-artifact facts (params, FLOPs, val metric, ...)
The nanoprot suite
Hub: yagizdevre/nanoprot-gpt2-M (seed 0 is the default; siblings on branches seed1,
seed2). This model's sibling seeds:
- nanoprot-gpt2-M-s0: final bits-per-residue 3.6835
- nanoprot-gpt2-M-s1: final bits-per-residue 3.6852
- nanoprot-gpt2-M-s2: final bits-per-residue 3.6827
The full suite spans {gpt2, esm2, mamba} x {XS, S, M, L} x {seed 0,1,2}.
See the nanoprot repository for the
complete grid and the scaling-curve comparisons.
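Since the sibling seeds live on branches of the same Hub repo, one way to fetch a specific seed is to pass the branch name as the revision. The sketch below uses `snapshot_download` from the `huggingface_hub` package. The helper names are hypothetical, and mapping seed 0 to the `main` branch is an assumption based on the card calling it "the default".

```python
def revision_for_seed(seed: int) -> str:
    """Map a seed index to a Hub revision (assumes seed 0 lives on 'main')."""
    if seed not in (0, 1, 2):
        raise ValueError(f"this card lists seeds 0, 1, 2; got {seed}")
    return "main" if seed == 0 else f"seed{seed}"


def download_seed(seed: int, repo_id: str = "yagizdevre/nanoprot-gpt2-M") -> str:
    """Download one seed's files and return the local folder path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, revision=revision_for_seed(seed))


if __name__ == "__main__":
    for s in (0, 1, 2):
        print(s, "->", revision_for_seed(s))
```

The folder returned by `download_seed` could then be passed to `load_pretrained` as shown under "How to load".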
Citation
@software{nanoprot,
author = {Devre, H. Yagiz},
title = {nanoprot: a minimal training framework for protein language models},
year = {2026},
url = {https://github.com/ygzdvr/nanoprot}
}
Reproducibility
Trained with nanoprot v0.5.0 (corpus prepared 2026-06-01T12:51:52+00:00). The exact, complete training
config is in config.yaml (also embedded in meta_003098.json).
Re-train with:
python -m scripts.train --config config.yaml