triprorep-650M

Structure-aware protein encoder, 650M parameters. ELECTRA-style corrective MLM pre-training on 83.6M ATLAS + PDB structures. The encoder reads three per-residue token streams, sequence (seq), backbone (bb), and full-atom (fa), and outputs a per-residue embedding of dimension 1280 (fp16).

Architecture: embed_dim=1280, encoder_depth=30, encoder_heads=20.
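As a quick sanity check on these hyperparameters, the sketch below derives the per-head dimension and a rough parameter count for the encoder blocks. It assumes a standard Transformer block (four attention projections plus a two-layer feed-forward network with 4x expansion) and ignores biases, layer norms, and embeddings. The `ffn_mult` value is an assumption, not something stated in this card, so the actual layout in `config.yaml` may differ.

```python
# Hyperparameters as listed in this model card.
embed_dim = 1280
encoder_depth = 30
encoder_heads = 20

# Multi-head attention splits the embedding evenly across heads.
assert embed_dim % encoder_heads == 0
head_dim = embed_dim // encoder_heads

# ASSUMPTION: feed-forward hidden size is 4 * embed_dim (not stated in the card).
ffn_mult = 4

# Per block: Q, K, V, and output projections, then two feed-forward matrices.
attn_params = 4 * embed_dim ** 2
ffn_params = 2 * ffn_mult * embed_dim ** 2
block_params = encoder_depth * (attn_params + ffn_params)

print(head_dim)                        # 64
print(f"{block_params / 1e6:.1f}M")    # 589.8M
```

Under these assumptions the encoder blocks account for roughly 590M weights; the remainder of the stated 650M would come from components this estimate leaves out.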

Part of the k-fold-structure release.

Files

650M.ckpt: full Lightning checkpoint.
config.yaml: model + data config. The path fields are placeholders; point them at your local data.
backbone_tokenizer.pt, fullatom_tokenizer.pt: structure tokenizers (PDB → token IDs). See Acknowledgements.

Usage

pip install torch huggingface_hub omegaconf numpy lmdb biotite
git clone https://github.com/<github-org>/k-fold-structure-release.git
cd k-fold-structure-release

# Run from the repository root so the relative path below resolves.
import sys
sys.path.insert(0, "code/triprorep")
from inference import load_encoder, embed_pdb

encoder  = load_encoder("650M", hf_repo="k-fold-structure/triprorep-650M")
features = embed_pdb(encoder, "your_protein.pdb",
                     hf_repo="k-fold-structure/triprorep-650M")
print(features.shape)   # (L, 1280) fp16

embed_pdb downloads the bundled tokenizers from this repo on first call, then runs PDB → (seq, bb, fa) tokens → encoder. If you already have token IDs (e.g. from k-fold-structure/repsp-triprorep-tokens), call encode(encoder, seq, bb, fa) directly. For CPU, pass device="cpu" to load_encoder.
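Many downstream tasks need one vector per protein rather than one per residue. The sketch below shows one common option, mean pooling over the residue axis, with the average computed in fp32 to limit fp16 rounding error. `mean_pool` is a hypothetical helper, not part of this release, and the random array is only a stand-in for the `(L, 1280)` output of `embed_pdb`. It assumes the features are available as a NumPy array; if `embed_pdb` returns a torch tensor, convert it first with `features.cpu().numpy()`.

```python
import numpy as np


def mean_pool(features: np.ndarray) -> np.ndarray:
    """Average per-residue embeddings of shape (L, D) into one (D,) vector.

    Hypothetical helper for illustration; not part of the release code.
    """
    if features.ndim != 2:
        raise ValueError(f"expected a 2-D (L, D) array, got shape {features.shape}")
    # Upcast before averaging so the sum is accumulated in fp32, not fp16.
    return features.astype(np.float32).mean(axis=0)


# Stand-in for real encoder output: 120 residues, 1280 dimensions, fp16.
rng = np.random.default_rng(0)
fake_features = rng.standard_normal((120, 1280)).astype(np.float16)

pooled = mean_pool(fake_features)
print(pooled.shape, pooled.dtype)   # (1280,) float32
```

Mean pooling is only one choice; per-residue tasks should use the `(L, 1280)` features directly.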

Acknowledgements

backbone_tokenizer.pt (aminoaseed VQ-VAE) is from StructTokenBench.

Citation

@misc{kfoldstructure,
  title  = {K-Fold Structure: Structure-Aware Protein Encoders and a Per-Residue Representation Benchmark},
  author = {<authors>},
  year   = {2026},
  url    = {https://huggingface.co/k-fold-structure}
}

License

MIT
