triprorep-650M
Structure-aware protein encoder, 650M parameters. ELECTRA-style corrective
MLM pre-training on 83.6M ATLAS + PDB structures. The encoder reads three
per-residue token streams (seq / bb / fa) and outputs a per-residue embedding
of dimension 1280 (fp16).
Architecture: embed_dim=1280, encoder_depth=30, encoder_heads=20.
Part of the k-fold-structure release.
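As a rough sanity check on the architecture values above, the per-head width and an order-of-magnitude parameter count can be derived from embed_dim, encoder_depth and encoder_heads. The sketch below assumes a standard transformer block (four d x d attention projections plus a two-layer MLP with 4x expansion, biases and norms ignored). The card does not state the MLP ratio, so MLP_RATIO is a hypothetical value and the estimate is illustrative only.

```python
# Architecture values quoted in this card.
EMBED_DIM = 1280
ENCODER_DEPTH = 30
ENCODER_HEADS = 20

# Width of each attention head.
head_dim = EMBED_DIM // ENCODER_HEADS

# ASSUMPTION: standard block with 4 attention projections (Q, K, V, out)
# and an MLP with 4x expansion; biases and layer norms are ignored.
MLP_RATIO = 4  # hypothetical, not stated in the card
attn_params = 4 * EMBED_DIM * EMBED_DIM
mlp_params = 2 * MLP_RATIO * EMBED_DIM * EMBED_DIM
block_params = attn_params + mlp_params
encoder_params = ENCODER_DEPTH * block_params

print(f"head_dim = {head_dim}")
print(f"approx. transformer-block parameters = {encoder_params / 1e6:.0f}M")
```

Under these assumptions the head width is 64 and the blocks account for roughly 590M parameters. The remainder of the quoted 650M would have to come from components this estimate leaves out, such as embeddings and output heads, or from a different MLP ratio.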
Files
- 650M.ckpt: full Lightning checkpoint.
- config.yaml: model + data config. The path fields are placeholders; point them at your local data.
- backbone_tokenizer.pt, fullatom_tokenizer.pt: structure tokenizers (PDB → token IDs). See Acknowledgements.
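The files listed above can also be fetched individually with huggingface_hub. This is a minimal sketch, assuming the file names sit at the repository root and that the repo id is the one used in the Usage section. fetch_files and load_config are hypothetical helpers, not part of the release code.

```python
# File names as listed in this card; repo id as used in the Usage section.
REPO_ID = "k-fold-structure/triprorep-650M"
FILES = [
    "650M.ckpt",
    "config.yaml",
    "backbone_tokenizer.pt",
    "fullatom_tokenizer.pt",
]


def fetch_files(repo_id=REPO_ID, filenames=FILES):
    """Download each file into the local Hugging Face cache.

    Returns a dict mapping file name -> local path. Hypothetical helper;
    assumes the files live at the root of the repository.
    """
    # Imported lazily so this module can be imported without the
    # dependency installed or network access available.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=repo_id, filename=name)
        for name in filenames
    }


def load_config(paths):
    """Load config.yaml with OmegaConf so placeholder paths can be edited."""
    from omegaconf import OmegaConf

    return OmegaConf.load(paths["config.yaml"])
```

Calling paths = fetch_files() followed by cfg = load_config(paths) gives a config object whose path fields can then be overwritten. Which keys hold those paths depends on the config itself and is not documented in this card.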
Usage
pip install torch huggingface_hub omegaconf numpy lmdb biotite
git clone https://github.com/<github-org>/k-fold-structure-release.git
cd k-fold-structure-release
import sys; sys.path.insert(0, "code/triprorep")
from inference import load_encoder, embed_pdb
encoder = load_encoder("650M", hf_repo="k-fold-structure/triprorep-650M")
features = embed_pdb(encoder, "your_protein.pdb",
                     hf_repo="k-fold-structure/triprorep-650M")
print(features.shape) # (L, 1280) fp16
embed_pdb downloads the bundled tokenizers from this repo on first call,
then runs PDB → (seq, bb, fa) tokens → encoder. If you already have token
IDs (e.g. from k-fold-structure/repsp-triprorep-tokens), call
encode(encoder, seq, bb, fa) directly. For CPU, pass device="cpu" to
load_encoder.
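The precomputed-token path described above might look like the following. This is a sketch under assumptions: load_encoder and encode are the functions named in this card, but the accepted input type for seq, bb and fa (lists, NumPy arrays or tensors) is not stated here. check_streams and embed_tokens_on_cpu are hypothetical helpers added for illustration. The length check relies only on the three streams being per-residue, so they are assumed to have equal length.

```python
def check_streams(seq, bb, fa):
    """Return the shared length of the three per-residue token streams.

    Hypothetical helper: raises ValueError if the streams differ in length.
    """
    lengths = {"seq": len(seq), "bb": len(bb), "fa": len(fa)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"token streams differ in length: {lengths}")
    return lengths["seq"]


def embed_tokens_on_cpu(seq, bb, fa, repo_root="."):
    """Encode precomputed token IDs on CPU.

    ASSUMPTION: repo_root points at a clone of the release repository, so
    that code/triprorep/inference.py is importable as in the Usage section.
    """
    import os
    import sys

    sys.path.insert(0, os.path.join(repo_root, "code", "triprorep"))
    from inference import load_encoder, encode  # names from this card

    check_streams(seq, bb, fa)
    encoder = load_encoder(
        "650M",
        hf_repo="k-fold-structure/triprorep-650M",
        device="cpu",
    )
    return encode(encoder, seq, bb, fa)
```

Validating the streams before loading the checkpoint keeps a length mismatch from surfacing only after the model has been downloaded and initialised.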
Acknowledgements
backbone_tokenizer.pt (AminoAseed VQ-VAE) is from StructTokenBench.
Citation
@misc{kfoldstructure,
title = {K-Fold Structure: Structure-Aware Protein Encoders and a Per-Residue Representation Benchmark},
author = {<authors>},
year = {2026},
url = {https://huggingface.co/k-fold-structure}
}
License
MIT