metadata
license: mit
ProtCompass Embeddings
Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.
Dataset Structure
embeddings/
├── secondary_structure/ # CB513 dataset (29 GB)
├── mutation_effect/ # ProteinGym DMS assays (4.5 GB)
├── contact_prediction/ # ProteinNet (2.9 GB)
├── stability/ # TAPE stability (1.6 GB)
├── ppi_site/ # PPI site prediction (1.4 GB)
├── fluorescence/ # GFP fluorescence (841 MB)
├── metal_binding/ # Metal binding sites (570 MB)
├── go_bp/ # GO Biological Process (214 MB)
├── go_mf/ # GO Molecular Function (68 MB)
├── remote_homology/ # SCOPe fold classification (20 MB)
├── ec_classification/ # Enzyme classification (18 MB)
├── membrane_soluble/ # Membrane/soluble (17 MB)
└── subcellular_localization/ # Subcellular location (17 MB)
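To check which encoders exist for a task before downloading anything, the repository file listing can be grouped by directory. The sketch below assumes the `embeddings/<task>/<encoder>/<file>` layout implied by the file paths in the Usage section. `encoders_by_task` and `list_remote_encoders` are hypothetical helper names, not something shipped with this dataset.

```python
from collections import defaultdict


def encoders_by_task(paths):
    """Group repo file paths of the form embeddings/<task>/<encoder>/<file>."""
    grouped = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # Skip anything that does not match the assumed four-level layout.
        if len(parts) == 4 and parts[0] == "embeddings":
            _, task, encoder, _ = parts
            grouped[task].add(encoder)
    return {task: sorted(encoders) for task, encoders in grouped.items()}


def list_remote_encoders(repo_id="Anonymoususer2223/ProtCompass_Embeddings"):
    """Fetch the file listing from the Hub (requires network access)."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import list_repo_files

    return encoders_by_task(list_repo_files(repo_id, repo_type="dataset"))
```

Calling `list_remote_encoders()` should return a dict mapping each task directory to a sorted list of encoder directory names, provided the layout assumption holds.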
File Format
Each encoder directory contains:
- train_embeddings.npy: Training set embeddings (N × D)
- test_embeddings.npy: Test set embeddings (M × D)
- train_labels.npy: Training labels
- test_labels.npy: Test labels
- train_ids.txt: Protein IDs for training set
- test_ids.txt: Protein IDs for test set
- meta.json: Metadata (encoder name, dimensions, dataset info)
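As a minimal sketch of how these files fit together, the helper below loads one split from a local copy of an encoder directory and checks that the row counts agree. It assumes one protein ID per line and that row i of the embeddings matches line i of the ID file and entry i of the labels. That may not hold for every task, so treat the checks as illustrative. `load_split` and `load_meta` are hypothetical names, and no particular keys are assumed in `meta.json`.

```python
import json
from pathlib import Path

import numpy as np


def load_split(encoder_dir, split):
    """Load embeddings, labels, and IDs for one split ("train" or "test")."""
    encoder_dir = Path(encoder_dir)
    embeddings = np.load(encoder_dir / f"{split}_embeddings.npy")
    labels = np.load(encoder_dir / f"{split}_labels.npy")
    # Assumption: the ID file holds one protein ID per line.
    text = (encoder_dir / f"{split}_ids.txt").read_text()
    ids = [line.strip() for line in text.splitlines() if line.strip()]

    # Assumption: embeddings, labels, and IDs are aligned row by row.
    n_rows = embeddings.shape[0]
    if len(ids) != n_rows:
        raise ValueError(f"{split}: {len(ids)} IDs but {n_rows} embedding rows")
    if labels.shape[0] != n_rows:
        raise ValueError(f"{split}: {labels.shape[0]} labels but {n_rows} embedding rows")
    return embeddings, labels, ids


def load_meta(encoder_dir):
    """Return meta.json as a plain dict without assuming any keys."""
    with open(Path(encoder_dir) / "meta.json") as handle:
        return json.load(handle)
```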
Usage
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
ENCODER_DIR = "embeddings/mutation_effect/esm2"

def load_array(name):
    # Download one .npy file from the dataset repo (cached locally) and load it
    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=f"{ENCODER_DIR}/{name}",
        repo_type="dataset",
    )
    return np.load(path)

# Download specific encoder embeddings and the matching labels
train_emb = load_array("train_embeddings.npy")
test_emb = load_array("test_embeddings.npy")
train_labels = load_array("train_labels.npy")
test_labels = load_array("test_labels.npy")

# Use for downstream tasks: fit a ridge regression probe and report R^2 on the test set
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
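Because the full dataset is 41 GB, it may be convenient to fetch a single encoder directory in one call rather than file by file. The sketch below uses `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter. `encoder_pattern` and `download_encoder` are hypothetical helper names, and the directory layout is assumed from the file paths used above.

```python
from pathlib import Path


def encoder_pattern(task, encoder):
    """Glob pattern matching every file in one encoder directory."""
    return f"embeddings/{task}/{encoder}/*"


def download_encoder(task, encoder, repo_id="Anonymoususer2223/ProtCompass_Embeddings"):
    """Download one encoder directory and return its local path (needs network)."""
    # Imported lazily so encoder_pattern works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_dir = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        # Only files matching this pattern are fetched; everything else is skipped.
        allow_patterns=[encoder_pattern(task, encoder)],
    )
    return Path(snapshot_dir) / "embeddings" / task / encoder
```

For example, `download_encoder("mutation_effect", "esm2")` would return the local folder holding that encoder's embeddings, labels, IDs, and `meta.json`.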
Encoders Included
Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh
Structure Encoders (50+)
Includes GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, and dMaSIF, among others.
Multimodal Encoders (5)
Includes SaProt, ESM-IF, and FoldVision, among others.
Baselines
Random, Length, Torsion, One-hot, BLOSUM
Dataset Statistics
- Total size: 41 GB
- Total encoders: 70+
- Total tasks: 13
- Total proteins: ~500K across all tasks
Citation
If you use these embeddings, please cite:
@article{protcompass2026,
title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
author={Your Name et al.},
journal={NeurIPS},
year={2026}
}
License
MIT License
Contact
For questions or issues, please open an issue on the repository.