Anonymoususer2223 committed on
Commit 8c2c758 · 1 Parent(s): 96d15a8
Files changed (1)
  1. README.md +102 -0
README.md CHANGED
@@ -1,3 +1,105 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
  license: mit
3
  ---
4
+
5
+ # ProtCompass Embeddings
6
+
7
+ Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.
8
+
9
+ ## Dataset Structure
10
+
11
+ ```
12
+ embeddings/
13
+ ├── secondary_structure/ # CB513 dataset (29 GB)
14
+ ├── mutation_effect/ # ProteinGym DMS assays (4.5 GB)
15
+ ├── contact_prediction/ # ProteinNet (2.9 GB)
16
+ ├── stability/ # TAPE stability (1.6 GB)
17
+ ├── ppi_site/ # PPI site prediction (1.4 GB)
18
+ ├── fluorescence/ # GFP fluorescence (841 MB)
19
+ ├── metal_binding/ # Metal binding sites (570 MB)
20
+ ├── go_bp/ # GO Biological Process (214 MB)
21
+ ├── go_mf/ # GO Molecular Function (68 MB)
22
+ ├── remote_homology/ # SCOPe fold classification (20 MB)
23
+ ├── ec_classification/ # Enzyme classification (18 MB)
24
+ ├── membrane_soluble/ # Membrane/soluble (17 MB)
25
+ └── subcellular_localization/ # Subcellular location (17 MB)
26
+ ```
27
+
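The layout above nests one directory per encoder inside each task directory, as in `embeddings/mutation_effect/esm2/`. A minimal sketch for indexing a list of repository paths by task and encoder follows. The helper name `group_by_task_and_encoder` and the example paths are illustrative, and the four-part layout `embeddings/<task>/<encoder>/<file>` is an assumption inferred from the usage example in this README.

```python
from collections import defaultdict


def group_by_task_and_encoder(paths):
    """Map task -> encoder -> sorted file names.

    Assumes every embedding file sits at embeddings/<task>/<encoder>/<file>.
    """
    index = defaultdict(lambda: defaultdict(list))
    for path in paths:
        parts = path.split("/")
        # Skip anything that does not match the assumed four-part layout
        if len(parts) != 4 or parts[0] != "embeddings":
            continue
        _, task, encoder, filename = parts
        index[task][encoder].append(filename)
    return {
        task: {encoder: sorted(files) for encoder, files in encoders.items()}
        for task, encoders in index.items()
    }


# Illustrative paths only; the real list comes from the repository itself
example_paths = [
    "README.md",
    "embeddings/mutation_effect/esm2/train_embeddings.npy",
    "embeddings/mutation_effect/esm2/test_embeddings.npy",
    "embeddings/stability/esm2/meta.json",
]
index = group_by_task_and_encoder(example_paths)
for task in sorted(index):
    print(task, sorted(index[task]))
```

The actual file list can be fetched with `huggingface_hub.list_repo_files("Anonymoususer2223/ProtCompass_Embeddings", repo_type="dataset")` and passed to the helper in place of `example_paths`.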
28
+ ## File Format
29
+
30
+ Each encoder directory contains:
31
+ - `train_embeddings.npy`: Training set embeddings (N × D)
32
+ - `test_embeddings.npy`: Test set embeddings (M × D)
33
+ - `train_labels.npy`: Training labels
34
+ - `test_labels.npy`: Test labels
35
+ - `train_ids.txt`: Protein IDs for training set
36
+ - `test_ids.txt`: Protein IDs for test set
37
+ - `meta.json`: Metadata (encoder name, dimensions, dataset info)
38
+
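The ID and metadata files listed above can be read with the standard library. The sketch below assumes that each `*_ids.txt` file holds one protein ID per line and that `meta.json` is a single JSON object; neither detail is stated in this README, and the helper names are illustrative.

```python
import json


def read_ids(path):
    """Read protein IDs, assuming one ID per line; blank lines are skipped."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def read_meta(path):
    """Load meta.json as a dictionary without assuming any particular keys."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_alignment(ids, n_rows, split):
    """Raise if the number of IDs differs from the number of embedding rows."""
    if len(ids) != n_rows:
        raise ValueError(f"{split}: {len(ids)} IDs but {n_rows} embedding rows")
```

After loading an embedding matrix with `np.load`, calling `check_alignment(read_ids("train_ids.txt"), train_emb.shape[0], "train")` is a cheap way to confirm that row `i` of the matrix corresponds to ID `i` before fitting anything.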
39
+ ## Usage
40
+
41
+ ```python
42
+ import numpy as np
43
+ from huggingface_hub import hf_hub_download
44
+ from sklearn.linear_model import Ridge
45
+
46
+ def load_npy(name, task="mutation_effect", encoder="esm2"):
47
+     """Download one .npy file for a task/encoder pair and load it."""
48
+     return np.load(hf_hub_download(
49
+         repo_id="Anonymoususer2223/ProtCompass_Embeddings",
50
+         filename=f"embeddings/{task}/{encoder}/{name}.npy",
51
+         repo_type="dataset",
52
+     ))
53
+
54
+ # Embeddings and their labels are stored as separate files
55
+ train_emb, test_emb = load_npy("train_embeddings"), load_npy("test_embeddings")
56
+ train_labels, test_labels = load_npy("train_labels"), load_npy("test_labels")
57
+
58
+ # Fit a ridge-regression probe on the frozen embeddings
59
+ model = Ridge()
60
+ model.fit(train_emb, train_labels)
61
+ # Ridge.score returns R^2 on the held-out test set
62
+ score = model.score(test_emb, test_labels)
63
+ ```
64
+
65
+ ## Encoders Included
66
+
67
+ ### Sequence Encoders (8)
68
+ ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh
69
+
70
+ ### Structure Encoders (50+)
71
+ GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF
72
+
73
+ ### Multimodal Encoders (5)
74
+ SaProt, ESM-IF, FoldVision
75
+
76
+ ### Baselines
77
+ Random, Length, Torsion, One-hot, BLOSUM
78
+
79
+ ## Dataset Statistics
80
+
81
+ - **Total size**: 41 GB
82
+ - **Total encoders**: 70+
83
+ - **Total tasks**: 13
84
+ - **Total proteins**: ~500K across all tasks
85
+
86
+ ## Citation
87
+
88
+ If you use these embeddings, please cite:
89
+
90
+ ```bibtex
91
+ @article{protcompass2026,
92
+ title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
93
+ author={Your Name et al.},
94
+ journal={NeurIPS},
95
+ year={2026}
96
+ }
97
+ ```
98
+
99
+ ## License
100
+
101
+ MIT License
102
+
103
+ ## Contact
104
+
105
+ For questions or issues, please open an issue on the repository.