app.py
CHANGED
|
@@ -16,8 +16,8 @@ print("Downloading ESM2 models...")
|
|
| 16 |
|
| 17 |
MODELS = {
|
| 18 |
"facebook/esm2_t6_8M_UR50D": "ESM2-8M",
|
| 19 |
-
"facebook/esm2_t12_35M_UR50D": "ESM2-35M",
|
| 20 |
-
"facebook/esm2_t33_650M_UR50D": "ESM2-650M"
|
| 21 |
}
|
| 22 |
|
| 23 |
cache_dirs = cache_all_models(MODELS)
|
|
@@ -27,26 +27,25 @@ models_and_tokenizers = load_all_models(MODELS)
|
|
| 27 |
# Create Gradio interface
|
| 28 |
with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
|
| 29 |
gr.Markdown("""
|
| 30 |
-
# ESM2
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
- Generate high-dimensional embeddings (1280-D) using ESM2-650M
|
| 37 |
-
- Download embeddings in NumPy format or as JSON metadata
|
| 38 |
-
- Supports batch processing for efficiency
|
| 39 |
-
|
| 40 |
-
**Instructions:**
|
| 41 |
-
1. Upload one or more FASTA files containing protein sequences
|
| 42 |
-
2. Click "Generate Embeddings"
|
| 43 |
-
3. Download the output files (embeddings.npz, metadata.json, summary.txt)
|
| 44 |
-
|
| 45 |
-
**Output Files:**
|
| 46 |
-
- `embeddings.npz`: Compressed NumPy file with all embeddings
|
| 47 |
-
- `metadata.json`: JSON file with sequence IDs and metadata
|
| 48 |
-
- `summary.txt`: Human-readable summary
|
| 49 |
-
- `embeddings_[filename].npz`: Per-file embeddings
|
| 50 |
""")
|
| 51 |
|
| 52 |
with gr.Row():
|
|
@@ -88,8 +87,8 @@ with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
|
|
| 88 |
)
|
| 89 |
with gr.TabItem("Calculate Pseudo-Perplexity scores"):
|
| 90 |
with gr.Row():
|
| 91 |
-
ppl_button = gr.Button("Calculate Exact
|
| 92 |
-
ppl_approx_button = gr.Button("Calculate Approximate
|
| 93 |
ppl_status = gr.Textbox(
|
| 94 |
label="Waiting for pseudo-perplexity calculation...",
|
| 95 |
interactive=False,
|
|
@@ -133,25 +132,6 @@ with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
|
|
| 133 |
)
|
| 134 |
|
| 135 |
|
| 136 |
-
|
| 137 |
-
gr.Markdown("""
|
| 138 |
-
### How to use the embeddings:
|
| 139 |
-
|
| 140 |
-
```python
|
| 141 |
-
import numpy as np
|
| 142 |
-
import json
|
| 143 |
-
|
| 144 |
-
# Load embeddings
|
| 145 |
-
embeddings = np.load('embeddings.npz')
|
| 146 |
-
|
| 147 |
-
# Access a specific embedding
|
| 148 |
-
embedding = embeddings['file_name_sequence_id']
|
| 149 |
-
|
| 150 |
-
# Load metadata
|
| 151 |
-
with open('metadata.json', 'r') as f:
|
| 152 |
-
metadata = json.load(f)
|
| 153 |
-
```
|
| 154 |
-
""")
|
| 155 |
|
| 156 |
|
| 157 |
if __name__ == "__main__":
|
|
|
|
| 16 |
|
| 17 |
MODELS = {
|
| 18 |
"facebook/esm2_t6_8M_UR50D": "ESM2-8M",
|
| 19 |
+
#"facebook/esm2_t12_35M_UR50D": "ESM2-35M",
|
| 20 |
+
#"facebook/esm2_t33_650M_UR50D": "ESM2-650M"
|
| 21 |
}
|
| 22 |
|
| 23 |
cache_dirs = cache_all_models(MODELS)
|
|
|
|
| 27 |
# Create Gradio interface
|
| 28 |
with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
|
| 29 |
gr.Markdown("""
|
| 30 |
+
# ESM2 for candidate sequence filtering 🤖
|
| 31 |
+
|
| 32 |
+
Once you have generated de novo protein sequences with a tool like LigandMPNN, you need to rank them to select promising candidates for experimental validation. One powerful approach is to use protein language models like Meta's ESM2.
|
| 33 |
+
These language models rely on a BERT-like architecture and a Masked Language Modeling (MLM) objective to learn rich representations of protein sequences. ESM2 can be used for two main purposes in protein design:
|
| 34 |
+
1. **Generating embeddings**: ESM2's hidden layers create high-dimensional representations of protein sequences that capture structural and functional information.
|
| 35 |
+
These embeddings can be used as input features for downstream machine learning models that predict function, other properties, or even structure.
|
| 36 |
+
Embeddings can also be projected with dimensionality reduction techniques like t-SNE to visualize sequences, identify clusters, or compare candidates against known proteins.
|
| 37 |
+
2. **Calculating pseudo-perplexity (PPL) scores**: The lower this score is for a given input sequence, the more "natural" or "plausible" the sequence is according to the model's learned distribution.
|
| 38 |
+
Such scores are often used as a filtering criterion in de novo design, as sequences with lower PPL are more likely to express properly in the lab and fold into stable structures.
|
| 39 |
+
PPL scores provide an orthogonal evaluation metric to structure-based methods like RoseTTAFold.
|
| 40 |
|
| 41 |
+
## How to use this Space:
|
| 42 |
+
- **Choose the ESM2 model:** models mainly differ by the number of parameters (8M, 35M, 650M). Larger models generally give more reliable PPL scores and richer embeddings but take longer to run.
|
| 43 |
+
- **Upload one or more FASTA files** containing your candidate sequences.
|
| 44 |
+
- **Choose the batch size:** it controls how many sequences are processed together. Larger batch sizes can speed up processing but require more GPU memory.
|
| 45 |
+
- **Choose between generating embeddings or calculating pseudo-perplexity scores.**
|
| 46 |
|
| 47 |
+
Note that calculating PPL scores is much more computationally intensive than generating embeddings: it scales cubically with sequence length $L$. This is because calculating PPL requires $L$ forward passes through the model, each with a different token masked out, and the cost of each forward pass itself grows quadratically with $L$ because of self-attention.
|
| 48 |
+
For long sequences or large numbers of sequences, we recommend using the approximate PPL calculation, which masks 10% of tokens at a time, so the number of forward passes no longer grows with $L$ and the cost scales only quadratically with sequence length. This provides a good tradeoff between accuracy and runtime.
|
| 49 |
""")
|
| 50 |
|
| 51 |
with gr.Row():
|
|
|
|
| 87 |
)
|
| 88 |
with gr.TabItem("Calculate Pseudo-Perplexity scores"):
|
| 89 |
with gr.Row():
|
| 90 |
+
ppl_button = gr.Button("Calculate Exact PPL", variant="primary", size="lg")
|
| 91 |
+
ppl_approx_button = gr.Button("Calculate Approximate PPL", variant="primary", size="lg")
|
| 92 |
ppl_status = gr.Textbox(
|
| 93 |
label="Waiting for pseudo-perplexity calculation...",
|
| 94 |
interactive=False,
|
|
|
|
| 132 |
)
|
| 133 |
|
| 134 |
|
| 135 |
|
| 136 |
|
| 137 |
if __name__ == "__main__":
|
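The exact and approximate pseudo-perplexity modes described in the new Space text can be sketched as follows. This is a minimal illustration, not the code in `app.py`: the `Scorer` interface, `mask_groups`, `pseudo_perplexity`, and `uniform_scorer` are hypothetical names, and the interleaved grouping of masked positions is an assumption about how "10% of tokens at a time" could be implemented. The score is $\exp(-\frac{1}{L}\sum_i \log p(x_i \mid x_{\setminus i}))$, and the only difference between the two modes is how many positions are masked per forward pass.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical scorer interface (not the Space's API): given the full sequence
# and the positions to mask in ONE forward pass, return the log-probability
# the model assigns to the true residue at each masked position.
Scorer = Callable[[Sequence[str], Sequence[int]], Sequence[float]]


def mask_groups(length: int, mask_fraction: Optional[float]) -> List[List[int]]:
    """Split positions 0..length-1 into groups, one group per forward pass."""
    if mask_fraction is None:
        # Exact mode: one masked position per pass, so `length` passes.
        return [[i] for i in range(length)]
    # Approximate mode: about 1/mask_fraction passes, independent of length.
    n_groups = min(length, max(1, round(1 / mask_fraction)))
    # Interleave positions so the residues masked together are spread out
    # (an assumption; the Space may group positions differently).
    return [list(range(start, length, n_groups)) for start in range(n_groups)]


def pseudo_perplexity(
    seq: Sequence[str], scorer: Scorer, mask_fraction: Optional[float] = None
) -> Tuple[float, int]:
    """Return (pseudo-perplexity, number of forward passes used)."""
    if len(seq) == 0:
        raise ValueError("sequence must not be empty")
    groups = mask_groups(len(seq), mask_fraction)
    total_log_prob = 0.0
    for positions in groups:
        # Each call stands in for one masked forward pass of the model.
        total_log_prob += sum(scorer(seq, positions))
    return math.exp(-total_log_prob / len(seq)), len(groups)


def uniform_scorer(seq: Sequence[str], positions: Sequence[int]) -> List[float]:
    """Toy stand-in for a model: uniform over the 20 standard amino acids."""
    return [math.log(1 / 20) for _ in positions]


seq = "ACDEFGHIKLMNPQRSTVWY" * 3  # 60 residues
exact_ppl, exact_passes = pseudo_perplexity(seq, uniform_scorer)
approx_ppl, approx_passes = pseudo_perplexity(seq, uniform_scorer, mask_fraction=0.1)
print(exact_passes, approx_passes)
```

In a real setup the scorer would wrap a masked-language-model forward pass; with the toy uniform scorer both modes give the same score, and only the number of passes differs (one per residue versus a fixed number of groups).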