gabboud committed on
Commit fdc5e1b · 1 Parent(s): 630d8be

description text

Files changed (1)
  1. app.py +21 -41
app.py CHANGED
@@ -16,8 +16,8 @@ print("Downloading ESM2 models...")
 
 MODELS = {
     "facebook/esm2_t6_8M_UR50D": "ESM2-8M",
-    "facebook/esm2_t12_35M_UR50D": "ESM2-35M",
-    "facebook/esm2_t33_650M_UR50D": "ESM2-650M"
 }
 
 cache_dirs = cache_all_models(MODELS)
@@ -27,26 +27,25 @@ models_and_tokenizers = load_all_models(MODELS)
 # Create Gradio interface
 with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
     gr.Markdown("""
-    # ESM2 Protein Sequence Embeddings
-
-    Generate embeddings for protein sequences using Meta's ESM2 language model.
-
-    **Features:**
-    - Process one or multiple FASTA files
-    - Generate high-dimensional embeddings (1280-D) using ESM2-650M
-    - Download embeddings in NumPy format or as JSON metadata
-    - Supports batch processing for efficiency
-
-    **Instructions:**
-    1. Upload one or more FASTA files containing protein sequences
-    2. Click "Generate Embeddings"
-    3. Download the output files (embeddings.npz, metadata.json, summary.txt)
-
-    **Output Files:**
-    - `embeddings.npz`: Compressed NumPy file with all embeddings
-    - `metadata.json`: JSON file with sequence IDs and metadata
-    - `summary.txt`: Human-readable summary
-    - `embeddings_[filename].npz`: Per-file embeddings
     """)
 
     with gr.Row():
@@ -88,8 +87,8 @@ with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
         )
     with gr.TabItem("Calculate Pseudo-Perplexity scores"):
         with gr.Row():
-            ppl_button = gr.Button("Calculate Exact Pseudo-Perplexity", variant="primary", size="lg")
-            ppl_approx_button = gr.Button("Calculate Approximate Pseudo-Perplexity", variant="primary", size="lg")
         ppl_status = gr.Textbox(
             label="Waiting for pseudo-perplexity calculation...",
             interactive=False,
@@ -133,25 +132,6 @@ with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
         )
 
 
-
-    gr.Markdown("""
-    ### How to use the embeddings:
-
-    ```python
-    import numpy as np
-    import json
-
-    # Load embeddings
-    embeddings = np.load('embeddings.npz')
-
-    # Access a specific embedding
-    embedding = embeddings['file_name_sequence_id']
-
-    # Load metadata
-    with open('metadata.json', 'r') as f:
-        metadata = json.load(f)
-    ```
-    """)
 
 
 if __name__ == "__main__":
 
 
 MODELS = {
     "facebook/esm2_t6_8M_UR50D": "ESM2-8M",
+    #"facebook/esm2_t12_35M_UR50D": "ESM2-35M",
+    #"facebook/esm2_t33_650M_UR50D": "ESM2-650M"
 }
 
 cache_dirs = cache_all_models(MODELS)
 
 # Create Gradio interface
 with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
     gr.Markdown("""
+    # ESM2 for candidate sequence filtering 🤖
+
+    After generating de novo protein sequences with a tool like LigandMPNN, you need to rank them to select promising candidates for experimental validation. One powerful approach is to use a protein language model such as Meta's ESM2.
+    These models rely on a BERT-like architecture and a Masked Language Modeling (MLM) objective to learn rich representations of protein sequences. ESM2 can be used for two main purposes in the context of protein design:
+    1. **Generating embeddings**: ESM2's hidden layers create high-dimensional representations of protein sequences that capture structural and functional information.
+    These embeddings can serve as input features for downstream machine learning models that predict function or other properties, or even for structure prediction.
+    They can also be combined with dimensionality reduction techniques like t-SNE to visualize the sequence space, identify clusters, or compare candidates against known proteins.
+    2. **Calculating pseudo-perplexity (PPL) scores**: the lower this score is for a given input sequence, the more "natural" or "plausible" the sequence is under the model's learned distribution.
+    PPL is often used as a filtering criterion in de novo design, since sequences with lower scores are more likely to express properly in the lab and fold into stable structures.
+    PPL also provides an evaluation metric orthogonal to structure-based methods like RoseTTAFold.
+
+    ## How to use this Space:
+    - **Choose the ESM2 model:** the models differ mainly in parameter count (8M, 35M, 650M). Larger models produce more reliable PPL scores and richer embeddings, but take longer to run.
+    - **Upload one or more FASTA files** containing your candidate sequences.
+    - **Choose the batch size:** this controls how many sequences are processed together. Larger batches can speed up processing but require more GPU memory.
+    - **Choose between generating embeddings and calculating pseudo-perplexity scores.**
+
+    Note that calculating PPL is much more computationally intensive than generating embeddings: its cost scales cubically with sequence length $L$, because exact PPL requires $L$ forward passes through the model, each with a different token masked out.
+    For long sequences or large numbers of sequences, we recommend the approximate PPL calculation, which masks 10% of tokens at a time and therefore scales only quadratically with sequence length. This provides a good tradeoff between accuracy and runtime.
     """)
 
     with gr.Row():
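
The pseudo-perplexity described in the hunk above can be sketched in a model-agnostic way. This is a minimal illustration, not the app's implementation: `pseudo_perplexity` is a hypothetical helper that assumes the per-position log-probabilities log p(x_i | x_\i) have already been collected, one masked forward pass per position.

```python
import math

def pseudo_perplexity(token_log_probs):
    """Exact pseudo-perplexity: exp of the negative mean
    log-probability of each token given all the others.

    token_log_probs[i] = log p(x_i | x_\\i), obtained by masking
    position i and running one forward pass (L passes in total).
    """
    L = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / L)

# Toy check: a model that assigns probability 0.5 to every residue
# yields a pseudo-perplexity of exactly 2.
uniform = [math.log(0.5)] * 10
print(pseudo_perplexity(uniform))  # → 2.0
```

Lower values mean the sequence is more plausible under the model, which is why the score works as a ranking criterion.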
 
         )
     with gr.TabItem("Calculate Pseudo-Perplexity scores"):
         with gr.Row():
+            ppl_button = gr.Button("Calculate Exact PPL", variant="primary", size="lg")
+            ppl_approx_button = gr.Button("Calculate Approximate PPL", variant="primary", size="lg")
         ppl_status = gr.Textbox(
             label="Waiting for pseudo-perplexity calculation...",
             interactive=False,
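
The two buttons differ in their masking schedule: exact PPL masks one position per forward pass ($L$ passes), while the approximate variant masks ~10% of positions per pass (about 10 passes, whatever the length). A sketch under those assumptions; the function names are illustrative, and a real implementation may pick the masked positions randomly rather than in contiguous blocks:

```python
def exact_mask_schedule(L):
    """One forward pass per position: L passes total."""
    return [[i] for i in range(L)]

def approx_mask_schedule(L, fraction=0.10):
    """Mask ~`fraction` of positions per pass, so every position is
    masked exactly once across roughly 1/fraction passes."""
    step = max(1, int(L * fraction))
    return [list(range(start, min(start + step, L)))
            for start in range(0, L, step)]

L = 120
print(len(exact_mask_schedule(L)))   # 120 forward passes
print(len(approx_mask_schedule(L)))  # 10 forward passes
```

Since each forward pass is itself O(L²) in attention, the exact schedule gives the cubic cost and the approximate one the quadratic cost mentioned in the app's description.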
 
         )
 
 
 
 
 if __name__ == "__main__":
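
Downstream of the Space, the generated `embeddings.npz` and `metadata.json` can be loaded with NumPy and `json`, as in the usage snippet shown in the diff. This sketch first fabricates toy files so it runs standalone; the key name `file_name_sequence_id` and the 320-dimensional vector (ESM2-8M's hidden width) are illustrative assumptions, not guaranteed output of app.py:

```python
import json
import numpy as np

# Fabricate a toy archive and metadata file (stand-ins for the app's output).
np.savez_compressed('embeddings.npz',
                    file_name_sequence_id=np.zeros(320, dtype=np.float32))
with open('metadata.json', 'w') as f:
    json.dump({'file_name_sequence_id': {'length': 320}}, f)

# Load the compressed NumPy archive of per-sequence embeddings.
embeddings = np.load('embeddings.npz')
embedding = embeddings['file_name_sequence_id']  # one vector per sequence

# Load sequence IDs and metadata.
with open('metadata.json', 'r') as f:
    metadata = json.load(f)

print(embedding.shape)  # (320,)
```

These vectors can then feed downstream models or dimensionality-reduction tools such as t-SNE, as the app's description suggests.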