gabboud committed on
Commit fdc5e1b · 1 Parent(s): 630d8be

description text

Files changed (1)
  1. app.py +21 -41
app.py CHANGED
@@ -16,8 +16,8 @@ print("Downloading ESM2 models...")
 
 MODELS = {
     "facebook/esm2_t6_8M_UR50D": "ESM2-8M",
-    "facebook/esm2_t12_35M_UR50D": "ESM2-35M",
-    "facebook/esm2_t33_650M_UR50D": "ESM2-650M"
 }
 
 cache_dirs = cache_all_models(MODELS)
@@ -27,26 +27,25 @@ models_and_tokenizers = load_all_models(MODELS)
 # Create Gradio interface
 with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
     gr.Markdown("""
-    # ESM2 Protein Sequence Embeddings
-
-    Generate embeddings for protein sequences using Meta's ESM2 language model.
-
-    **Features:**
-    - Process one or multiple FASTA files
-    - Generate high-dimensional embeddings (1280-D) using ESM2-650M
-    - Download embeddings in NumPy format or as JSON metadata
-    - Supports batch processing for efficiency
-
-    **Instructions:**
-    1. Upload one or more FASTA files containing protein sequences
-    2. Click "Generate Embeddings"
-    3. Download the output files (embeddings.npz, metadata.json, summary.txt)
-
-    **Output Files:**
-    - `embeddings.npz`: Compressed NumPy file with all embeddings
-    - `metadata.json`: JSON file with sequence IDs and metadata
-    - `summary.txt`: Human-readable summary
-    - `embeddings_[filename].npz`: Per-file embeddings
     """)
 
     with gr.Row():
@@ -88,8 +87,8 @@ with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
         )
     with gr.TabItem("Calculate Pseudo-Perplexity scores"):
         with gr.Row():
-            ppl_button = gr.Button("Calculate Exact Pseudo-Perplexity", variant="primary", size="lg")
-            ppl_approx_button = gr.Button("Calculate Approximate Pseudo-Perplexity", variant="primary", size="lg")
         ppl_status = gr.Textbox(
             label="Waiting for pseudo-perplexity calculation...",
             interactive=False,
@@ -133,25 +132,6 @@ with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
         )
 
 
-
-    gr.Markdown("""
-    ### How to use the embeddings:
-
-    ```python
-    import numpy as np
-    import json
-
-    # Load embeddings
-    embeddings = np.load('embeddings.npz')
-
-    # Access a specific embedding
-    embedding = embeddings['file_name_sequence_id']
-
-    # Load metadata
-    with open('metadata.json', 'r') as f:
-        metadata = json.load(f)
-    ```
-    """)
 
 
 if __name__ == "__main__":
 
 
 MODELS = {
     "facebook/esm2_t6_8M_UR50D": "ESM2-8M",
+    #"facebook/esm2_t12_35M_UR50D": "ESM2-35M",
+    #"facebook/esm2_t33_650M_UR50D": "ESM2-650M"
 }
 
 cache_dirs = cache_all_models(MODELS)
 
 # Create Gradio interface
 with gr.Blocks(title="ESM2 Protein Embeddings") as demo:
     gr.Markdown("""
+    # ESM2 for candidate sequence filtering 🤖
+
+    After generating de novo protein sequences with a tool like LigandMPNN, you need to rank them to select promising candidates for experimental validation. One powerful approach is to use a protein language model such as Meta's ESM2.
+    These models rely on a BERT-like architecture and a Masked Language Modeling (MLM) objective to learn rich representations of protein sequences. ESM2 can be used for two main purposes in the context of protein design:
+    1. **Generating embeddings**: ESM2's hidden layers create high-dimensional representations of protein sequences that capture structural and functional information.
+    These embeddings can serve as input features for downstream machine learning models that predict function or other properties, or even for structure prediction.
+    They can also be combined with dimensionality reduction techniques like t-SNE to visualize the sequence space, identify clusters, or compare candidates against known proteins.
+    2. **Calculating pseudo-perplexity (PPL) scores**: the lower this score is for a given input sequence, the more "natural" or "plausible" the sequence is under the model's learned distribution.
+    PPL is often used as a filtering criterion in de novo design, since sequences with lower scores are more likely to express properly in the lab and fold into stable structures.
+    PPL also provides an evaluation metric orthogonal to structure-based methods like RoseTTAFold.
+
+    ## How to use this Space:
+    - **Choose the ESM2 model:** the models differ mainly in parameter count (8M, 35M, 650M). Larger models produce more reliable PPL scores and richer embeddings, but take longer to run.
+    - **Upload one or more FASTA files** containing your candidate sequences.
+    - **Choose the batch size:** this controls how many sequences are processed together. Larger batches can speed up processing but require more GPU memory.
+    - **Choose between generating embeddings and calculating pseudo-perplexity scores.**
+
+    Note that calculating PPL is much more computationally intensive than generating embeddings: its cost scales cubically with sequence length $L$, because exact PPL requires $L$ forward passes through the model, each with a different token masked out.
+    For long sequences or large numbers of sequences, we recommend the approximate PPL calculation, which masks 10% of tokens at a time and therefore scales only quadratically with sequence length. This provides a good tradeoff between accuracy and runtime.
     """)
 
     with gr.Row():
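
The pseudo-perplexity described in the hunk above can be sketched in a model-agnostic way. This is a minimal illustration, not the app's implementation: `pseudo_perplexity` is a hypothetical helper that assumes the per-position log-probabilities log p(x_i | x_\i) have already been collected, one masked forward pass per position.

```python
import math

def pseudo_perplexity(token_log_probs):
    """Exact pseudo-perplexity: exp of the negative mean
    log-probability of each token given all the others.

    token_log_probs[i] = log p(x_i | x_\\i), obtained by masking
    position i and running one forward pass (L passes in total).
    """
    L = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / L)

# Toy check: a model that assigns probability 0.5 to every residue
# yields a pseudo-perplexity of exactly 2.
uniform = [math.log(0.5)] * 10
print(pseudo_perplexity(uniform))  # → 2.0
```

Lower values mean the sequence is more plausible under the model, which is why the score works as a ranking criterion.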
 
         )
     with gr.TabItem("Calculate Pseudo-Perplexity scores"):
         with gr.Row():
+            ppl_button = gr.Button("Calculate Exact PPL", variant="primary", size="lg")
+            ppl_approx_button = gr.Button("Calculate Approximate PPL", variant="primary", size="lg")
         ppl_status = gr.Textbox(
             label="Waiting for pseudo-perplexity calculation...",
             interactive=False,
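
The two buttons differ in their masking schedule: exact PPL masks one position per forward pass ($L$ passes), while the approximate variant masks ~10% of positions per pass (about 10 passes, whatever the length). A sketch under those assumptions; the function names are illustrative, and a real implementation may pick the masked positions randomly rather than in contiguous blocks:

```python
def exact_mask_schedule(L):
    """One forward pass per position: L passes total."""
    return [[i] for i in range(L)]

def approx_mask_schedule(L, fraction=0.10):
    """Mask ~`fraction` of positions per pass, so every position is
    masked exactly once across roughly 1/fraction passes."""
    step = max(1, int(L * fraction))
    return [list(range(start, min(start + step, L)))
            for start in range(0, L, step)]

L = 120
print(len(exact_mask_schedule(L)))   # 120 forward passes
print(len(approx_mask_schedule(L)))  # 10 forward passes
```

Since each forward pass is itself O(L²) in attention, the exact schedule gives the cubic cost and the approximate one the quadratic cost mentioned in the app's description.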
 
         )
 
 
 
 
 if __name__ == "__main__":
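
Downstream of the Space, the generated `embeddings.npz` and `metadata.json` can be loaded with NumPy and `json`, as in the usage snippet shown in the diff. This sketch first fabricates toy files so it runs standalone; the key name `file_name_sequence_id` and the 320-dimensional vector (ESM2-8M's hidden width) are illustrative assumptions, not guaranteed output of app.py:

```python
import json
import numpy as np

# Fabricate a toy archive and metadata file (stand-ins for the app's output).
np.savez_compressed('embeddings.npz',
                    file_name_sequence_id=np.zeros(320, dtype=np.float32))
with open('metadata.json', 'w') as f:
    json.dump({'file_name_sequence_id': {'length': 320}}, f)

# Load the compressed NumPy archive of per-sequence embeddings.
embeddings = np.load('embeddings.npz')
embedding = embeddings['file_name_sequence_id']  # one vector per sequence

# Load sequence IDs and metadata.
with open('metadata.json', 'r') as f:
    metadata = json.load(f)

print(embedding.shape)  # (320,)
```

These vectors can then feed downstream models or dimensionality-reduction tools such as t-SNE, as the app's description suggests.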