ESM-2 for Generating and Optimizing Peptide Binders for Target Proteins

Community Article · Published November 23, 2023

In this article we discuss fine-tuning ESM-2 to generate peptide binders for target proteins using its masked language modeling capabilities. We then discuss how to perform in silico directed evolution on the protein-peptide complex to optimize the peptide binder, and how this approach can be extended to general interacting pairs of proteins.

Introduction

In the rapidly advancing field of protein engineering, the use of AI and machine learning techniques has become increasingly prominent. Among these, ESM-2, a state-of-the-art language model specifically tailored for protein sequences, has shown exceptional promise. This article delves into the practical application of ESM-2 for the generation and optimization of peptide binders targeting specific proteins. We will explore the process of fine-tuning ESM-2 using the masked language modeling approach and demonstrate how it can be utilized to produce peptide binders. Furthermore, we will discuss the technique of in silico directed evolution, particularly focusing on protein-peptide complexes, to enhance the binding affinity of these peptides. This exploration not only highlights the potential of ESM-2 in protein engineering but also sets the stage for its broader application in understanding and designing protein-protein interactions.

We will be using PepMLM, a fine-tuned version of ESM-2, but fine-tuning your own version of ESM-2 is straightforward: head over to the PepMLM GitHub repo, clone it, adjust the file paths to the training and test data, and run the training script. The model is quite fast to train on a modest GPU. You might also try fine-tuning ESM-2 on protein-protein interaction data from UniProt, which may yield better results if your interaction dataset is larger.
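The details depend on your data, but as a rough sketch (not the actual PepMLM training script), masked-language-model fine-tuning on protein-peptide pairs looks something like the following: mask every residue of the peptide, leave the target protein visible, and train ESM-2 to recover the masked peptide. The dataset class, file names, and toy training pair below are placeholders.

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM, Trainer, TrainingArguments

class ProteinPeptideDataset(Dataset):
    """Hypothetical dataset of (protein, peptide) pairs; substitute your own interaction data."""
    def __init__(self, pairs, tokenizer):
        self.pairs = pairs
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        protein, peptide = self.pairs[idx]
        enc = self.tokenizer(protein + peptide, return_tensors="pt")
        input_ids = enc["input_ids"][0]
        labels = torch.full_like(input_ids, -100)
        # ESM-2 tokenizes one residue per token, so the peptide occupies the
        # positions just before the final </s> token.
        peptide_positions = torch.arange(len(input_ids) - 1 - len(peptide), len(input_ids) - 1)
        labels[peptide_positions] = input_ids[peptide_positions]     # learn to recover the peptide
        input_ids[peptide_positions] = self.tokenizer.mask_token_id  # hide the peptide from the model
        return {"input_ids": input_ids,
                "attention_mask": enc["attention_mask"][0],
                "labels": labels}

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Toy pair for illustration only; in practice load thousands of protein-peptide pairs.
train_pairs = [("MSGIALSRLAQERKAWRKDHPFGF", "FDEDDPLAPRLLEEE")]
train_dataset = ProteinPeptideDataset(train_pairs, tokenizer)

args = TrainingArguments(output_dir="esm2-peptide-finetune",
                         per_device_train_batch_size=1,  # batch size 1 avoids padding/collation issues
                         num_train_epochs=1,
                         logging_steps=1)
Trainer(model=model, args=args, train_dataset=train_dataset).train()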

Generating Peptide Binders with PepMLM

First, we need to pick a protein from UniProt as our target protein. For this experiment, we will use the human SUMO-conjugating enzyme UBC9 (UniProt accession P63279). Below you can see the 3D folded protein structure as predicted by ESMFold.

[Image: ESMFold-predicted structure of the target protein]
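If you don't want to copy the sequence by hand, you can pull the FASTA record directly from UniProt's REST API. This is just a convenience sketch; the endpoint format below is an assumption based on the current UniProt REST interface.

import requests

# Fetch the FASTA record for the target protein (UniProt accession P63279).
url = "https://rest.uniprot.org/uniprotkb/P63279.fasta"
fasta = requests.get(url).text

# Drop the FASTA header line and join the sequence lines into one string.
protein_seq = "".join(fasta.splitlines()[1:])
print(protein_seq)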

Once we have our protein sequence in hand, we need to define some functions to use either our newly trained version of ESM-2 or the original PepMLM. Here we will just use the original PepMLM for simplicity, but you can easily substitute your newly fine-tuned ESM-2 model.

from transformers import AutoTokenizer, AutoModelForMaskedLM  
import torch
import pandas as pd
import numpy as np
from torch.distributions import Categorical

def compute_pseudo_perplexity(model, tokenizer, protein_seq, binder_seq):
    sequence = protein_seq + binder_seq
    original_input = tokenizer.encode(sequence, return_tensors='pt').to(model.device)
    length_of_binder = len(binder_seq)

    # Prepare a batch with each row having one masked token from the binder sequence
    masked_inputs = original_input.repeat(length_of_binder, 1)
    positions_to_mask = torch.arange(-length_of_binder - 1, -1, device=model.device)

    masked_inputs[torch.arange(length_of_binder), positions_to_mask] = tokenizer.mask_token_id

    # Prepare labels for the masked tokens
    labels = torch.full_like(masked_inputs, -100)
    labels[torch.arange(length_of_binder), positions_to_mask] = original_input[0, positions_to_mask]

    # Get model predictions and calculate loss
    with torch.no_grad():
        outputs = model(masked_inputs, labels=labels)
        loss = outputs.loss

    # Loss is already averaged by the model
    avg_loss = loss.item()
    pseudo_perplexity = np.exp(avg_loss)
    return pseudo_perplexity


def generate_peptide_for_single_sequence(protein_seq, peptide_length = 15, top_k = 3, num_binders = 4):

    peptide_length = int(peptide_length)
    top_k = int(top_k)
    num_binders = int(num_binders)

    binders_with_ppl = []

    for _ in range(num_binders):
        # Generate binder
        masked_peptide = '<mask>' * peptide_length
        input_sequence = protein_seq + masked_peptide
        inputs = tokenizer(input_sequence, return_tensors="pt").to(model.device)

        with torch.no_grad():
            logits = model(**inputs).logits
        mask_token_indices = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
        logits_at_masks = logits[0, mask_token_indices]

        # Apply top-k sampling
        top_k_logits, top_k_indices = logits_at_masks.topk(top_k, dim=-1)
        probabilities = torch.nn.functional.softmax(top_k_logits, dim=-1)
        predicted_indices = Categorical(probabilities).sample()
        predicted_token_ids = top_k_indices.gather(-1, predicted_indices.unsqueeze(-1)).squeeze(-1)

        generated_binder = tokenizer.decode(predicted_token_ids, skip_special_tokens=True).replace(' ', '')

        # Compute PPL for the generated binder
        ppl_value = compute_pseudo_perplexity(model, tokenizer, protein_seq, generated_binder)

        # Add the generated binder and its PPL to the results list
        binders_with_ppl.append([generated_binder, ppl_value])

    return binders_with_ppl

def generate_peptide(input_seqs, peptide_length=15, top_k=3, num_binders=4):
    if isinstance(input_seqs, str):  # Single sequence
        binders = generate_peptide_for_single_sequence(input_seqs, peptide_length, top_k, num_binders)
        return pd.DataFrame(binders, columns=['Binder', 'Pseudo Perplexity'])

    elif isinstance(input_seqs, list):  # List of sequences
        results = []
        for seq in input_seqs:
            binders = generate_peptide_for_single_sequence(seq, peptide_length, top_k, num_binders)
            for binder, ppl in binders:
                results.append([seq, binder, ppl])
        return pd.DataFrame(results, columns=['Input Sequence', 'Binder', 'Pseudo Perplexity'])
    
model = AutoModelForMaskedLM.from_pretrained("TianlaiChen/PepMLM-650M")
tokenizer = AutoTokenizer.from_pretrained("TianlaiChen/PepMLM-650M")

protein_seq = "MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGLFKLRMLFKDDYPSSPPKCKFEPPLFHPNVYPSGTVCLSILEEDKDWRPAITIKQILLGIQELLNEPNIQDPAQAEAYTIYCQNRVEYEKRVRAQAKKFAPS"

results_df = generate_peptide(protein_seq, peptide_length=15, top_k=3, num_binders=5)
print(results_df)

This code will print five peptides that are likely to bind to our target protein of interest, along with their pseudo-perplexities. In particular, you should see something like the following:

Binder  Pseudo Perplexity
0  FQEEPPPLRLAALLL          11.627843
1  EDEDDPEPRYALELE           9.462782
2  FDEDDPLAPRLLEEE           8.092850
3  EQEDPPLPLYALAEE          11.335873
4  FDGEPPLARRLLAKL          10.967774
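If you prefer to select the lowest-perplexity binder programmatically rather than by eye, something like the following works (the exact sequences and values will differ from run to run):

# Sort the generated binders by pseudo-perplexity and keep the best (lowest) one.
best_row = results_df.sort_values("Pseudo Perplexity").iloc[0]
best_binder = best_row["Binder"]
print(best_binder, best_row["Pseudo Perplexity"])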

Choosing the one with the lowest pseudo-perplexity, FDEDDPLAPRLLEEE, we can query ESMFold to predict the structure of the target protein linked to the peptide binder by a long flexible linker of 20 glycine (G) residues.
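One way to do this, sketched below, is to send the concatenated sequence (target protein, 20-glycine linker, binder) to the public ESM Atlas folding API; the endpoint URL is an assumption based on the ESM Atlas documentation, and you could equally run the facebook/esmfold_v1 model locally.

import requests

# Concatenate the target protein, a 20-glycine flexible linker, and the chosen binder.
linker = "G" * 20
linked_sequence = protein_seq + linker + "FDEDDPLAPRLLEEE"

# Assumed ESM Atlas folding endpoint; it returns a PDB file as plain text.
response = requests.post("https://api.esmatlas.com/foldSequence/v1/pdb/", data=linked_sequence)

with open("target_with_binder.pdb", "w") as f:
    f.write(response.text)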

[Image: ESMFold-predicted structure of the target protein linked to the peptide binder FDEDDPLAPRLLEEE]

As we can see, the overall confidence of the predicted structure is quite high, but the peptide region shows some yellow stretches denoting lower confidence. We can improve the confidence by performing directed evolution on the peptide region using EvoProtGrad.

In Silico Directed Evolution of the Peptide Binder with EvoProtGrad and ESM-2

Next, we will perform (in silico) directed evolution of the peptide binder only in an attempt to improve its binding affinity to our target protein.

import evo_prot_grad
from transformers import AutoTokenizer, EsmForMaskedLM

def run_evo_prot_grad_on_paired_sequence(paired_protein_sequence):
    # Replace ':' with a string of 20 'G' amino acids
    separator = 'G' * 20
    sequence_with_separator = paired_protein_sequence.replace(':', separator)

    # Determine the start and end indices of the first protein and the separator
    separator_start_index = sequence_with_separator.find(separator)
    first_protein_end_index = separator_start_index
    separator_end_index = separator_start_index + len(separator)

    # Format the sequence into FASTA format
    fasta_format_sequence = f">Paired_Protein_Sequence\n{sequence_with_separator}"

    # Save the sequence to a temporary file
    temp_fasta_path = "temp_paired_sequence.fasta"
    with open(temp_fasta_path, "w") as file:
        file.write(fasta_format_sequence)

    # Load the ESM-2 model and tokenizer as the expert
    esm2_expert = evo_prot_grad.get_expert(
        'esm',
        model=EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D"),
        tokenizer=AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D"),
        temperature=0.95,
        device='cuda'  # or 'cpu' if GPU is not available
    )

    # Initialize Directed Evolution with the preserved first protein and separator region
    directed_evolution = evo_prot_grad.DirectedEvolution(
        wt_fasta=temp_fasta_path,
        output='best',
        experts=[esm2_expert],
        parallel_chains=1,
        n_steps=50,
        max_mutations=15,
        verbose=True,
        preserved_regions=[(0, first_protein_end_index), (separator_start_index, separator_end_index)]  # Preserve the first protein and the 'G' amino acids string
    )

    # Run the evolution process
    variants, scores = directed_evolution()

    # Process the results and split them into Protein 1 and Protein 2
    for variant, score in zip(variants, scores):
        # Remove spaces from the sequence
        evolved_sequence_no_spaces = variant.replace(" ", "")

        # Split the sequence at the separator
        protein_1, protein_2 = evolved_sequence_no_spaces.split(separator)

        print(f"Protein: {protein_1}, Evolved Peptide: {protein_2}, Score: {score}")

# Example usage
paired_protein_sequence = "MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGLFKLRMLFKDDYPSSPPKCKFEPPLFHPNVYPSGTVCLSILEEDKDWRPAITIKQILLGIQELLNEPNIQDPAQAEAYTIYCQNRVEYEKRVRAQAKKFAPS:FDEDDPLAPRLLEEE"  # Replace with your paired protein sequences
run_evo_prot_grad_on_paired_sequence(paired_protein_sequence)

This should print 50 steps of the gradient-based MCMC method, using ESM-2 as the expert model.

>Wildtype sequence: M S G I A L S R L A Q E R K A W R K D H P F G F V A V P T K N P D G T M N L M N W E C A I P G K K G T P W E G G L F K L R M L F K D D Y P S S P P K C K F E P P L F H P N V Y P S G T V C L S I L E E D K D W R P A I T I K Q I L L G I Q E L L N E P N I Q D P A Q A E A Y T I Y C Q N R V E Y E K R V R A Q A K K F A P S G G G G G G G G G G G G G G G G G G G G F D E D D P L A P R L L E E E
step 0 acceptance rate: 0.1181
>chain 0, Product of Experts score: 0.0000
M S G I A L S R L A Q E R K A W R K D H P F G F V A V P T K N P D G T M N L M N W E C A I P G K K G T P W E G G L F K L R M L F K D D Y P S S P P K C K F E P P L F H P N V Y P S G T V C L S I L E E D K D W R P A I T I K Q I L L G I Q E L L N E P N I Q D P A Q A E A Y T I Y C Q N R V E Y E K R V R A Q A K K F A P S G G G G G G G G G G G G G G G G G G G G F D E D D P L A P R L L E E E
step 1 acceptance rate: 0.3120
>chain 0, Product of Experts score: 0.0000
M S G I A L S R L A Q E R K A W R K D H P F G F V A V P T K N P D G T M N L M N W E C A I P G K K G T P W E G G L F K L R M L F K D D Y P S S P P K C K F E P P L F H P N V Y P S G T V C L S I L E E D K D W R P A I T I K Q I L L G I Q E L L N E P N I Q D P A Q A E A Y T I Y C Q N R V E Y E K R V R A Q A K K F A P S G G G G G G G G G G G G G G G G G G G G F D E D D P L A V R L L E E E
step 2 acceptance rate: 0.0463
>chain 0, Product of Experts score: 1.8138
M S G I A L S R L A Q E R K A W R K D H P F G F V A V P T K N P D G T M N L M N W E C A I P G K K G T P W E G G L F K L R M L F K D D Y P S S P P K C K F E P P L F H P N V Y P S G T V C L S I L E E D K D W R P A I T I K Q I L L G I Q E L L N E P N I Q D P A Q A E A Y T I Y C Q N R V E Y E K R V R A Q A K K F A P S G G G G G G G G G G G G G G G G G G G G F D E D D P L A V R L F E E E
step 3 acceptance rate: 0.1111
.
.
.
>chain 0, Product of Experts score: -2.7335
M S G I A L S R L A Q E R K A W R K D H P F G F V A V P T K N P D G T M N L M N W E C A I P G K K G T P W E G G L F K L R M L F K D D Y P S S P P K C K F E P P L F H P N V Y P S G T V C L S I L E E D K D W R P A I T I K Q I L L G I Q E L L N E P N I Q D P A Q A E A Y T I Y C Q N R V E Y E K R V R A Q A K K F A P S G G G G G G G G G G G G G G G G G G G G F V G I D F S P R V G E D K V
step 48 acceptance rate: 0.0667
>chain 0, Product of Experts score: -2.7335
M S G I A L S R L A Q E R K A W R K D H P F G F V A V P T K N P D G T M N L M N W E C A I P G K K G T P W E G G L F K L R M L F K D D Y P S S P P K C K F E P P L F H P N V Y P S G T V C L S I L E E D K D W R P A I T I K Q I L L G I Q E L L N E P N I Q D P A Q A E A Y T I Y C Q N R V E Y E K R V R A Q A K K F A P S G G G G G G G G G G G G G G G G G G G G F V G I D F S M R V G E D K V
step 49 acceptance rate: 1.9997
>chain 0, Product of Experts score: -4.7062
M S G I A L S R L A Q E R K A W R K D H P F G F V A V P T K N P D G T M N L M N W E C A I P G K K G T P W E G G L F K L R M L F K D D Y P S S P P K C K F E P P L F H P N V Y P S G T V C L S I L E E D K D W R P A I T I K Q I L L G I Q E L L N E P N I Q D P A Q A E A Y T I Y C Q N R V E Y E K R V R A Q A K K F A P S G G G G G G G G G G G G G G G G G G G G F V H I D F S M R V G C D K V
Protein: MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGLFKLRMLFKDDYPSSPPKCKFEPPLFHPNVYPSGTVCLSILEEDKDWRPAITIKQILLGIQELLNEPNIQDPAQAEAYTIYCQNRVEYEKRVRAQAKKFAPS, Evolved Peptide: FSPSDVNKLFGKDED, Score: 2.326873779296875

Note that this process actually increases the pseudo-perplexity:

model = AutoModelForMaskedLM.from_pretrained("TianlaiChen/PepMLM-650M")
tokenizer = AutoTokenizer.from_pretrained("TianlaiChen/PepMLM-650M")

protein_seq = "MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGLFKLRMLFKDDYPSSPPKCKFEPPLFHPNVYPSGTVCLSILEEDKDWRPAITIKQILLGIQELLNEPNIQDPAQAEAYTIYCQNRVEYEKRVRAQAKKFAPS"
binder_seq = "FSPSDVNKLFGKDED"
compute_pseudo_perplexity(model, tokenizer, protein_seq, binder_seq)
# Output: 21.214165484401022

However, below we see that the confidence ESMFold has in the predicted structure has actually gone up. This suggests that pseudo-perplexity alone may be an insufficient metric for judging the quality of a binder.

[Image: ESMFold-predicted structure of the target protein linked to the evolved peptide binder]

Zooming in on the interaction of the peptide binder with the target protein, we can see that the atoms of our target protein and the peptide binder are actually quite close.

[Image: close-up of the interface between the target protein and the peptide binder]

We also note that there are some nuances to consider when computing perplexity. In particular, it has been shown in the NLP domain that geometric compression, as measured by intrinsic dimension, actually predicts perplexity (see for example Estimating the Intrinsic Dimension of Protein Sequence Embeddings using ESM-2 and Bridging Information-Theoretic and Geometric Compression in Language Models). This gives us a sense of how difficult it will be to adapt the model during training or fine-tuning, and it can provide a measure of how "natural" the protein or peptide is, but it does not necessarily tell us how good the binder will be in terms of binding affinity. For that we may need an additional method explicitly designed for predicting binding affinity.
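As a rough illustration (not the exact method of the articles cited above), here is a sketch that embeds a sequence with ESM-2 and estimates the intrinsic dimension of its per-residue embeddings with the TwoNN estimator; the use of the scikit-dimension (skdim) package and the choice of per-residue embeddings as the point cloud are assumptions.

import torch
import skdim
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
esm = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Target protein plus the evolved peptide from the directed evolution run above.
sequence = protein_seq + "FSPSDVNKLFGKDED"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = esm(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim) per-residue embeddings

# Treat the per-residue embeddings as a point cloud and estimate its intrinsic dimension.
intrinsic_dim = skdim.id.TwoNN().fit(hidden.numpy()).dimension_
print(intrinsic_dim)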

Conclusions

The exploration of ESM-2 and its applications in generating and optimizing peptide binders for target proteins has revealed the immense potential of AI-driven methods in the field of protein engineering. Through the processes of training PepMLM, generating peptide binders, and performing in silico directed evolution using EvoProtGrad, we have demonstrated a comprehensive approach to enhancing peptide-protein interactions. The use of ESM-2 has not only facilitated the identification of potential peptide binders but also allowed for their subsequent optimization, showcasing a significant advancement in predictive and design capabilities in protein science. This methodology holds promise for a wide range of applications, including drug discovery, therapeutic protein design, and understanding complex biological interactions. The integration of AI and computational biology, as exemplified in this study, represents a significant stride towards more efficient and targeted approaches in molecular biology and bioengineering. For more information on PepMLM, you can read the research paper. For more information on EvoProtGrad, check out the research paper and the GitHub repo.