In Silico Directed Evolution of Protein Sequences with ESM-2 and EvoProtGrad

Community Article Published November 13, 2023

Protein engineering through directed evolution has been a cornerstone of biotechnology, enabling the optimization of proteins for industrial, therapeutic, and diagnostic applications. Protein language models such as ESM-2 now make it possible to carry out part of this process in silico. The EvoProtGrad framework uses these models to explore and optimize protein sequences quickly, which can substantially speed up the early stages of protein design. Here we discuss in silico directed evolution of individual proteins, as well as of pairs of interacting proteins, with the aim of optimizing protein-protein interactions.

Introduction

Directed evolution, a process mimicking natural selection, is traditionally performed in vitro or in vivo, involving the generation of a vast library of protein variants followed by screening for desired traits. The emergence of protein language models, such as ESM-2 (Evolutionary Scale Modeling), has facilitated a shift towards in silico methods. These models, trained on extensive protein sequence databases, have developed an understanding of the language of proteins, allowing them to predict how changes in amino acid sequences can affect protein structure and function.

EvoProtGrad, a Python framework, integrates these protein language models to perform directed evolution in silico. It takes a gradient-based approach: gradients from expert models such as ESM-2 guide which mutations are proposed, and the sequence is mutated iteratively toward higher-scoring variants. This allows protein sequence space to be explored more efficiently than with traditional methods.

Single Protein Evolution

The first method focuses on evolving a single protein sequence. The sequence is first written to a file in FASTA format, a widely used text-based format for representing nucleotide or peptide sequences. Each record begins with a descriptive header line starting with '>', followed by the sequence itself on the subsequent lines.
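To make the FASTA layout concrete, here is a minimal sketch of writing and reading a single record in plain Python. The helper names `to_fasta` and `parse_fasta` are made up for this illustration and are not part of EvoProtGrad; in practice a library such as Biopython would normally handle this.

```python
from typing import Dict


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Return a FASTA record: a '>' header line, then the sequence wrapped to `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


# Round trip: wrap a short sequence at 10 residues per line, then parse it back.
record = to_fasta("Input_Sequence", "MALWMRLLPLLALLALWGPDPAAA", width=10)
print(record)
print(parse_fasta(record))
```

Wrapping the sequence across several lines is optional; a single long sequence line, as used in the code later in this article, is also valid FASTA.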

The ESM-2 model and its tokenizer are then loaded as the expert system for directed evolution. The model, pretrained on vast protein sequence data, understands the complex relationships between amino acids. The tokenizer converts the protein sequences into a format that the ESM-2 model can process.
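As a rough intuition for how a masked language model can act as an "expert", one widely used heuristic scores a variant by summing, over the mutated positions, the log-probability of the new residue minus that of the wild-type residue. The sketch below illustrates this with a made-up probability table standing in for model output. The function name `log_odds_score` and the toy numbers are assumptions for illustration only, and this is not necessarily the exact scoring EvoProtGrad applies internally.

```python
import math
from typing import Dict, List


def log_odds_score(wild_type: str, variant: str,
                   log_probs: List[Dict[str, float]]) -> float:
    """Sum log p(variant residue) - log p(wild-type residue) over mutated positions."""
    if len(wild_type) != len(variant):
        raise ValueError("substitution-only scoring needs equal-length sequences")
    score = 0.0
    for pos, (wt_aa, var_aa) in enumerate(zip(wild_type, variant)):
        if wt_aa != var_aa:
            score += log_probs[pos][var_aa] - log_probs[pos][wt_aa]
    return score


# Made-up per-position probabilities for a 3-residue toy "protein" over a 2-letter alphabet.
# A real model would supply one distribution over all 20 amino acids per position.
toy_probs = [
    {"A": 0.9, "G": 0.1},
    {"A": 0.2, "G": 0.8},
    {"A": 0.5, "G": 0.5},
]
toy_log_probs = [{aa: math.log(p) for aa, p in row.items()} for row in toy_probs]

print(log_odds_score("AAA", "AGA", toy_log_probs))  # positive: G is favoured at index 1
print(log_odds_score("AAA", "GAA", toy_log_probs))  # negative: G is disfavoured at index 0
```

A positive score means the model finds the variant more plausible than the wild type at the mutated positions; it says nothing by itself about any specific function such as binding or stability.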

Directed evolution is initiated using EvoProtGrad's DirectedEvolution class, specifying the ESM-2 model as the expert. The process runs one or more parallel chains of Markov Chain Monte Carlo (MCMC) steps. Each chain explores the sequence space, proposing mutations at each step. EvoProtGrad then evaluates these mutations using the expert model's predictions and preferentially accepts mutations that the expert scores as improvements.
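The propose-and-accept loop described above can be sketched with a plain Metropolis sampler. This is a deliberately simplified stand-in: it proposes uniformly random single-residue substitutions, whereas EvoProtGrad uses gradient information from the experts to choose proposals. The scoring function `toy_score` and the helper `run_chain` are hypothetical names used only for this example.

```python
import math
import random
from typing import Callable, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq: str) -> float:
    """Stand-in for an expert model: rewards alanine content. Purely illustrative."""
    return float(seq.count("A"))


def run_chain(wild_type: str, score_fn: Callable[[str], float],
              n_steps: int = 20, temperature: float = 1.0,
              seed: int = 0) -> Tuple[str, float]:
    """Run one Metropolis chain of single-residue substitutions; return the best variant seen."""
    rng = random.Random(seed)
    current, current_score = wild_type, score_fn(wild_type)
    best, best_score = current, current_score
    for _ in range(n_steps):
        # Propose a substitution at a random position (a symmetric proposal).
        pos = rng.randrange(len(current))
        new_aa = rng.choice(AMINO_ACIDS)
        proposal = current[:pos] + new_aa + current[pos + 1:]
        proposal_score = score_fn(proposal)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        accept_prob = min(1.0, math.exp((proposal_score - current_score) / temperature))
        if rng.random() < accept_prob:
            current, current_score = proposal, proposal_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


best_variant, best_score = run_chain("MKTLLVGS", toy_score, n_steps=200, seed=0)
print(best_variant, best_score)
```

Lowering `temperature` makes the chain greedier, while raising it lets the chain accept more score-decreasing moves and explore more widely. Running several chains with different seeds corresponds loosely to the parallel chains mentioned above.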

!pip install evo_prot_grad -q
import evo_prot_grad
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

def run_evo_prot_grad(raw_protein_sequence):
    # Convert raw protein sequence to the format expected by EvoProtGrad
    # EvoProtGrad reads the wild-type sequence from a FASTA file, so we build a FASTA string
    fasta_format_sequence = f">Input_Sequence\n{raw_protein_sequence}"

    # Save the FASTA string to a temporary file
    temp_fasta_path = "temp_input_sequence.fasta"
    with open(temp_fasta_path, "w") as file:
        file.write(fasta_format_sequence)

    # Use the GPU when one is available, otherwise fall back to the CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Load the ESM-2 model and tokenizer as the expert
    # Note: some EvoProtGrad releases may also expect a scoring_strategy argument here;
    # check the docs for your installed version
    esm2_expert = evo_prot_grad.get_expert(
        'esm',
        model=EsmForMaskedLM.from_pretrained("facebook/esm2_t30_150M_UR50D"),
        tokenizer=AutoTokenizer.from_pretrained("facebook/esm2_t30_150M_UR50D"),
        temperature=0.95,
        device=device
    )

    # Initialize Directed Evolution with the ESM-2 expert
    directed_evolution = evo_prot_grad.DirectedEvolution(
        wt_fasta=temp_fasta_path,    # path to the temporary FASTA file
        output='best',               # can be 'best', 'last', or 'all' variants
        experts=[esm2_expert],       # list of experts, in this case only ESM-2
        parallel_chains=1,           # number of parallel chains to run
        n_steps=20,                  # number of MCMC steps per chain
        max_mutations=10,            # maximum number of mutations per variant
        verbose=True                 # print debug info
    )

    # Run the evolution process
    variants, scores = directed_evolution()

    # Process the results
    for variant, score in zip(variants, scores):
        print(f"Variant: {variant}, Score: {score}")

# Example usage
raw_protein_sequence = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"  # Replace with your protein sequence
run_evo_prot_grad(raw_protein_sequence)
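When inspecting the printed variants, it is often easier to read the list of substitutions than the full sequence. The helper below is a small sketch for that purpose; `list_mutations` is a hypothetical name, not an EvoProtGrad function. It assumes substitution-only variants of the same length as the wild type, and it strips spaces in case variants are returned as space-separated residues.

```python
from typing import List


def list_mutations(wild_type: str, variant: str) -> List[str]:
    """Return substitutions in <wild type><1-based position><new> notation, e.g. 'W4F'."""
    # Assumption: variants may come back with spaces between residues, so strip them.
    variant = variant.replace(" ", "")
    if len(variant) != len(wild_type):
        raise ValueError("expected a substitution-only variant of the same length")
    return [
        f"{wt_aa}{pos}{var_aa}"
        for pos, (wt_aa, var_aa) in enumerate(zip(wild_type, variant), start=1)
        if wt_aa != var_aa
    ]


print(list_mutations("MALWMR", "M A L F M K"))  # ['W4F', 'R6K']
```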

Paired Protein Evolution

The second method extends this approach to paired protein sequences. The two sequences are supplied as a single string with a ':' between them, and the ':' is then replaced by a linker, in this case a string of 20 'G' (glycine) residues. This linker allows the two protein sequences to be evolved simultaneously while keeping them distinct and preserving their relational context.

As in the single protein case, the linked sequence is written to a FASTA file. It is then subjected to the directed evolution process, with the 'G' linker region marked as preserved to maintain the distinction between the two protein sequences.

During the evolution process, mutations are proposed and evaluated across both protein sequences, considering their combined context. The preserved region ensures that mutations do not disrupt the separator, maintaining the integrity of the paired format.
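The bookkeeping for the linker can be sketched as follows. The helpers `join_with_linker` and `split_at_linker` are hypothetical names for this illustration. The sketch locates the linker from the position of ':' rather than by searching for the 'G' string, since a search could match one residue too early when the first sequence happens to end in 'G'. The end index here is exclusive, following Python slicing; whether a given library treats region bounds as inclusive or exclusive should be checked in its documentation.

```python
from typing import Tuple


def join_with_linker(paired: str, linker_length: int = 20) -> Tuple[str, int, int]:
    """Replace the single ':' in 'SEQ_A:SEQ_B' with a poly-G linker.

    Returns the joined sequence plus the linker's start and (exclusive) end index.
    """
    if paired.count(":") != 1:
        raise ValueError("expected exactly one ':' between the two sequences")
    seq_a, seq_b = paired.split(":")
    linker = "G" * linker_length
    start = len(seq_a)
    return seq_a + linker + seq_b, start, start + linker_length


def split_at_linker(joined: str, start: int, end: int) -> str:
    """Cut the linker back out by position and restore the ':' separator."""
    return joined[:start] + ":" + joined[end:]


# Both toy sequences border the linker with a 'G', the case where string search is ambiguous.
joined, start, end = join_with_linker("MKTG:GAVL", linker_length=5)
print(joined, start, end)
print(joined.find("G" * 5))                 # string search reports 3, not the true start 4
print(split_at_linker(joined, start, end))  # MKTG:GAVL
```

Because the mutations considered here are substitutions, sequence length does not change during evolution, so the same `start` and `end` can be reused to split an evolved variant.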

Results and Discussion

The EvoProtGrad framework, utilizing ESM-2, demonstrates a novel approach to protein engineering. By simulating natural evolutionary processes in silico, it allows for the rapid exploration of vast sequence spaces. The ability to evolve single or paired protein sequences provides flexibility in targeting individual proteins or protein complexes.

The use of a string of 'G' amino acids as a separator or "linker" in paired protein evolution is not a new idea (see, for example, Linkers in the structural biology of protein–protein interactions). It ensures that the relational context between the protein pairs is considered during the evolution process, which is crucial for proteins that interact or function together. It is also flexible: when the linker length is well chosen, it gives us a way of modeling protein-protein complexes using a model trained only on single sequences. Finetuning on linked protein pairs or on complexes may improve this performance, but we leave this for future work.

In silico directed evolution using EvoProtGrad and ESM-2 represents a significant advancement in protein engineering. It offers a faster, more cost-effective alternative to traditional methods, with the potential to accelerate the development of proteins with enhanced or novel functions. This computational approach, harnessing the power of advanced protein language models, is poised to become an indispensable tool in the field of protein engineering.

import evo_prot_grad
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

def run_evo_prot_grad_on_paired_sequence(paired_protein_sequence):
    # The input must contain exactly one ':' between the two protein sequences
    if paired_protein_sequence.count(':') != 1:
        raise ValueError("Expected two sequences separated by a single ':'")

    # Replace ':' with a string of 20 'G' amino acids
    separator = 'G' * 20
    sequence_with_separator = paired_protein_sequence.replace(':', separator)

    # Determine the start and end indices of the separator from the position of ':'
    # (searching for the 'G' string could match too early if the first sequence ends in 'G')
    separator_start_index = paired_protein_sequence.index(':')
    separator_end_index = separator_start_index + len(separator)

    # Format the sequence into FASTA format
    fasta_format_sequence = f">Paired_Protein_Sequence\n{sequence_with_separator}"

    # Save the sequence to a temporary file
    temp_fasta_path = "temp_paired_sequence.fasta"
    with open(temp_fasta_path, "w") as file:
        file.write(fasta_format_sequence)

    # Use the GPU when one is available, otherwise fall back to the CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Load the ESM-2 model and tokenizer as the expert
    # Note: some EvoProtGrad releases may also expect a scoring_strategy argument here;
    # check the docs for your installed version
    esm2_expert = evo_prot_grad.get_expert(
        'esm',
        model=EsmForMaskedLM.from_pretrained("facebook/esm2_t30_150M_UR50D"),
        tokenizer=AutoTokenizer.from_pretrained("facebook/esm2_t30_150M_UR50D"),
        temperature=0.95,
        device=device
    )

    # Initialize Directed Evolution with the preserved separator region
    directed_evolution = evo_prot_grad.DirectedEvolution(
        wt_fasta=temp_fasta_path,
        output='best',
        experts=[esm2_expert],
        parallel_chains=1,
        n_steps=20,
        max_mutations=10,
        verbose=True,
        preserved_regions=[(separator_start_index, separator_end_index)]  # Preserve the 'G' amino acids string
    )

    # Run the evolution process
    variants, scores = directed_evolution()

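    # Process the results, cutting the linker out by position and restoring ':'
    for variant, score in zip(variants, scores):
        # Drop any spaces in case variants are returned as space-separated residues
        sequence = variant.replace(' ', '')
        evolved_sequence = sequence[:separator_start_index] + ':' + sequence[separator_end_index:]
        print(f"Evolved Paired Sequence: {evolved_sequence}, Score: {score}")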

# Example usage
paired_protein_sequence = "MLTEVMEVWHGLVIAVVSLFLQACFLTAINYLLSRHMAHKSEQILKAASLQVPRPSPGHHHPPAVKEMKETQTERDIPMSDSLYRHDSDTPSDSLDSSCSSPPACQATEDVDYTQVVFSDPGELKNDSPLDYENIKEITDYVNVNPERHKPSFWYFVNPALSEPAEYDQVAM:MASPGSGFWSFGSEDGSGDSENPGTARAWCQVAQKFTGGIGNKLCALLYGDAEKPAESGGSQPPRAAARKAACACDQKPCSCSKVDVNYAFLHATDLLPACDGERPTLAFLQDVMNILLQYVVKSFDRSTKVIDFHYPNELLQEYNWELADQPQNLEEILMHCQTTLKYAIKTGHPRYFNQLSTGLDMVGLAADWLTSTANTNMFTYEIAPVFVLLEYVTLKKMREIIGWPGGSGDGIFSPGGAISNMYAMMIARFKMFPEVKEKGMAALPRLIAFTSEHSHFSLKKGAAALGIGTDSVILIKCDERGKMIPSDLERRILEAKQKGFVPFLVSATAGTTVYGAFDPLLAVADICKKYKIWMHVDAAWGGGLLMSRKHKWKLSGVERANSVTWNPHKMMGVPLQCSALLVREEGLMQNCNQMHASYLFQQDKHYDLSYDTGDKALQCGRHVDVFKLWLMWRAKGTTGFEAHVDKCLELAEYLYNIIKNREGYEMVFDGKPQHTNVCFWYIPPSLRTLEDNEERMSRLSKVAPVIKARMMEYGTTMVSYQPLGDKVNFFRMVISNPAATHQDIDFLIEEIERLGQDL"  # Replace with your paired protein sequences
run_evo_prot_grad_on_paired_sequence(paired_protein_sequence)

Utility and Use Cases for Evolving Pairs of Protein Sequences Using In Silico Methods

Overview

The in silico directed evolution of pairs of protein sequences, as facilitated by frameworks like EvoProtGrad with ESM-2, holds significant utility in the areas of protein engineering and molecular biology. This approach is particularly useful in cases where two or more proteins interact or function in concert, which is a common scenario in biological systems. The ability to co-evolve these protein pairs can lead to insights and developments that might not be feasible with the evolution of individual proteins in isolation.

Key Use Cases

  1. Optimizing Protein-Protein Interactions: Many biological processes involve complex interactions between multiple proteins. Co-evolving pairs of proteins can lead to variants with enhanced binding affinity or specificity. This is particularly valuable in drug design, where targeting protein-protein interactions is crucial for developing effective therapeutic agents.

  2. Enzyme-Substrate Pairs: Enzymes often interact with specific substrates or co-factors. Where the substrate or partner is itself a protein or peptide, co-evolving the pair can optimize these interactions, leading to improved catalytic efficiency or altered substrate specificity. This has broad implications for industrial biocatalysis and metabolic engineering.

  3. Signal Transduction Pathways: Proteins within signal transduction pathways often work in pairs or groups. Co-evolving these proteins can help in understanding and potentially modifying the signaling pathways, which is vital in both basic biological research and in the development of treatments for diseases that involve signaling malfunctions, like cancer.

  4. Structural Protein Complexes: Structural biology heavily relies on the interactions between protein subunits. Co-evolving these subunits can lead to the formation of novel protein complexes with desired structural properties, which can be harnessed in material science and nanotechnology.

  5. Immunology: In the immune system, antibodies bind to specific antigens. Co-evolving antibodies with their respective antigens can lead to the development of more effective vaccines and immunotherapies.

Advantages of In Silico Methods

  1. Speed and Efficiency: Computational methods allow for the rapid exploration of vast protein sequence spaces, significantly quicker than experimental methods.

  2. Reduced Costs: In silico methods minimize the need for expensive and time-consuming laboratory experiments, particularly during the initial screening phases.

  3. Precision: Advanced models like ESM-2 can often rank the likely effects of mutations well, which focuses the design process on a smaller set of promising candidates for experimental testing.

  4. Complex System Exploration: The ability to simultaneously evolve protein pairs allows for the exploration of complex interaction dynamics that might be difficult to study experimentally.

The directed evolution of protein pairs using computational methods opens new avenues in protein engineering, enabling the exploration and optimization of complex biological interactions. This approach is not only a boon for basic biological research but also holds considerable potential in applied fields such as therapeutics, industrial biotechnology, and synthetic biology. The integration of advanced computational models like ESM-2 into platforms like EvoProtGrad is a significant step forward in our ability to engineer and understand proteins. For more information on EvoProtGrad, see its documentation or its GitHub repository.