Persistent Homology Alignment (PHA): Replacing Multiple Sequence Alignments using ESM-2 and Persistent Homology

Published November 15, 2023

In the papers Leveraging protein language models for accurate multiple sequence alignments and Vector-clustering Multiple Sequence Alignment: Aligning into the twilight zone of protein sequence similarity with protein language models, the authors discuss the potential of replacing traditional Multiple Sequence Alignments (MSAs) with embeddings from a protein language model (pLM). Here we will discuss an adaptation of this method using the pLM ESM-2, together with an alternative way of determining sequence similarity based on persistent homology, which we call "Persistent Homology Alignment" (PHA).


Introduction

The following method works well for "twilight zone" proteins. The term "twilight zone" in the context of protein sequences refers to a range of sequence similarity where it becomes particularly challenging to determine whether two protein sequences are evolutionarily related. This concept is critical in bioinformatics, especially in the fields of protein sequence alignment and homology detection.

Protein sequences are often compared to infer their evolutionary relationships. This comparison is typically done using sequence alignment techniques, which identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. The degree of similarity is often quantified in terms of sequence identity, which is the percentage of amino acids in aligned sequences that are identical.
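
For concreteness, sequence identity over an already-aligned pair of sequences can be computed in a few lines of Python. This is a toy illustration only, not part of the PHA method, and the example sequences are made up:

# Toy illustration: percent identity of two pre-aligned sequences,
# counting only columns where neither sequence has a gap.
def percent_identity(aligned_a, aligned_b):
    assert len(aligned_a) == len(aligned_b)
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

print(percent_identity("MKT-AILV", "MRTQAI-V"))  # ~83.3% over the 6 ungapped columns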

The twilight zone typically refers to sequence identities in the range of about 20% to 35%. In this range:

  1. Low Sequence Identity: The sequences are so divergent that their similarity could either be due to a distant evolutionary relationship or simply occur by chance. This makes it difficult to confidently infer homology (evolutionary relatedness) based solely on sequence identity.

  2. Structural Conservation: Despite low sequence similarity, proteins in the twilight zone often retain similar three-dimensional structures and functions. This structural conservation despite sequence divergence is a remarkable aspect of protein evolution.

  3. Alignment Challenges: Traditional sequence alignment methods and scoring systems (like substitution matrices) may struggle in the twilight zone. They often require additional information or sophisticated techniques, like those incorporating structural or functional data, to accurately align sequences and infer relationships.

  4. Significance in Evolutionary Studies: The twilight zone is of great interest in evolutionary biology and bioinformatics, as studying these low-similarity sequences can provide insights into how protein structures and functions have been conserved or have evolved over vast evolutionary timescales.

Overall, proteins within the twilight zone of alignment present a unique challenge and opportunity for bioinformatics, necessitating advanced methods to discern their evolutionary and functional relationships.

The method we will use is based on the following steps:

  1. Embedding Generation: Utilizing a protein language model (ESM-2), we generate per-position embeddings for each protein, capturing the unique context and characteristics of each amino acid.

  2. Sequence Clustering: We cluster proteins based on their sequence-level embeddings, grouping them into sets with high sequence similarity. This step aids in identifying proteins with potential functional or evolutionary relationships. It uses persistent homology to construct a barcode diagram for each protein; the barcodes are then clustered using the Wasserstein distance metric and DBSCAN to obtain clusters of similar proteins.

  3. Amino Acid Similarity and RBH Identification: Within each cluster, we measure the cosine similarity of amino acid vectors across sequences. We focus on identifying reciprocal best hits (RBHs), establishing correspondences between amino acids in different sequences.

  4. RBH Network and Guidepost Clustering: A network based on RBH relationships is constructed, followed by clustering to identify guidepost amino acids. These guideposts serve as reliable alignment markers across the sequences.

  5. Ordering Clusters into Columns: Using a directed acyclic graph (DAG) and topological sorting, we order the clusters to form the columns of the PHA, ensuring the correct sequential arrangement.

  6. Finalizing the Alignment: The algorithm iteratively assigns unplaced amino acids to the PHA columns, refining guidepost placements. The final step involves merging subalignments from various clusters to create a comprehensive PHA.

PHA is based on another method known as vcMSA. For a video about the vcMSA method, check out this video. We will focus on implementing Steps (1) and (2) only, as the rest is already implemented in the vcMSA GitHub repository. Replacing steps (1) and (2) of vcMSA with persistent homology based clustering of protein sequences is expected to be more robust and to capture sequence-level information better. The rest of the PHA method remains unchanged from the vcMSA technique.

Imports

# Let's start by importing the necessary libraries
from transformers import EsmModel, AutoTokenizer
import gudhi as gd
import numpy as np
import torch
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from gudhi.hera import wasserstein_distance

Helper Functions:

  • get_hidden_states: Extracts the hidden states from a specific layer of the ESM-2 model for a protein sequence.
  • compute_euclidean_distance_matrix_scipy: Computes the pairwise Euclidean distances between hidden state vectors.
  • compute_persistent_homology: Calculates persistent homology from the distance matrix, generating a persistence diagram.
  • compute_wasserstein_distances: Determines Wasserstein distances between persistence diagrams, essential for comparing topological features across different proteins.

The following Python code implements a workflow that combines a protein language model with topological data analysis of protein sequences. It involves several key steps: extracting hidden states from the language model, computing Euclidean distance matrices, analyzing the resulting point clouds with persistent homology, and finally clustering the results. Let's break down each part of this workflow in detail.

Hidden State Extraction

The function get_hidden_states takes a protein sequence, tokenizes it, and feeds it into a neural network model (specifically, a transformer-based protein language model like ESM-2) to extract hidden states from a specified layer. Mathematically, if we have an input sequence $X$ and a neural network model $f$ parameterized by weights $\theta$, the hidden state $H$ at layer $l$ can be represented as:

$$H^{(l)} = f^{(l)}_{\theta}(X)$$

where $f^{(l)}_{\theta}$ denotes the function corresponding to the $l$-th layer of the model.

# Helper function to get the hidden states of a specific layer for a given input sequence
def get_hidden_states(tokenizer, model, layer, input_sequence):
    model.config.output_hidden_states = True
    encoded_input = tokenizer([input_sequence], return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        model_output = model(**encoded_input)
    hidden_states = model_output.hidden_states
    # Note: the returned matrix also includes the special CLS/EOS token positions
    specific_hidden_states = hidden_states[layer][0]
    return specific_hidden_states

Euclidean Distance Matrix Computation

The compute_euclidean_distance_matrix_scipy function calculates the pairwise Euclidean distances between all pairs of vectors (hidden states) and returns a distance matrix. If $H$ is a matrix where each row represents a vector (hidden state), the Euclidean distance between two vectors $H_i$ and $H_j$ is given by:

$$D_{ij} = \| H_i - H_j \|_2$$

where $\| \cdot \|_2$ denotes the Euclidean norm. The resulting matrix $D$ is symmetric, with each element $D_{ij}$ representing the distance between the $i$-th and $j$-th vectors.

# Helper function to compute the Euclidean distance matrix
def compute_euclidean_distance_matrix_scipy(hidden_states):
    euclidean_distances = pdist(hidden_states.numpy(), metric="euclidean")
    euclidean_distance_matrix = squareform(euclidean_distances)
    return euclidean_distance_matrix

Persistent Homology Computation

compute_persistent_homology uses the distance matrix to compute the persistent homology of the data. Persistent homology is a method from topological data analysis that studies the "shape" of data by examining how topological features (like connected components, loops, etc.) appear and disappear as a threshold parameter changes. For a given distance matrix D D , a simplicial complex is constructed, and its topology is analyzed across different scales. The persistence of topological features is typically represented by a persistence diagram or barcode.

# Helper function to compute the persistent homology of a given distance matrix
def compute_persistent_homology(distance_matrix, max_dimension=3):
    max_edge_length = np.max(distance_matrix)
    rips_complex = gd.RipsComplex(distance_matrix=distance_matrix, max_edge_length=max_edge_length)
    # Simplices up to dimension max_dimension yield homology up to dimension max_dimension - 1
    st = rips_complex.create_simplex_tree(max_dimension=max_dimension)
    persistence = st.persistence()
    return st, persistence
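
To build intuition for what this function returns, here is a minimal sketch on a synthetic point cloud (the four points are made up for illustration):

# Synthetic example: four points at the corners of a unit square.
# Expect four H0 bars (three components dying at edge length 1, one living
# forever) and one H1 bar for the loop, born at 1 and dying at sqrt(2).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
toy_distance_matrix = squareform(pdist(points))
_, toy_persistence = compute_persistent_homology(toy_distance_matrix, max_dimension=2)
print(toy_persistence)  # list of (dimension, (birth, death)) pairs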

Wasserstein Distance Computation

The function compute_wasserstein_distances computes pairwise Wasserstein distances between persistence diagrams. The Wasserstein distance is a measure of the difference between two probability distributions and, in this context, is used to quantify the dissimilarity between persistence diagrams. Mathematically, for two persistence diagrams $P$ and $Q$, the Wasserstein distance $W_q^p(P, Q)$ is defined as:

$$W_q^p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \sum_{(x, y) \in \gamma} \| x - y \|_q^p \right)^{1/p}$$

where $\Gamma(P, Q)$ is the set of all possible matchings between points in $P$ and $Q$, and $\| \cdot \|_q$ is the $q$-norm.

# Helper function to compute the Wasserstein distances between all pairs of persistence diagrams
def compute_wasserstein_distances(persistence_diagrams, dimension):
    n_diagrams = len(persistence_diagrams)
    distances = np.zeros((n_diagrams, n_diagrams))
    # Keep only the (birth, death) points of the requested homology dimension
    filtered_diagrams = [[point for point in diagram if point[0] == dimension] for diagram in persistence_diagrams]
    for i in range(n_diagrams):
        for j in range(i+1, n_diagrams):
            X = np.array([p[1] for p in filtered_diagrams[i]])
            Y = np.array([p[1] for p in filtered_diagrams[j]])
            distance = wasserstein_distance(X, Y)
            # The distance matrix is symmetric
            distances[i][j] = distance
            distances[j][i] = distance
    return distances

Loading ESM-2 Model and Tokenizer

Next, we load the ESM-2 model that we want to use. Here we use facebook/esm2_t6_8M_UR50D from Hugging Face.

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

Define the Layer to use for Embeddings

Now, we choose a layer of the model to use for obtaining the hidden states (embedding vectors). This can be chosen based on the results in The geometry of hidden representations of protein language models and The geometry of hidden representations of large transformer models.

# Define layer to be used
num_layers = model.config.num_hidden_layers
layer = num_layers - 1  # Second-to-last layer; hidden_states[0] is the embedding layer and hidden_states[num_layers] the last
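
Note that model_output.hidden_states contains num_layers + 1 entries, since index 0 holds the embedding-layer output, so layer = num_layers - 1 selects the second-to-last transformer layer. A quick sanity check (the short test sequence here is arbitrary):

# Sanity check: hidden_states has num_layers + 1 entries because index 0
# is the embedding-layer output; index num_layers is the final layer.
encoded = tokenizer(["ACDEFGHIK"], return_tensors="pt")
with torch.no_grad():
    out = model(**encoded, output_hidden_states=True)
print(len(out.hidden_states))          # 7 for esm2_t6_8M_UR50D (6 layers + embeddings)
print(out.hidden_states[layer].shape)  # (1, 11, 320): sequence length 9 plus CLS/EOS tokens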

Computing the Persistence Diagram for a Single Protein

We can compute the persistence diagram for a single protein sequence as follows:

# Define a sample protein sequence
input_sequence = "GLSDGEWQQVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG"

# Compute the hidden states matrix
hidden_states_matrix = get_hidden_states(tokenizer, model, layer, input_sequence)

# Compute the distance matrix
distance_matrix = compute_euclidean_distance_matrix_scipy(hidden_states_matrix)

# Compute the persistent homology
# Note: larger values of max_dimension make the computation significantly slower
# (max_dimension=4 may take several minutes)
st, persistence = compute_persistent_homology(distance_matrix, max_dimension=3)
gd.plot_persistence_diagram(persistence)

This will return a plot of the persistence diagram:

[Persistence diagram for the sample protein sequence]
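
Since the clustering step was described in terms of barcodes, note that GUDHI can render the same information as a barcode:

import matplotlib.pyplot as plt

# The same persistence information rendered as a barcode instead of a diagram
gd.plot_persistence_barcode(persistence)
plt.show()  # needed when running as a script rather than in a notebook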

Clustering Proteins using Persistent Homology

Next, we define a list of proteins we want to cluster using persistent homology. We compute the persistence diagram for each protein, then compute the Wasserstein distance between each pair of persistence diagrams. Finally, we run DBSCAN on the Wasserstein distance matrix to cluster the persistence diagrams. This gives us a semantically rich and meaningful clustering of the proteins based on their persistent homology, that is, based on their persistence diagrams. Note that we could also have used barcodes instead of persistence diagrams, as there is a bijection between the two, making them equivalent representations; if you prefer barcodes, you may swap them in.

sequences = [
    "MAHMTQIPLSSVKRPLQVRGIVFLGICTQKGTVYGNASWVRDQARH",
    "MKHVTQIPKSSVRRPLQFRGICFLGTCQKGTVYGKASWVHDQARHA",
    "MNHITQVPLSSAKRPLQVRGICFLGITCKNGTVYGKACWVRDQARH",

    "MKLITILGLLALATLVQSTGCVTVNAAHCGVTTGQTVCAGVAKCRAE",
    "MKLITILGALALATLVQSTGCVNVNAAHCVTTGQTVCAGVAKCRAET",
    "MKLITILGALALATLVQSTGCVNVNAAHCVTAGQTVCAGVAKCRAETS",

    "MGSSHHHHHHSSGLVPRGSHMENITVVKFNGTQTFEVHPNVSVGQAGV",
    "MGSSHHHHHHSSGLVPRGSHMENITVVKFNGTQTFEVHPNVSVGQAGVR",
    "MGSSHHHHHHSSGLVPRGSHMENITVVKFNGTQTFEVHPNVSVGQAGVRR",

    "MGGHNGWQILVKGKWTTMDFLRNAVIDQKLRRARRELKLMKAFESLK",
    "MGGHNGWQILVKGKWTTMDFLRNAVIDQKLRRARRELKLMKAFESLKN",
    "MGGHNGWQILVKGKWTTMDFLRNAVIDQKLRRARRELKLMKAFESLKNN",

    "MAQSNISDAMVQLTPAGRSLMLLVQHGSQVAAGVTFQDNQRFPGGRD",
    "MAQSNISDAMVQLTPAGRSLMLLVQHGSQVAAGVTFQDNQRFPGGRDF",
    "MAQSNISDAMVQLTPAGRSLMLLVQHGSQVAAGVTFQDNQRFPGGRDFF"
]


# Initialize list to store persistent diagrams
persistent_diagrams = []

# Compute persistent homology for each sequence
for sequence in sequences:
    hidden_states_matrix = get_hidden_states(tokenizer, model, layer, sequence)
    distance_matrix = compute_euclidean_distance_matrix_scipy(hidden_states_matrix)
    _, persistence_diagram = compute_persistent_homology(distance_matrix)
    
    # Store the persistent diagram
    persistent_diagrams.append(persistence_diagram)

Here, we have a choice to make as to which dimension to use for persistent homology. We can choose dimension 0 to cluster based on zero-dimensional features (connected components), or higher dimensions to cluster based on higher-dimensional topological features such as loops. The method has not been extensively tested to determine which is best, so we leave this choice to you for now. Here we choose dimension zero to keep things interpretable and simple.
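
For example, clustering on one-dimensional features only requires changing the dimension argument (untested here; this assumes every diagram contains at least one dimension-1 feature):

# Alternative: compare diagrams on their dimension-1 features (loops) instead
wasserstein_distances_h1 = compute_wasserstein_distances(persistent_diagrams, 1)

Continuing with dimension zero: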

# Compute the Wasserstein distances between all pairs of persistence diagrams
wasserstein_distances = compute_wasserstein_distances(persistent_diagrams, 0)
# Compute the persistent homology of the Wasserstein distance matrix
st_2, persistence_2 = compute_persistent_homology(wasserstein_distances)
# Plot the persistence diagram
gd.plot_persistence_diagram(persistence_2)

This will output another persistence diagram, from which we choose the epsilon for our DBSCAN. In particular, we look for large gaps between the red (dimension-0) points and choose an epsilon that falls in one of these gaps, adjusting until we get a high silhouette score for the DBSCAN:

[Persistence diagram of the Wasserstein distance matrix]

DBSCAN Clustering

Finally, DBSCAN clustering is applied to cluster the persistence diagrams based on their Wasserstein distances. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an algorithm that groups together points that are closely packed together while marking points in low-density regions as outliers. The silhouette coefficient is used to evaluate the quality of the clustering. Here, we choose eps=40.0, based on the above persistence diagram.
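
Instead of reading the gap off the diagram by hand, one could also scan a range of candidate epsilon values and keep the one with the best silhouette score. A rough sketch (the candidate range here is arbitrary):

# Sketch: scan candidate eps values and keep the best-scoring clustering
best_eps, best_score = None, -1.0
for eps in np.linspace(5.0, 100.0, 20):
    labels = DBSCAN(metric="precomputed", eps=eps, min_samples=1).fit(wasserstein_distances).labels_
    n_clusters = len(set(labels))
    if 1 < n_clusters < len(sequences):  # silhouette needs between 2 and n-1 clusters
        score = silhouette_score(wasserstein_distances, labels, metric="precomputed")
        if score > best_score:
            best_eps, best_score = eps, score
print(best_eps, best_score)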

# Perform DBSCAN clustering of persistence diagrams
dbscan = DBSCAN(metric="precomputed", eps=40.0, min_samples=1).fit(wasserstein_distances)
labels_dbscan = dbscan.labels_
if len(set(labels_dbscan)) > 1:  # More than 1 cluster
    print("Silhouette Coefficient for DBSCAN: %0.3f" % silhouette_score(wasserstein_distances, labels_dbscan, metric="precomputed"))
else:
    print("Cannot compute Silhouette Coefficient for DBSCAN as there is only one cluster.")
# Print the clusters for DBSCAN
print("\nClusters for DBSCAN:")
dbscan_clusters = {i: [] for i in set(labels_dbscan)}
for sequence, label in zip(sequences, labels_dbscan):
    dbscan_clusters[label].append(sequence)
for label, cluster in dbscan_clusters.items():
    print(f"Cluster {label}: {cluster}")
Clusters for DBSCAN:
Cluster 0: ['MAHMTQIPLSSVKRPLQVRGIVFLGICTQKGTVYGNASWVRDQARH', 'MNHITQVPLSSAKRPLQVRGICFLGITCKNGTVYGKACWVRDQARH']
Cluster 1: ['MKHVTQIPKSSVRRPLQFRGICFLGTCQKGTVYGKASWVHDQARHA']
Cluster 2: ['MKLITILGLLALATLVQSTGCVTVNAAHCGVTTGQTVCAGVAKCRAE']
Cluster 3: ['MKLITILGALALATLVQSTGCVNVNAAHCVTTGQTVCAGVAKCRAET', 'MKLITILGALALATLVQSTGCVNVNAAHCVTAGQTVCAGVAKCRAETS']
Cluster 4: ['MGSSHHHHHHSSGLVPRGSHMENITVVKFNGTQTFEVHPNVSVGQAGV', 'MGSSHHHHHHSSGLVPRGSHMENITVVKFNGTQTFEVHPNVSVGQAGVR', 'MGSSHHHHHHSSGLVPRGSHMENITVVKFNGTQTFEVHPNVSVGQAGVRR']
Cluster 5: ['MGGHNGWQILVKGKWTTMDFLRNAVIDQKLRRARRELKLMKAFESLK', 'MGGHNGWQILVKGKWTTMDFLRNAVIDQKLRRARRELKLMKAFESLKN', 'MGGHNGWQILVKGKWTTMDFLRNAVIDQKLRRARRELKLMKAFESLKNN', 'MAQSNISDAMVQLTPAGRSLMLLVQHGSQVAAGVTFQDNQRFPGGRD', 'MAQSNISDAMVQLTPAGRSLMLLVQHGSQVAAGVTFQDNQRFPGGRDF', 'MAQSNISDAMVQLTPAGRSLMLLVQHGSQVAAGVTFQDNQRFPGGRDFF']

Now that we have our clusters of proteins, we can use the rest of the methods described in Leveraging protein language models for accurate multiple sequence alignments and Vector-clustering Multiple Sequence Alignment: Aligning into the twilight zone of protein sequence similarity with protein language models to align the protein sequences. In particular, we have thus far replaced steps (A) and (B) in the figure below with a persistent homology clustering of the proteins, which is expected to perform better.

[Figure: overview of the vcMSA pipeline]

We note that at this point we can remove outliers from the clusters as is done in vcMSA, or we can use the Fréchet mean persistence diagram as a way of determining the outliers. For more information on Fréchet means (barycenters) of persistence diagrams, see here, and the example notebook provided.
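
One way to sketch the Fréchet mean approach uses GUDHI's lagrangian_barycenter. This is a rough sketch, not the vcMSA implementation: it requires the optional POT dependency, the cluster membership below is illustrative, and only finite points are used since the barycenter routine does not handle points at infinity:

from gudhi.wasserstein.barycenter import lagrangian_barycenter

# Keep only the finite H0 points of a GUDHI persistence diagram
def finite_h0_points(diagram):
    return np.array([p[1] for p in diagram if p[0] == 0 and np.isfinite(p[1][1])])

# Illustrative: members of one DBSCAN cluster, by index into sequences
cluster_indices = [0, 2]
cluster_diagrams = [finite_h0_points(persistent_diagrams[i]) for i in cluster_indices]

# Frechet mean (barycenter) of the cluster's diagrams
barycenter = lagrangian_barycenter(cluster_diagrams)

# Members unusually far from the barycenter are outlier candidates
dists = [wasserstein_distance(d, barycenter) for d in cluster_diagrams]
print(dists)

The rest of the implementation can be done following the GitHub repo for the papers. For example, we can now implement step (3) to find the reciprocal best hits inside each cluster using the following function: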

def identify_rbh(hidden_states_matrices, threshold=0.8):
    """
    Identifies Reciprocal Best Hits (RBHs) based on cosine similarity.
    An RBH is a pair of positions (idx_i, idx_j) such that idx_j is the best
    hit of idx_i in the other protein and vice versa.
    :param hidden_states_matrices: List of hidden states matrices for each protein.
    :param threshold: Cosine similarity threshold for considering a hit.
    :return: Dictionary of RBH pairs for each protein pair.
    """
    num_proteins = len(hidden_states_matrices)
    rbh_pairs = {}

    for i in range(num_proteins):
        for j in range(i + 1, num_proteins):
            A = np.asarray(hidden_states_matrices[i])
            B = np.asarray(hidden_states_matrices[j])

            # Cosine similarity between all pairs of amino acid embeddings
            A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
            B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
            similarity = A_norm @ B_norm.T

            best_in_j = similarity.argmax(axis=1)  # best hit in protein j for each position of protein i
            best_in_i = similarity.argmax(axis=0)  # best hit in protein i for each position of protein j

            # Keep pairs that are mutual best hits above the threshold
            rbh_pairs[(i, j)] = [
                (idx_i, int(best_in_j[idx_i]))
                for idx_i in range(A.shape[0])
                if best_in_i[best_in_j[idx_i]] == idx_i
                and similarity[idx_i, best_in_j[idx_i]] > threshold
            ]

    return rbh_pairs

This will provide the cosine similarity between amino acid embedding vectors within each cluster and allow us to identify the pairs of amino acids that are mutual best hits across sequences.
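
As an illustrative usage (assuming the DBSCAN cluster labels from the run above), we can gather the hidden states for the members of one cluster and inspect the RBH pairs between its first two sequences:

# Illustrative usage on one of the clusters found above (here cluster 5)
cluster_sequences = dbscan_clusters[5]
cluster_hidden_states = [get_hidden_states(tokenizer, model, layer, seq) for seq in cluster_sequences]
rbh = identify_rbh(cluster_hidden_states, threshold=0.8)
print(rbh[(0, 1)][:5])  # first few RBH position pairs between members 0 and 1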

Conclusion

This new method (PHA) is intended as a replacement for traditional Multiple Sequence Alignments (MSAs), and is expected to be more robust and applicable to protein sequences with low sequence similarity or long indels. We note that this circumvents the need for an MSA algorithm, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. Moreover, the added information from contextual embeddings leads to higher-accuracy alignments for structurally similar proteins with low amino-acid similarity. This provides a major step forward in replacing MSAs with more robust methods that accomplish the same goal.

A question now arises: if protein language models like ESM-2 and ESMFold, which were trained on single sequences without MSAs, coupled with more geometric methods like vcMSA or PHA, can replace the usual MSAs, why not build this into our deep learning models to eliminate the need for MSAs altogether? MSA construction is a key reason that AlphaFold2 is so much slower than ESMFold, and it is a huge bottleneck in terms of compute and time. While MSAs often provide improved accuracy on certain tasks, they are difficult or impossible to construct for many proteins. Moreover, for more complicated structures involving protein complexes interacting with small molecules, DNA, or RNA, they are even more difficult to construct. Thus, a model that uses an alternative method like PHA or vcMSA, or that circumvents the need for them altogether, would be very useful.