Faster Persistent Homology Alignment and Protein Complex Clustering with ESM-2 and Persistence Landscapes

Community Article Published November 30, 2023

In this article we will discuss a faster, more computationally efficient method for computing Persistent Homology Alignment (PHA) for proteins, which serves as a replacement for Multiple Sequence Alignments that works for twilight zone proteins and orphaned proteins. We also discuss how this can be used for sequence similarity and homology modeling of protein-protein complexes. Both methods use the protein language model ESM-2.

image/png

Twilight Zone Proteins

"Twilight zone proteins" is a term used in bioinformatics and molecular biology to describe protein sequences that have very low sequence identity to each other, typically below 20-35%. This term comes from the context of protein sequence alignment and refers to a region where the similarity between sequences is so low that it becomes challenging to confidently establish evolutionary relationships or structural similarities based solely on sequence comparison.

Sequence Alignment Challenges: Traditional sequence alignment algorithms may struggle to generate accurate alignments because common features are scarce and the signal-to-noise ratio is low. The low sequence identity means that the sequences have diverged significantly, making it difficult to identify homologous regions (regions derived from a common ancestral sequence).

Structural Conservation: Despite low sequence identity, proteins in the twilight zone may still share similar three-dimensional structures and functions. This is because protein structures are generally more conserved than their sequences. Structural conservation amidst low sequence similarity can provide insights into critical functional domains and evolutionary processes.

Significance for Evolutionary Studies: Understanding relationships among twilight zone proteins can be important for evolutionary biology, as it helps in tracing the divergence of species and the development of new functions in proteins.

Advanced Techniques for Analysis: To analyze twilight zone proteins, scientists often rely on advanced methods beyond simple sequence comparison. These may include structure-based alignment techniques and machine learning approaches that can consider more complex patterns and relationships beyond direct sequence similarity. We will be replacing MSA with Persistent Homology Alignment (PHA).

The term "twilight zone" highlights the complexity and ambiguity in analyzing sequences that are distantly related. It underscores the importance of integrating multiple sources of biological information, including structural, functional, and evolutionary data, to gain a comprehensive understanding of these proteins.

Persistence Landscapes

Persistence Landscapes: A Mathematical Overview

Persistence landscapes constitute a pivotal tool in topological data analysis (TDA), offering a stable and easily interpretable representation of persistent homology, a method to extract multi-scale topological features from data. This section delves into the mathematical framework underpinning persistence landscapes, elucidating their construction, properties, and utility in TDA.

Topological Data Analysis (TDA) is somewhat newer field in data science that focuses on the study of the shape (topology) of data. TDA aims to uncover intrinsic geometric and topological structures within datasets, which often remain hidden under traditional analysis methods. A key concept in TDA is persistent homology, which tracks the evolution of topological features (like connected components, holes, voids) across multiple scales.

Persistent homology quantifies multi-scale topological features of a dataset. Given a point cloud data, one constructs a filtration of simplicial complexes, typically via the Vietoris-Rips or Čech complexes, leading to a sequence of nested topological spaces. As the scale parameter increases, new topological features appear and existing ones may merge or disappear. Each feature is characterized by its birth and death times, which are the scale parameters at which the feature appears and disappears, respectively. Formally, for a filtration parameter t t , a persistence pair is denoted as (b,d) (b,d) , where b b is the birth time and d d is the death time of a topological feature. The persistence diagram is a multiset of such points in the plane, often visualized in a scatter plot.

A persistence landscape is a functional representation of a persistence diagram. It transforms the multiset of points into a sequence of continuous, piecewise-linear functions that are amenable to statistical analysis. Given a persistence diagram with points {(bi,di)} \{(b_i, d_i)\} , the persistence landscape consists of a sequence of functions {λk} \{\lambda_k\} , where each λk:RR0 \lambda_k: \mathbb{R} \to \mathbb{R}_{\geq 0} is defined as follows:

λk(t)=sup{min(tbi,dit):such that t[bi,di] and (bi,di) is the k-th largest point} \lambda_k(t) = \sup\{\min(t-b_i, d_i-t) : \text{such that } t \in [b_i, d_i] \text{ and } (b_i, d_i) \text{ is the } k\text{-th largest point}\}

Essentially, for each point in the persistence diagram, a "tent" function is constructed, centered at the midpoint of its persistence interval and with a height equal to half the persistence (difference between death and birth times). The k k -th landscape function λk \lambda_k is the k k -th largest value among these tent functions at each time t t .

  • Stability: Persistence landscapes are stable under small perturbations in the input data, making them robust features for analysis.
  • Vectorization: Each λk \lambda_k can be discretized and treated as a vector in a Euclidean space, facilitating the application of standard statistical and machine learning techniques.
  • Combinatorial Structure: The piecewise-linear nature simplifies computations and interpretations.

Persistence landscapes have been successfully applied in various domains, including shape analysis, signal processing, and biological data analysis. They provide a compact and stable summary of topological features, allowing for effective comparison, clustering, and classification of datasets based on their topological signatures. Persistence landscapes offer a powerful and versatile tool from TDA. By transforming complex topological information into a more digestible and analyzable form, they bridge the gap between abstract topological concepts and practical data analysis, paving the way for novel insights into the intrinsic structure of complex datasets.

Below we have a script for generating three random persistence diagrams with 25 points each. We then subsequently compute the persistence landscapes for each of the persistence diagrams, and compute their pairwise L2L_2-distances to get a distance matrix. From this distance matrix we compute a second level persistence diagram.

import numpy as np
import matplotlib.pyplot as plt
from gudhi.representations import Landscape

# Number of diagrams and points
num_diagrams = 3
num_points = 25

# Generate three random persistence diagrams each with 25 points
# Assuming birth and death times are uniformly distributed between 0 and 10
diagrams = [np.random.uniform(0, 10, (num_points, 2)) for _ in range(num_diagrams)]

# Ensure that death time is greater than birth time
for d in diagrams:
    d[:, 1] = np.maximum(d[:, 1], d[:, 0] + np.random.uniform(0, 1, num_points))

# Compute landscapes for each diagram
landscape = Landscape(num_landscapes=3, resolution=100)
landscapes = landscape.fit_transform(diagrams)

# Plot the landscapes
plt.figure(figsize=(15, 5))
for i, l in enumerate(landscapes, start=1):
    plt.subplot(1, num_diagrams, i)
    plt.plot(l[:100], label="Landscape 1")
    plt.plot(l[100:200], label="Landscape 2")
    plt.plot(l[200:], label="Landscape 3")
    plt.title(f"Diagram {i} Landscapes")
    plt.xlabel("Sample Points")
    plt.ylabel("Amplitude")
    plt.legend()

plt.tight_layout()
plt.show()

The landscapes should looks similar to the following landscapes:

image/png

The pairwise L2L_2-distance matrix will then look something like this:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import cdist
import gudhi as gd
from gudhi import RipsComplex
from gudhi.representations import DiagramSelector

# Assuming 'landscapes' variable contains the landscapes computed previously
# If 'landscapes' is not available, we can use a placeholder or regenerate them

# Compute L2 distances between each pair of landscapes
dist_matrix = cdist(landscapes, landscapes, metric='euclidean')

# Print the distance matrix as a heatmap
plt.figure(figsize=(6, 6))
sns.heatmap(dist_matrix, annot=True, cmap='viridis')
plt.title("L2 Distance Matrix Heatmap")
plt.xlabel("Landscape Index")
plt.ylabel("Landscape Index")
plt.show()

# Compute the persistence diagram of the distance matrix with Gudhi's Rips complex
rips_complex = RipsComplex(distance_matrix=dist_matrix)
simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)
diag = simplex_tree.persistence()

# Plot the persistence diagram
plt.figure()
gd.plot_persistence_diagram(diag)
plt.title("Persistence Diagram of the Rips Complex")
plt.xlabel("Birth")
plt.ylabel("Death")
plt.show()

image/png

Once we have the L2L_2-distance matrix in hand, we get a second level persistence diagram which will look something like this:

image/png

Now we can use this persistence diagram to perform a DBSCAN clustering on the persistence landscapes by choosing a distance threshold that captures some percentage of the red points in this second level persistence diagram. In the next section we will do this for protein-protein complexes with a threshold that captures 80% of the points in this second level persistence diagram. In other words, we choose a distance threshold ϵ\epsilon for the DBSCAN such that 80% of the points in the diagram fall below this threshold.

Clustering Protein Sequences and Protein-Protein Complexes

If you would like to run the following script, start by downloading this dataset of pairs of interacting human proteins (originally obtained from UniProt), or head over to UniProt to curate your own dataset of interacting protein pairs. Make sure to adjust the path to your file in the script before running it. This script will use the method describe above to clustering interacting pairs of proteins using the protein language model ESM-2. This is a substantial improvement in terms of speed over the methods describe in this post. In particular, computing the L2L_2-distances between pairs of landscapes for 1000 protein-protein complexes can be done in about 30 minutes rather than requiring a full day to compute the Wasserstein distances between pairs of persistence diagrams, or several hours for the bottleneck distance.

import pandas as pd
import numpy as np
from transformers import EsmModel, AutoTokenizer
import torch
from scipy.spatial.distance import pdist, squareform
from gudhi import RipsComplex
from gudhi.representations.vector_methods import Landscape
from sklearn.cluster import DBSCAN
# import matplotlib.pyplot as plt
from tqdm import tqdm

# Define a helper function for hidden states
def get_hidden_states(sequence, tokenizer, model, layer):
    model.config.output_hidden_states = True
    encoded_input = tokenizer([sequence], return_tensors='pt', padding=True, truncation=True, max_length=1024)
    with torch.no_grad():
        model_output = model(**encoded_input)
    hidden_states = model_output.hidden_states
    specific_hidden_states = hidden_states[layer][0]
    return specific_hidden_states.numpy()

# Define a helper function for Euclidean distance matrix
def compute_euclidean_distance_matrix(hidden_states):
    euclidean_distances = pdist(hidden_states, metric='euclidean')
    euclidean_distance_matrix = squareform(euclidean_distances)
    return euclidean_distance_matrix

# Define a helper function for persistent homology
def compute_persistent_homology(distance_matrix, max_dimension=0):
    max_edge_length = np.max(distance_matrix)
    rips_complex = RipsComplex(distance_matrix=distance_matrix, max_edge_length=max_edge_length)
    st = rips_complex.create_simplex_tree(max_dimension=max_dimension)
    st.persistence()
    persistence_pairs = np.array([[p[1][0], p[1][1]] for p in st.persistence() if p[0] == 0 and p[1][1] < np.inf])  # Filter out infinite death times
    return st, persistence_pairs

# Define a helper function for persistent homology
def compute_persistent_homology2(distance_matrix, max_dimension=0):
    max_edge_length = np.max(distance_matrix)
    rips_complex = RipsComplex(distance_matrix=distance_matrix, max_edge_length=max_edge_length)
    st = rips_complex.create_simplex_tree(max_dimension=max_dimension)
    st.persistence()
    return st, st.persistence()

# Define a helper function for Landscape transformations with tqdm
def compute_landscapes(persistence_diagrams, num_landscapes=5, resolution=10000):
    landscape_transformer = Landscape(num_landscapes=num_landscapes, resolution=resolution)
    landscapes = landscape_transformer.fit_transform([d for d in persistence_diagrams if len(d) > 0])  # Filter out empty diagrams
    return landscapes

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Define layer to be used
layer = model.config.num_hidden_layers - 1

# Load the TSV file
file_path = 'filtered_protein_interaction_pairs.tsv'
protein_pairs_df = pd.read_csv(file_path, sep='\t')

# Only process the first 1000 proteins
protein_pairs_df = protein_pairs_df.head(1000)

# Extract concatenated sequences
concatenated_sequences = protein_pairs_df['Protein1'] + protein_pairs_df['Protein2']

# Initialize list to store persistent diagrams
persistent_diagrams = []

# Loop over concatenated sequences to compute their persistent diagrams
for sequence in tqdm(concatenated_sequences, desc="Computing Persistence Diagrams"):
    hidden_states_matrix = get_hidden_states(sequence, tokenizer, model, layer)
    distance_matrix = compute_euclidean_distance_matrix(hidden_states_matrix)
    _, persistence_diagram = compute_persistent_homology(distance_matrix)
    persistent_diagrams.append(persistence_diagram)

# Compute landscapes
landscapes = compute_landscapes(persistent_diagrams)

# Compute the L2 distances between landscapes
with tqdm(total=len(landscapes)*(len(landscapes)-1)//2, desc="Computing Pairwise L2 Distances") as pbar:
    l2_distances = np.zeros((len(landscapes), len(landscapes)))
    for i in range(len(landscapes)):
        for j in range(i+1, len(landscapes)):
            l2_distances[i, j] = l2_distances[j, i] = np.linalg.norm(landscapes[i] - landscapes[j])
            pbar.update(1)

# Compute the second-level persistent homology using the L2 distance matrix
with tqdm(total=1, desc="Computing Second-Level Persistent Homology") as pbar:
    st_2, persistence_2 = compute_persistent_homology2(l2_distances)
    pbar.update(1)

# Function to calculate the epsilon for DBSCAN
def calculate_epsilon(persistence_diagrams, threshold_percentage, max_eps=np.inf):
    lifetimes = [p[1][1] - p[1][0] for p in persistence_diagrams if p[0] == 0]
    lifetimes.sort()
    threshold_index = int(threshold_percentage * len(lifetimes))
    epsilon = lifetimes[threshold_index]
    # Ensure epsilon is within a reasonable range
    epsilon = min(epsilon, max_eps)
    return epsilon

# Calculate epsilon with a maximum threshold
threshold_percentage = 0.8  # 80%
max_epsilon = 5000.0  # Example maximum threshold
epsilon = calculate_epsilon(persistence_2, threshold_percentage, max_eps=max_epsilon)

# Perform DBSCAN clustering
with tqdm(total=1, desc="Performing DBSCAN Clustering") as pbar:
    dbscan = DBSCAN(metric="precomputed", eps=epsilon, min_samples=1)
    dbscan.fit(l2_distances)  # Use L2 distances here
    labels = dbscan.labels_
    pbar.update(1)

# Add the cluster labels to the DataFrame
protein_pairs_df['Cluster'] = labels

# Save the DataFrame with cluster information
output_file_path = 'clustered_protein_interaction_pairs_l2_distances.tsv'
protein_pairs_df.to_csv(output_file_path, sep='\t', index=False)

print(f"Clustered data saved to: {output_file_path}")

We note that adjusting the num_landscapes parameter and the resolution to appropriate values is still very much in the experimental phase, but this will get you started in your journey to clustering protein-protein complexes using ESM-2 and persistent homology.

Replacing MSA with PHA

Next, we discuss the potential to replace Multiple Sequence Alignments (MSA) with Persistent Homology Alignment (PHA), following this blog post. In our setup, we now have a way of constructing persistence landscapes for any protein, very similar to the above procedure, but now for single proteins instead of concatenated pairs of proteins. We also have a way of constructing a second level persistence diagram based on the L2L_2-distances between the persistence landscapes. From this, we perform the DBSCAN as before, the proceed with the vcMSA methods described in this blog post.

As a next step, you might try other methods for vectorizing persistence diagrams in place of persistence landscapes. For example you might try persistence images, or some other method explained in the Gudhi documentation. You might also experiment with other measures of similarity other than the L2L_2-norm.

Conclusion

In conclusion, this article has introduced an innovative approach for analyzing protein sequences, particularly those in the challenging "twilight zone", by utilizing Persistent Homology Alignment (PHA) and the advanced protein language model ESM-2. The methods discussed here offer a significant improvement over traditional Multiple Sequence Alignments, particularly in terms of computational efficiency and their ability to handle proteins with low sequence identity.

The application of Persistence Landscapes in topological data analysis has been shown to be a powerful tool for extracting meaningful topological features from protein data. This mathematical framework enables the handling of complex data structures and provides a stable method for comparing and clustering proteins based on their intrinsic topological properties. Furthermore, the integration of ESM-2 in clustering protein-protein complexes marks a substantial advancement in the field of bioinformatics. This method not only expedites the computational process but also enhances the accuracy of protein interaction predictions. The use of PHA, coupled with the vcMSA methods, further demonstrates the potential for these techniques in replacing traditional sequence alignment methods, especially for proteins with obscure or unique sequences.

Overall, the advancements in computational methods and the integration of sophisticated models like ESM-2 and PHA in protein sequence analysis represent a significant stride forward in the field of bioinformatics. These methods open new avenues for understanding the complex world of proteins, their interactions, and their evolutionary relationships. They also pave the way for future research where further refinements and innovations could lead to even more efficient and accurate methods for protein analysis.