Seq-to-Pheno
AI & ML interests
Genomes & Proteins
Project PhenoSeq: Protein Network Analysis for Phenotypic Outcomes
The models showed promising results on basic prediction tasks (a Pearson correlation of 0.55 for the baseline DepMap model), but the project also identified key areas for improvement in modeling protein-phenotype relationships, notably cell line-specific and mutation-specific effects. The findings provide a foundation for future work on protein network analysis and phenotype prediction in computational biology.
Project Overview
PhenoSeq investigates how protein networks contribute to organism-scale phenotypes, particularly cancer growth and organism longevity. The project combines protein embeddings from ESM (Evolutionary Scale Modeling) with graph neural networks to predict phenotypic outcomes through protein-protein interactions (PPIs).
Core Objectives
- Develop predictive models for understanding biological drivers of complex diseases
- Create frameworks for inferring oncogenic potential of genetic mutations
- Analyze clinical significance of protein modifications using sequence embeddings
- Establish connections between protein networks and phenotypic outcomes
Data Sources
The project utilized three major public databases:
- DepMap: CRISPR-based experimental data measuring protein deletion effects on cancer cell proliferation
- TCGA: The Cancer Genome Atlas data
- Longevity Database: Species longevity information
Methodological Approach
Model Development
The team developed three distinct models:
Baseline Model
- Fully connected network predicting CRISPR scores from embeddings
- Achieved a Pearson correlation of 0.55 with ground truth
- Outperformed a K-nearest neighbors (KNN) baseline
- Performance varied with distance to the nearest neighbors in the training set
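The comparison against a KNN baseline can be illustrated with a minimal sketch. The toy 2-D vectors, the scores, and the helper names `pearson` and `knn_predict` are assumptions made for illustration; they are not the project's data or code, and Euclidean distance is an arbitrary choice here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def knn_predict(train_x, train_y, query, k):
    """Average the scores of the k training vectors closest to `query`."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: math.dist(train_x[i], query))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / k

# Toy stand-ins for protein embeddings and CRISPR scores (made-up values).
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_y = [-1.0, -0.5, -0.4, 0.0]
test_x = [[0.1, 0.1], [0.9, 0.9], [0.9, 0.1]]
test_y = [-0.9, -0.1, -0.5]

# Predict each held-out score from its single nearest training neighbor,
# then score the predictions with the same metric used for the models.
preds = [knn_predict(train_x, train_y, q, k=1) for q in test_x]
r = pearson(preds, test_y)
```

Evaluating the baseline and the trained network with the same correlation metric on the same held-out proteins is what makes the "outperformed KNN" comparison meaningful.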
Cell Line-Specific Model
- Incorporated cell line identity through one-hot encoding
- Included mutation status (wild type vs. mutated)
- Achieved a Pearson correlation of 0.44 with ground truth
- Limited success in predicting cell line-specific differences
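A minimal sketch of how such an input could be assembled is shown below. It assumes simple concatenation of the protein embedding, a one-hot cell line vector, and a binary mutation flag; the source does not say how the features were combined, and the function and cell line names are hypothetical.

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def build_input(protein_embedding, cell_line, cell_lines, is_mutated):
    """Concatenate embedding, one-hot cell line identity, and mutation flag."""
    cell_vec = one_hot(cell_lines.index(cell_line), len(cell_lines))
    mutation_flag = [1.0 if is_mutated else 0.0]
    return list(protein_embedding) + cell_vec + mutation_flag

cell_lines = ["line_A", "line_B", "line_C"]  # hypothetical identifiers
embedding = [0.2, -0.7, 1.1, 0.05]           # stand-in for a protein embedding
features = build_input(embedding, "line_B", cell_lines, is_mutated=True)
```

With this layout, two cell lines that share the same wild-type protein differ only in the one-hot segment of the input, so any cell line-specific signal has to be learned from that segment alone.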
PPI-Informed Model
- Integrated protein-protein interaction data
- Results comparable to cell line-specific model
- Limited additional performance gain from PPI integration
Additional Analyses
Species Longevity Analysis
- Challenges in cross-phylogenetic prediction
- Limited success across different orders of the phylogenetic tree
TCGA Patient Survival Analysis
- Achieved significant correlations
- Performance below initial expectations
Key Findings
- ESM3 embeddings contain valuable functional information
- Simple models can outperform basic baselines
- The current approach struggles to capture subtle effects
- Predicting mutation-specific impacts remains challenging
Future Directions
- Integration of additional data types:
  - Copy number variation
  - Transcriptomic information
- Exploration of amino acid level embeddings
- Enhanced signal processing methods
- Improved model architectures
Technical Achievements
- Successful implementation of protein embedding analysis
- Development of multiple predictive models
- Integration of complex biological datasets
- Novel approaches to phenotype prediction
Limitations and Challenges
- Limited success in cell line-specific predictions
- Challenges in cross-phylogenetic predictions
- Difficulty detecting subtle effects
- Data integration complexities
Impact and Applications
- Enhanced understanding of disease mechanisms
- Improved drug target identification
- Better prediction of genetic mutation effects
- Advanced protein function analysis
PhenoSeq Longevity Analysis Component
This analysis revealed both the potential and limitations of using protein sequence data for predicting species longevity, highlighting the importance of taxonomic relationships in such predictions.
Overview
The longevity analysis component of PhenoSeq investigated the relationship between protein sequences and species lifespan across different taxonomic orders, with a particular focus on Primates, Chiroptera (bats), and Cetacea (whales).
Key Findings
1. Taxonomic Order Analysis
- The study examined lifespan distributions across multiple orders, including:
  - Rodentia
  - Artiodactyla
  - Carnivora
  - Primates
  - Chiroptera
  - Cetacea
  - Diprotodontia
  - Perissodactyla
2. Prediction Performance
- Mean predictions across orders were relatively successful
- However, predictions within individual orders showed limited accuracy
- High-performing proteins were not well conserved between different orders
3. Model Architecture Insights
- Later layers in the neural network did not provide significant additional information
- Training curves showed convergence but with limitations in prediction accuracy
4. Protein Embedding Analysis
- Analysis of protein ALDOB showed that:
  - Nearest neighbor species in embedding space typically belonged to the same Order/Family
  - Strong taxonomic clustering was observed in the embedding space
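The nearest-neighbor check described for ALDOB can be sketched as follows. Cosine similarity, the toy 2-D vectors, and the species names are assumptions for illustration; the source does not state which distance measure was used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor(name, embeddings):
    """Return the other species whose embedding is most similar to `name`."""
    others = (s for s in embeddings if s != name)
    return max(others, key=lambda s: cosine(embeddings[name], embeddings[s]))

def same_order_fraction(embeddings, order_of):
    """Fraction of species whose nearest neighbor shares their order."""
    hits = sum(order_of[s] == order_of[nearest_neighbor(s, embeddings)]
               for s in embeddings)
    return hits / len(embeddings)

# Toy per-species embeddings of one protein (made-up values).
embeddings = {
    "species_1": [1.0, 0.1],
    "species_2": [0.9, 0.2],
    "species_3": [0.1, 1.0],
    "species_4": [0.2, 0.9],
}
order_of = {
    "species_1": "Primates",
    "species_2": "Primates",
    "species_3": "Chiroptera",
    "species_4": "Chiroptera",
}
fraction = same_order_fraction(embeddings, order_of)
```

A fraction close to 1 would indicate the kind of taxonomic clustering reported above, where embeddings mostly encode relatedness between species.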
5. Hierarchical Prediction Accuracy
Correlation strength increased with taxonomic specificity:
- Order level: r = 0.8 (271 species across 12 orders)
- Family level: r = 0.92 (191 species across 27 families)
- Genus level: r = 0.97 (47 species across 15 genera)
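The source does not say how these per-level correlations were computed. One possible setup, shown purely as an assumption, predicts each species' lifespan from the mean lifespan of the other species in its taxon and then correlates those predictions with the observed values. The records and helper names below are made up for illustration.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def taxon_mean_predictions(records, level):
    """Predict each lifespan as the mean of the other species in its taxon."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[level]].append(rec["lifespan"])
    preds, actual = [], []
    for rec in records:
        members = groups[rec[level]]
        if len(members) < 2:
            continue  # no other species in this taxon to average over
        preds.append((sum(members) - rec["lifespan"]) / (len(members) - 1))
        actual.append(rec["lifespan"])
    return preds, actual

# Hypothetical species records (lifespans in years, made-up values).
records = [
    {"order": "A", "genus": "a1", "lifespan": 10.0},
    {"order": "A", "genus": "a1", "lifespan": 12.0},
    {"order": "A", "genus": "a2", "lifespan": 30.0},
    {"order": "A", "genus": "a2", "lifespan": 34.0},
    {"order": "B", "genus": "b1", "lifespan": 50.0},
    {"order": "B", "genus": "b1", "lifespan": 54.0},
    {"order": "B", "genus": "b2", "lifespan": 80.0},
    {"order": "B", "genus": "b2", "lifespan": 90.0},
]
r_order = pearson(*taxon_mean_predictions(records, "order"))
r_genus = pearson(*taxon_mean_predictions(records, "genus"))
```

In this toy data the genus-level correlation is higher than the order-level one, which mirrors the direction of the reported trend. Species without a second member in their taxon are skipped, so the number of usable species shrinks at finer levels.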
Technical Limitations
- Limited success in cross-order predictions
- Difficulty in generalizing predictions across distant phylogenetic relationships
- Need for order/family-specific modeling approaches
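Cross-order generalization is commonly measured by holding out one entire order at a time. The sketch below shows such a split under the assumption that each species carries a single order label; the species names and the `leave_one_order_out` helper are hypothetical, and the source does not confirm that this exact protocol was used.

```python
def leave_one_order_out(species_to_order):
    """Yield (held_out_order, train_species, test_species) for each order."""
    for held_out in sorted(set(species_to_order.values())):
        train = sorted(s for s, o in species_to_order.items() if o != held_out)
        test = sorted(s for s, o in species_to_order.items() if o == held_out)
        yield held_out, train, test

# Hypothetical species-to-order mapping.
species_to_order = {
    "sp1": "Rodentia",
    "sp2": "Rodentia",
    "sp3": "Primates",
    "sp4": "Chiroptera",
    "sp5": "Chiroptera",
}
splits = list(leave_one_order_out(species_to_order))
```

Because no species from the held-out order appears in training, a model that relies mainly on taxonomic similarity would be expected to score poorly under this split.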
Key Insights
- Strong within-taxon predictions
- Decreasing accuracy with increasing phylogenetic distance
- Need for taxonomic stratification in prediction models
- High predictive power at genus level suggests strong genetic influence on longevity within closely related species
PhenoSeq DepMap Analysis Component
This analysis demonstrated both the potential and current limitations of using protein sequence data to predict cancer-relevant protein functions, highlighting areas for future improvement in protein-phenotype prediction models.
Overview
The DepMap component investigated protein function in cancer through CRISPR-based knockout experiments, analyzing 9,353 proteins across 1,150 different cell lines to understand their effects on cancer cell growth.
Three Models
- Baseline Model
  - Input: Average protein embedding across all cell lines
  - Output: Average CrisprScore across all cell lines
  - Architecture: Simple feedforward network using ESM3-open-small embeddings
  - Performance: Achieved Pearson correlation of 0.55
  - Outperformed KNN baseline across all K values
- Cell-line-specific Model
  - Predicted CrisprScore effects for each protein-cell line combination
  - Performance: Achieved Pearson correlation of 0.44
  - Limited success in predicting protein-specific differences between cell lines
  - Poor correlation (r = 0.01) for individual proteins like MYC across cancer types
- PPI-informed Model
  - Incorporated protein-protein interaction networks
  - Aimed to predict CrisprScore effects by propagating signals through PPI networks
  - Results similar to cell-line-specific model
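The idea of propagating signals through a PPI network can be sketched with a single round of neighbor averaging. This is a generic illustration, not the project's architecture: the blending weight `alpha`, the toy feature vectors, and the protein names are all assumptions.

```python
def propagate(features, edges, alpha=0.5):
    """Blend each node's vector with the mean of its neighbors' vectors."""
    neighbors = {node: [] for node in features}
    for a, b in edges:  # treat interactions as undirected
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for node, vec in features.items():
        nbrs = neighbors[node]
        if not nbrs:
            updated[node] = list(vec)  # isolated proteins keep their features
            continue
        mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                for i in range(len(vec))]
        updated[node] = [(1 - alpha) * v + alpha * m
                         for v, m in zip(vec, mean)]
    return updated

# Toy per-protein features and interactions (made-up values).
features = {
    "P1": [1.0, 0.0],
    "P2": [0.0, 1.0],
    "P3": [0.0, 0.0],
    "P4": [2.0, 2.0],
}
edges = [("P1", "P2"), ("P2", "P3")]
out = propagate(features, edges)
```

Graph neural network layers generalize this step by learning how to transform and weight neighbor information rather than using a fixed average.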
Key Findings
Model Performance
- Baseline model showed strong general prediction capability
- Distance to nearest neighbors in training set affected performance
- Larger networks didn't necessarily improve performance
- Model demonstrated true learning rather than memorization
Technical Insights
- Hyperparameter sweeps showed similar training patterns across:
  - Different numbers of layers
  - Various hidden dimensions
- Model struggled with fine-grained predictions of mutation effects
Limitations
- Poor performance in predicting effects of small sequence differences
- Limited ability to distinguish between mutations of the same protein
- Challenges in cell-line-specific predictions
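A pooled correlation and a per-protein correlation measure different things, which is worth keeping in mind when comparing the overall 0.44 with the 0.01 reported for MYC. The toy example below, built on made-up numbers and hypothetical gene names, shows how a pooled Pearson correlation can be high when predictions separate proteins well, even if they carry no information about variation across cell lines within a protein. It is an illustration of the metric, not a claim about the project's data.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# (protein, cell line, observed score, predicted score); toy values only.
rows = [
    ("geneA", "c1", -1.2, -1.0), ("geneA", "c2", -0.8, -1.0),
    ("geneA", "c3", -1.0, -1.1), ("geneA", "c4", -1.0, -0.9),
    ("geneB", "c1", -0.2, 0.0), ("geneB", "c2", 0.2, 0.0),
    ("geneB", "c3", 0.0, -0.1), ("geneB", "c4", 0.0, 0.1),
]

# Pooled correlation over every protein-cell line pair.
pooled_r = pearson([obs for _, _, obs, _ in rows],
                   [pred for _, _, _, pred in rows])

# Correlation computed separately within each protein, across cell lines.
by_protein = defaultdict(lambda: ([], []))
for protein, _, obs, pred in rows:
    by_protein[protein][0].append(obs)
    by_protein[protein][1].append(pred)
per_protein_r = {p: pearson(obs, pred) for p, (obs, pred) in by_protein.items()}
```

Reporting both numbers side by side separates "which proteins matter on average" from "in which cell lines a given protein matters", the latter being the harder task described in this section.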
Technical Details
- CrisprScore distribution showed varied effects of protein deletion
- Different proteins showed distinct patterns of effect across cell lines
- Model performance was consistent across different architectural choices
Future Implications
- Need for improved mutation-specific prediction capabilities
- Potential for enhanced protein function understanding
- Opportunity for better cancer-specific protein effect prediction