# Learning protein fitness models from evolutionary and assay-labelled data

This repo is a collection of code and scripts for evaluating methods that combine evolutionary and assay-labelled data for protein fitness prediction. For more details, please see our pre-print [Combining evolutionary and assay-labelled data for protein fitness prediction](https://www.biorxiv.org/content/10.1101/2021.03.28.437402v1.abstract).

## Contents

- Repo contents
- System requirements
- Installation
- Demo
- Jackhmmer search
- Fitness data
- Density models
- Predictors

## Repo contents

There are several main components of the repo:

- `data`: Processed protein fitness data. (Only one example data set is provided here due to GitHub repo size constraints. Please download all data sets from Dryad doi:10.6078/D1K71B.)
- `alignments`: Processed multiple sequence alignments. (Only one example alignment is provided here due to GitHub repo size constraints. Please download all alignments from Dryad doi:10.6078/D1K71B.)
- `scripts`: Bash and Python scripts for data collection and data analysis.
- `src`: Python code for training and evaluating the methods assessed in the paper, including the evaluation and comparison framework for the different predictors.
- `environment.yml`: Software dependencies for the conda environment.

When running the provided scripts, the outputs are written to the following directories:

- `inference`: Intermediate files such as inferred sequence log-likelihoods.
- `results`: Results as CSV files.

## System requirements

### Hardware requirements

Some of the methods, in particular the DeepSequence VAE, UniRep mLSTM, and ESM Transformer, require a GPU for training and inference. The GPU code in this repo has been tested on an NVIDIA Quadro RTX 8000 GPU.

Evaluating all the methods, each with 20 random seeds, 19 data sets, and 10 training setups, takes a relatively long time on a single core. Our evaluation code supports multiprocessing and has been tested on 32 cores.

Storing all intermediate files for all methods and all data sets requires approximately 100 GB of disk space.

### Software requirements

The code has been tested on Ubuntu 18.04.5 LTS (Bionic Beaver) with conda 4.10.0 and Python 3.8.5. The (optional) slurm scripts have been tested on slurm 17.11.12. The software dependencies are listed in the `environment.yml` file.

## Installation

1. Create the conda environment from the `environment.yml` file:
   ```
   conda env create -f environment.yml
   ```
2. Activate the new conda environment:
   ```
   conda activate protein_fitness_prediction
   ```
3. Install the [plmc package](https://github.com/debbiemarkslab/plmc):
   ```
   cd $HOME  # or use another directory for plmc and modify scripts/plmc.sh accordingly
   git clone https://github.com/debbiemarkslab/plmc.git
   cd plmc
   make all-openmp
   ```

The installation should finish in a few minutes.

## Demo

The one-hot linear model is the simplest example, as it only requires assay-labelled data. To evaluate the one-hot linear model on the β-lactamase (BLAT_ECOLX) data with 240 training examples and 20 seeds on a single core:

```
python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 onehot --n_seeds=20 --n_threads=1 --n_train=240
```

When the program finishes, the results from the 20 runs will be available in `results/BLAT_ECOLX_Ranganathan2015-2500/results.csv`.
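To inspect these results programmatically, here is a minimal sketch that aggregates the per-run metrics across seeds. It assumes `results.csv` contains one row per run and a Spearman-correlation column; the column name `spearman` is a guess, so check the actual CSV header first:

```python
import pandas as pd

# Load the per-run results written by src/evaluate.py.
df = pd.read_csv("results/BLAT_ECOLX_Ranganathan2015-2500/results.csv")

# Inspect the actual column names before aggregating.
print(df.columns.tolist())

# "spearman" is a hypothetical column name; substitute the real metric column.
print(df["spearman"].agg(["mean", "std"]))
```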
As another example, this time involving both evolutionary and assay-labelled data, here we show the process of evaluating the augmented Potts model on the same protein. The multiple sequence alignments (MSAs) are available in the `alignments` directory for all proteins used in our assessment. For other proteins, MSAs can be retrieved by jackhmmer search (see the Jackhmmer search section).

From the MSA, first run PLMC to estimate the couplings model:

```
bash scripts/plmc.sh BLAT_ECOLX BLAT_ECOLX_Ranganathan2015-2500
```

The resulting models are saved at `inference/BLAT_ECOLX_Ranganathan2015-2500/plmc`. Then, similar to the one-hot linear model evaluation, run:

```
python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 ev+onehot --n_seeds=20 --n_threads=1 --n_train=240
```

The evaluation should finish in a few minutes, and all results will be saved to `results/BLAT_ECOLX_Ranganathan2015-2500/results.csv`. Here, `ev+onehot` refers to the augmented Potts model. Other models and data sets can be evaluated in the same way, as long as the corresponding prerequisite files are present in the `inference` directory.

## Jackhmmer search

1. Download UniRef100 in fasta format from [UniProt](https://www.uniprot.org/downloads).
2. Index the UniRef100 fasta file into ssi with
   ```
   esl-sfetch --index <fasta file>
   ```
3. Set the file location of the fasta file in `scripts/jackhmmer.sh`.
4. To run jackhmmer, use `scripts/jackhmmer.sh` to search the local fasta file. In addition to running the jackhmmer search, the script also implicitly calls the other file conversion scripts. For example, it extracts target ids from the jackhmmer tabular output by calling `scripts/tblout2ids.py`, converts the fasta output to a list of sequences with `scripts/fasta2txt.py`, and splits the sequences into train and validation sets with `scripts/randsplit.py`.
5. The outputs of the jackhmmer script will be in `jackhmmer/<dataset>/`, where each iteration's alignment is saved as `iter-<iteration>.a2m` and the final alignment is saved as `alignment.a2m`. The list of full-length target sequences is in `target_seqs.fasta` and `target_seqs.txt`.

## Fitness data

In the example data set in the `data` directory (and also for all other data sets available on Dryad), each subdirectory (e.g. `data/BLAT_ECOLX_Ranganathan2015-2500`) represents a data set of interest. Each subdirectory contains two key files:

- `wt.fasta` documents the WT sequence.
- `data.csv` contains three columns: `seq`, `log_fitness`, `n_mut`. `seq` is the mutated sequence and should have the same length as the WT sequence. `log_fitness` is the log enrichment ratio or another log-scale fitness value, where higher is better. Although referred to as `log_fitness` here, this corresponds to `fitness` in the paper. `n_mut` is the number of mutations separating the sequence from WT, where 0 indicates WT.
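As a quick illustration of this format, the following sketch loads one data set and sanity-checks the invariants described above (equal sequence lengths, and `n_mut` matching the Hamming distance to WT). It assumes `wt.fasta` is a single-record fasta file:

```python
import pandas as pd

dataset = "data/BLAT_ECOLX_Ranganathan2015-2500"

# Read the WT sequence from a single-record fasta file.
with open(f"{dataset}/wt.fasta") as f:
    wt = "".join(line.strip() for line in f if not line.startswith(">"))

df = pd.read_csv(f"{dataset}/data.csv")

# Every mutated sequence should have the same length as WT.
assert (df["seq"].str.len() == len(wt)).all()

# n_mut should match the Hamming distance to WT; 0 indicates WT itself.
hamming = df["seq"].apply(lambda s: sum(a != b for a, b in zip(s, wt)))
assert (hamming == df["n_mut"]).all()

# Higher log_fitness is better.
print(df.sort_values("log_fitness", ascending=False).head())
```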
## Density models

### Potts model

For learning a Potts model (EVmutation / plmc) from an MSA, see `scripts/plmc.sh`. The resulting couplings model files (saved to the `inference` directory) can be directly parsed by the corresponding `ev` and `ev+onehot` predictors.

### DeepSequence VAE

1. Install the [DeepSequence package](https://github.com/debbiemarkslab/DeepSequence.git).
2. Set the DeepSequence package directory as `WORKING_DIR` in both `src/train_vae.py` and `src/inference_vae.py`.
3. Use `scripts/train_vae.sh` to train a VAE model from an MSA.
4. For retrieving ELBOs from VAEs, see `scripts/inference_vae.sh`.
5. The saved ELBO files in the `inference` directory can be parsed by the corresponding `vae` and `vae+onehot` predictors.

### ESM

1. Follow the instructions in the [ESM repo](https://github.com/facebookresearch/esm.git) to download the pre-trained model weights.
2. Put the location of the downloaded pre-trained weights into `scripts/inference_esm.sh`.
3. To retrieve ESM Transformer approximate pseudo-log-likelihoods for sequences in a fasta file, see `scripts/inference_esm.sh`. The results will be in the `inference` directory and can be used by the `esm` and `esm+onehot` predictors.

### UniRep

1. Download the pre-trained UniRep weights (1900-unit) from the [UniRep repo](https://github.com/churchlab/UniRep#obtaining-weight-files).
2. Put the location of the downloaded weights into `scripts/evotune_unirep.sh`.
3. Use the `scripts/evotune_unirep.sh` script to evotune the UniRep model with an MSA file as `seqspath`.
4. Use `scripts/inference_unirep.sh` to calculate log-likelihoods from an evotuned UniRep model.

## Predictors

Each type of predictor is represented by a Python class in `src/predictors`. A predictor class represents a prediction strategy for protein fitness that depends on evolutionary data, assay-labelled data, or both. The base predictor class, `BasePredictor`, is defined in `src/predictors/base_predictors.py`. All predictor classes inherit from this class and implement the `train` and `predict` methods. The `JointPredictor` class is a meta-predictor that combines the features from multiple existing predictor classes, and can be specified simply via the sub-predictor names. See `src/predictors/__init__.py` for a full list of implemented predictors.
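For orientation, here is a minimal sketch of what a predictor following this interface might look like, in the spirit of the one-hot baseline. The constructor arguments and exact method signatures are illustrative assumptions, not the repo's actual API; see `src/predictors/base_predictors.py` for the real definitions:

```python
import numpy as np
from sklearn.linear_model import Ridge

class OnehotRidgePredictor:
    """Hypothetical predictor: ridge regression on one-hot encoded sequences.

    Illustrative only; the real predictor classes live in src/predictors
    and may use different signatures.
    """

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

    def __init__(self, alpha=1.0):
        self.model = Ridge(alpha=alpha)

    def _encode(self, seqs):
        # Flatten a (position, amino acid) one-hot encoding per sequence.
        aa_to_idx = {aa: i for i, aa in enumerate(self.ALPHABET)}
        n_aa = len(self.ALPHABET)
        X = np.zeros((len(seqs), len(seqs[0]) * n_aa))
        for i, seq in enumerate(seqs):
            for pos, aa in enumerate(seq):
                if aa in aa_to_idx:
                    X[i, pos * n_aa + aa_to_idx[aa]] = 1.0
        return X

    def train(self, seqs, labels):
        self.model.fit(self._encode(seqs), labels)

    def predict(self, seqs):
        return self.model.predict(self._encode(seqs))
```

A joint predictor in the style of `JointPredictor` would concatenate feature blocks from several such sub-predictors (e.g. one-hot features plus a density model's log-likelihood as an extra column) before fitting a single regression.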