# PreMode
This is the repository for our manuscript "PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context" posted on bioRxiv: https://www.biorxiv.org/content/10.1101/2024.02.20.581321v2

# Data
Please use Git LFS to download all files in the `data.files/` folder.

Unzip the files with this script: 
```
bash unzip.files.sh
```

Unfortunately, we are not allowed to share the HGMD data, so we removed all pathogenic variants that came from HGMD (49218 in total) from the `data.files/pretrain/training.*` files. This may change the plots in `analysis/figs/fig.sup.12.pdf` and `analysis/figs/fig.sup.13.pdf` if you re-run the R code in the `analysis/` folder.

Instead, we share the trained weights of our models, which were trained with the HGMD data included.

# Install Packages
Please install the necessary packages using 
```
mamba env create -f PreMode.yaml
mamba env create -f r4-base.yaml
```

You can check the installation by running
```
conda activate PreMode
python train.py --conf scripts/TEST.yaml --mode train
```
If no error occurs, the installation was successful.

# New Experiment 
## Start from scratch and use our G/LoF datasets
1.  Create a new folder under `scripts/` and add a file named `pretrain.seed.0.yaml` to it; see `scripts/PreMode/pretrain.seed.0.yaml` for an example.
2.  Run pretraining on the pathogenicity task:
    ```
    python train.py --conf scripts/NEW_FOLDER/pretrain.seed.0.yaml
    ```
3.  Prepare transfer learning config files: 
    ```
    bash scripts/DMS.prepare.yaml.sh scripts/NEW_FOLDER/
    ```
4.  Run transfer learning: 
    ```
    bash scripts/DMS.5fold.run.sh scripts/NEW_FOLDER TASK_NAME GPU_ID
    ```
    If you have multiple tasks, separate them with commas in `TASK_NAME`, like "task_1,task_2,task_3".
5.  (Optional) To rerun the transfer learning tasks from our paper on 8 GPU cards, run
    ```
    bash transfer.all.sh scripts/NEW_FOLDER
    ```
    If you only have one GPU card, run
    ```
    bash transfer.all.in.one.card.sh scripts/NEW_FOLDER
    ```
6.  Save inference results: 
    ```
    bash scripts/DMS.5fold.inference.sh scripts/NEW_FOLDER analysis/NEW_FOLDER TASK_NAME GPU_ID
    ```
7.  You'll get a folder `analysis/NEW_FOLDER/TASK_NAME` with 5 `.csv` files. Each file has 4 columns, `logits.FOLD.[0-3]`, and each column holds the G/LoF predictions from one cross-validation run (closer to 0 means more likely GoF, closer to 1 means more likely LoF). We suggest averaging the predictions across the 4 columns.
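
The averaging suggested in step 7 can be sketched as follows. This is a minimal, standard-library-only example, not part of the repository: it assumes the four columns are literally named `logits.FOLD.0` to `logits.FOLD.3`, and the `aaChg` column and all values in the toy table are made up for illustration.

```python
import csv
import io


def average_fold_logits(rows, prefix="logits.FOLD."):
    """Average, per row, every column whose name starts with `prefix`."""
    averaged = []
    for row in rows:
        values = [
            float(value)
            for name, value in row.items()
            if name.startswith(prefix) and value not in ("", "NA")
        ]
        # None marks a row with no usable prediction.
        averaged.append(sum(values) / len(values) if values else None)
    return averaged


# Toy table standing in for one of the five output files (hypothetical values).
toy_csv = io.StringIO(
    "aaChg,logits.FOLD.0,logits.FOLD.1,logits.FOLD.2,logits.FOLD.3\n"
    "p.A1V,0.10,0.20,0.10,0.20\n"
    "p.A2V,0.90,0.80,0.70,0.80\n"
)
rows = list(csv.DictReader(toy_csv))
means = average_fold_logits(rows)
for row, mean in zip(rows, means):
    print(row["aaChg"], round(mean, 3))
```

To use it on a real output file, replace `toy_csv` with `open(path, newline="")` and adjust `prefix` if your column names differ.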

## Only transfer learning, user defined mode-of-action datasets
1.  Prepare a `.csv` file for training and inference. Two formats are accepted:
    + Format 1 (only for missense variants):
        | uniprotID | aaChg  | score | ENST |
        | :-: | :-: | :-: | :-: |
        | P15056 | p.V600E | 1 | ENST00000646891 |
        | P15056 | p.G446V | -1 | ENST00000646891 |
      + `uniprotID`: the uniprot ID of the protein.
      + `aaChg`: the amino acid change induced by missense variant.
      + `score`: 1 for GoF, -1 for LoF. For inference it is not required. For DMS, this could be experimental readouts. If you have multiplexed assays, you can change it to `score.1, score.2, score.3, ..., score.N`.
      + `ENST` (optional): the Ensembl transcript ID that matches the uniprotID.
    + Format 2 (can be missense variant or multiple variants):
        | uniprotID | ref | alt | pos.orig | score | ENST | wt.orig | sequence.len.orig |
        | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
        | P15056 | V | E | 600 | 1 | ENST00000646891 | ... | 766 |
        | P15056 | G | V | 446 | -1 | ENST00000646891 | ... | 766 |
        | P15056 | G;V | V;F | 446;471 | -1 | ENST00000646891 | ... | 766 |
      + `uniprotID`: the uniprot ID of the protein.
      + `ref`: the reference amino acid; for multiple variants, separate the entries with ";".
      + `alt`: the alternative amino acid; for multiple variants, separate the entries with ";" in the same order as `ref`.
      + `pos.orig`: the position of the amino acid change; for multiple variants, separate the entries with ";" in the same order as `ref`.
      + `score`: same as above.
      + `ENST` (optional): same as above.
      + `wt.orig`: the wild type protein sequence, in the uniprot format.
      + `sequence.len.orig`: the wild type protein sequence length.

    + If you prepared your input in Format 1, please run 
        ```
        bash parse.input.table/parse.input.table.sh YOUR_FILE TRANSFORMED_FILE
        ``` 
      to transform it to Format 2. Note that lines whose `aaChg` does not match the corresponding AlphaFold sequence will be dropped.
2.  Prepare a config file for training the model and inference. 
    ```
    bash scripts/prepare.new.task.yaml.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME YOUR_TRAINING_FILE YOUR_INFERENCE_FILE TASK_TYPE MODE_OF_ACTION_N
    ```
    + `PRETRAIN_MODEL_NAME` could be one of the following:
      + `scripts/PreMode`: Default PreMode
      + `scripts/PreMode.ptm`: PreMode + ptm as input
      + `scripts/PreMode.noStructure`: PreMode without structure input
      + `scripts/PreMode.noESM`: PreMode with the ESM2 input replaced by one-hot encodings of the 20 AAs.
      + `scripts/PreMode.noMSA`: PreMode without MSA input
      + `scripts/ESM.SLP`: ESM embedding + Single Layer Perceptron
    + `YOUR_TASK_NAME` can be any name you like.
    + `YOUR_TRAINING_FILE` is the training file you prepared in step 1.
    + `YOUR_INFERENCE_FILE` is the inference file you prepared in step 1.
    + `TASK_TYPE` could be `DMS` or `GLOF`.
    + `MODE_OF_ACTION_N` is the number of mode-of-action dimensions. For `GLOF` this is usually 1. For a multiplexed `DMS` dataset, this could be the number of biochemical properties measured. Note that if it is larger than 1, you have to make sure the `score` column in step 1 is replaced with `score.1, score.2, ..., score.N` accordingly.
  
3.  Run your config file
    ```
    conda activate PreMode
    bash scripts/run.new.task.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME OUTPUT_FOLDER GPU_ID 
    ```
    This should take about 30 min on an NVIDIA A40 GPU, depending on the size of your dataset.

4.  You'll get a file in `OUTPUT_FOLDER` named `YOUR_TASK_NAME.inference.result.csv`.
    + If your `TASK_TYPE` is `GLOF`, the column `logits` holds the inference results. Closer to 0 means GoF, closer to 1 means LoF.
    + If your `TASK_TYPE` is `DMS` and `MODE_OF_ACTION_N` is 1, the column `logits` holds the inference results. If your `MODE_OF_ACTION_N` is larger than 1, you will get multiple `logits.*` columns, each representing one predicted DMS measurement.
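
Before step 2, it can help to sanity-check a Format 2 table from step 1. The sketch below is illustrative only and is not the repository's parser: it assumes `pos.orig` is 1-based, and the `TOY001` ID and the 5-residue sequence are made-up placeholders. It checks that `ref`, `alt` and `pos.orig` have the same number of ";"-separated entries, that `sequence.len.orig` matches `wt.orig`, and that each `ref` residue agrees with `wt.orig` at its position.

```python
import csv
import io


def check_format2_row(row):
    """Return a list of problems in one Format 2 row (empty list means OK)."""
    refs = row["ref"].split(";")
    alts = row["alt"].split(";")
    positions = [int(p) for p in row["pos.orig"].split(";")]
    seq = row["wt.orig"]
    if not (len(refs) == len(alts) == len(positions)):
        return ["ref, alt and pos.orig have different numbers of entries"]
    problems = []
    if len(seq) != int(row["sequence.len.orig"]):
        problems.append("sequence.len.orig does not match len(wt.orig)")
    for ref, pos in zip(refs, positions):
        # Assumption: positions are 1-based, as in p.V600E.
        if not 1 <= pos <= len(seq):
            problems.append(f"position {pos} is outside the sequence")
        elif seq[pos - 1] != ref:
            problems.append(f"ref {ref} != wt.orig[{pos}] ({seq[pos - 1]})")
    return problems


# Hypothetical toy table; the protein and its sequence are not real.
toy_csv = io.StringIO(
    "uniprotID,ref,alt,pos.orig,score,wt.orig,sequence.len.orig\n"
    "TOY001,V,E,3,1,MKVGA,5\n"
    "TOY001,G;V,V;F,4;3,-1,MKVGA,5\n"
    "TOY001,A,T,2,-1,MKVGA,5\n"
)
rows = list(csv.DictReader(toy_csv))
results = [check_format2_row(row) for row in rows]
for row, problems in zip(rows, results):
    print(row["ref"], row["pos.orig"], problems or "OK")
```

The third toy row is flagged because position 2 of the toy sequence is `K`, not `A`.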


# Models & Figures in our manuscript
## Pretrained Models
Here is the list of models in our manuscript:

`scripts/PreMode/` PreMode. It takes 250 GB of RAM and 4 NVIDIA A40 GPUs to run, and finishes in ~50 h.

`scripts/ESM.SLR/` Baseline Model, ESM2 (650M) + Single Layer Perceptron

`scripts/PreMode.large.window/` PreMode, window size set to 1251 AA.

`scripts/PreMode.noESM/`  PreMode, with the ESM2 embeddings replaced by one-hot encodings of the 20 AAs.

`scripts/PreMode.noMSA/`  PreMode, remove the MSA input.

`scripts/PreMode.noPretrain/` PreMode, without pretraining on ClinVar/HGMD.

`scripts/PreMode.noStructure/` PreMode, remove the AF2 predicted structure input.

`scripts/PreMode.ptm/` PreMode, with one-hot encodings of post-translational modification sites added as input.

`scripts/PreMode.mean.var/` PreMode, which outputs both a predicted value (mean) and a confidence (var); used in adaptive learning tasks.

## Predicted mode-of-action
| gene | file | 
| :-: | :-: |
| BRAF | `analysis/5genes.all.mut/PreMode/P15056.logits.csv` |
| RET | `analysis/5genes.all.mut/PreMode/P07949.logits.csv` |
| TP53 | `analysis/5genes.all.mut/PreMode/P04637.logits.csv` |
| KCNJ11 | `analysis/5genes.all.mut/PreMode/Q14654.logits.csv` |
| CACNA1A | `analysis/5genes.all.mut/PreMode/O00555.logits.csv` |
| SCN5A | `analysis/5genes.all.mut/PreMode/Q14524.logits.csv` |
| SCN2A | `analysis/5genes.all.mut/PreMode/Q99250.logits.csv` |
| ABCC8 | `analysis/5genes.all.mut/PreMode/Q09428.logits.csv` |
| PTEN | `analysis/5genes.all.mut/PreMode/P60484.logits.csv` |

For each file, column `logits.0` is the predicted pathogenicity, `logits.1` is the predicted LoF probability, and `logits.2` is the predicted GoF probability.
For PTEN, `logits.1` is the predicted stability and `logits.2` is the predicted enzyme activity; for both, 0 means loss and 1 means neutral.
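
As a rough illustration of reading one of these files (not PTEN, whose columns mean something different), the sketch below maps the three columns to descriptive names and notes which of the LoF and GoF scores is larger. The `aaChg` column, the toy values, and the direct comparison of the two scores are assumptions made for this example, not something the repository prescribes.

```python
import csv
import io

# Column meanings as described above (does not apply to the PTEN file).
COLUMN_MEANING = {"logits.0": "pathogenicity", "logits.1": "LoF", "logits.2": "GoF"}


def summarize(row):
    """Rename the logits columns and note which mode-of-action score is larger."""
    scores = {name: float(row[col]) for col, name in COLUMN_MEANING.items()}
    # Assumption for illustration: compare the two scores directly.
    scores["higher_score"] = "LoF" if scores["LoF"] >= scores["GoF"] else "GoF"
    return scores


# Hypothetical toy rows; a real file may contain additional columns.
toy_csv = io.StringIO(
    "aaChg,logits.0,logits.1,logits.2\n"
    "p.A1V,0.90,0.75,0.25\n"
    "p.A2V,0.20,0.25,0.50\n"
)
summaries = [summarize(row) for row in csv.DictReader(toy_csv)]
for summary in summaries:
    print(summary)
```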

## Figures 
Please go to `analysis/` folder and run the corresponding R scripts.