# PreMode
This is the repository for our manuscript "PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context", posted on bioRxiv.
# Data
Please use Git LFS to download all files in the `data.files/` folder.
Unzip the files with this script:
Unfortunately, we are not allowed to share the HGMD data, so we removed all pathogenic variants from HGMD (49218 in total) from the `data.files/pretrain/training.*` files. This might affect the plots in `analysis/figs/fig.sup.12.pdf` and `analysis/figs/fig.sup.13.pdf` if you re-run the R code in the `analysis/` folder.
Instead, we share the trained weights of our models that were trained using HGMD.
# Install Packages
Please install the necessary packages using
mamba env create -f PreMode.yaml
mamba env create -f r4-base.yaml
You can check the installation by running
conda activate PreMode
python --conf scripts/TEST.yaml --mode train
If no error occurs, the installation was successful.
# New Experiment
## Start from scratch and use our G/LoF datasets
1. Please prepare a folder under `scripts/` and create a file named `pretrain.seed.0.yaml` inside it. See `scripts/PreMode/pretrain.seed.0.yaml` for an example.
2. Run pretraining on the pathogenicity task:
python --conf scripts/NEW_FOLDER/pretrain.seed.0.yaml
3. Prepare transfer learning config files:
bash scripts/ scripts/NEW_FOLDER/
4. Run transfer learning:
bash scripts/ scripts/NEW_FOLDER TASK_NAME GPU_ID
If you have multiple tasks, separate them with commas in TASK_NAME, like "task_1,task_2,task_3".
5. (Optional) To reuse the transfer learning tasks in our paper using 8 GPU cards, just do
bash scripts/NEW_FOLDER
If you only have one GPU card, then do
bash scripts/NEW_FOLDER
6. Save inference results:
bash scripts/ scripts/NEW_FOLDER analysis/NEW_FOLDER TASK_NAME GPU_ID
7. You'll get a folder `analysis/NEW_FOLDER/TASK_NAME` with 5 `.csv` files. Each file has 4 columns, `logits.FOLD.[0-3]`, and each column represents the G/LoF prediction from one cross-validation fold (closer to 0 means more likely GoF, closer to 1 means more likely LoF). We suggest averaging the predictions across the 4 columns.
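The averaging suggested in step 7 can be sketched as below. This is only an illustration using the Python standard library, not code from this repository: it assumes the four columns are literally named `logits.FOLD.0` to `logits.FOLD.3`, and the numbers in the example table are made up.

```python
import csv
import io

# Assumption: the four fold columns carry these literal names.
FOLD_COLUMNS = [f"logits.FOLD.{i}" for i in range(4)]


def average_fold_logits(rows, columns=FOLD_COLUMNS):
    """Return the mean of the per-fold predictions for each row."""
    return [sum(float(row[c]) for c in columns) / len(columns) for row in rows]


# Made-up values standing in for one of the 5 output .csv files.
example_csv = io.StringIO(
    "logits.FOLD.0,logits.FOLD.1,logits.FOLD.2,logits.FOLD.3\n"
    "0.10,0.20,0.30,0.40\n"
    "0.90,0.80,1.00,0.70\n"
)
rows = list(csv.DictReader(example_csv))
averaged = average_fold_logits(rows)
```

To use it on a real output file, replace `example_csv` with an open file handle; if your column names differ, pass them through the `columns` argument.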
## Only transfer learning, user defined mode-of-action datasets
1. Prepare a `.csv` file for training and inference. There are two accepted formats:
+ Format 1 (only for missense variants):
| uniprotID | aaChg | score | ENST |
| :-: | :-: | :-: | :-: |
| P15056 | p.V600E | 1 | ENST00000646891 |
| P15056 | p.G446V | -1 | ENST00000646891 |
+ `uniprotID`: the uniprot ID of the protein.
+ `aaChg`: the amino acid change induced by missense variant.
+ `score`: 1 for GoF, -1 for LoF. For inference it is not required. For DMS, this could be experimental readouts. If you have multiplexed assays, you can change it to `score.1, score.2, score.3, ..., score.N`.
+ `ENST` (optional): the Ensembl transcript ID that matches the uniprotID.
+ Format 2 (can be missense variant or multiple variants):
| uniprotID | ref | alt | pos.orig | score | ENST | wt.orig | sequence.len.orig |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| P15056 | V | E | 600 | 1 | ENST00000646891 | ... | 766 |
| P15056 | G | V | 446 | -1 | ENST00000646891 | ... | 766 |
| P15056 | G;V | V;F | 446;471 | -1 | ENST00000646891 | ... | 766 |
+ `uniprotID`: the uniprot ID of the protein.
+ `ref`: the reference amino acid; for multiple variants, separated by ";".
+ `alt`: the alternative amino acid; for multiple variants, separated by ";" in the same order as `ref`.
+ `pos.orig`: the position of the amino acid change; for multiple variants, separated by ";" in the same order as `ref`.
+ `score`: same as above.
+ `ENST` (optional): same as above.
+ `wt.orig`: the wild type protein sequence, in the uniprot format.
+ `sequence.len.orig`: the wild type protein sequence length.
+ If you prepared your input in Format 1, please run
bash parse.input.table/ YOUR_FILE TRANSFORMED_FILE
to transform it to Format 2. Note that it will drop lines whose `aaChg` doesn't match the corresponding AlphaFold sequence.
2. Prepare a config file for training the model and inference.
+ `PRETRAIN_MODEL_NAME` could be one of the following:
+ `scripts/PreMode`: Default PreMode
+ `scripts/PreMode.ptm`: PreMode + ptm as input
+ `scripts/PreMode.noStructure`: PreMode without structure input
+ `scripts/PreMode.noESM`: PreMode, replaced ESM2 input with one-hot encodings of 20 AAs.
+ `scripts/PreMode.noMSA`: PreMode without MSA input
+ `scripts/ESM.SLP`: ESM embedding + Single Layer Perceptron
+ `YOUR_TASK_NAME` can be any name you like.
+ `YOUR_TRAINING_FILE` is the training file you prepared in step 1.
+ `YOUR_INFERENCE_FILE` is the inference file you prepared in step 1.
+ `TASK_TYPE` could be `DMS` or `GLOF`.
+ `MODE_OF_ACTION_N` is the number of mode-of-action dimensions. For `GLOF` this is usually 1. For a multiplexed `DMS` dataset, this could be the number of biochemical properties measured. Note that if it is larger than 1, you have to make sure the `score` column in step 1 is replaced by `score.1, score.2, ..., score.N` correspondingly.
3. Run your config file
conda activate PreMode
This should take ~30 min on an NVIDIA A40 GPU, depending on the size of your dataset.
4. You'll get a file in the `OUTPUT_FOLDER` named as `YOUR_TASK_NAME.inference.result.csv`.
+ If your `TASK_TYPE` is `GLOF`, the column `logits` holds the inference results. Closer to 0 means GoF, closer to 1 means LoF.
+ If your `TASK_TYPE` is `DMS` and `MODE_OF_ACTION_N` is 1, the column `logits` holds the inference results. If your `MODE_OF_ACTION_N` is larger than 1, you will get multiple `logits.*` columns, each representing one predicted DMS measurement.
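As an illustration of how the two input formats in step 1 relate, the sketch below splits a Format 1 `aaChg` value into the `ref`, `pos.orig` and `alt` fields of Format 2, and expands a ";"-separated multi-variant row. It is not the repository's conversion script. The function names are hypothetical, positions are assumed to be 1-based, and the wild-type check is only a guess at the kind of mismatch that causes the conversion to drop a line.

```python
import re

# Matches single-letter missense notation such as "p.V600E".
AA_CHANGE = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def parse_aa_change(aa_chg):
    """Split a string such as 'p.V600E' into (ref, pos, alt)."""
    match = AA_CHANGE.match(aa_chg)
    if match is None:
        raise ValueError(f"unrecognised aaChg: {aa_chg!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def split_multi_variant(ref, alt, pos):
    """Expand ';'-separated Format 2 fields into (ref, pos, alt) tuples."""
    refs, alts, positions = ref.split(";"), alt.split(";"), pos.split(";")
    if not (len(refs) == len(alts) == len(positions)):
        raise ValueError("ref, alt and pos.orig must list the same number of variants")
    return [(r, int(p), a) for r, a, p in zip(refs, alts, positions)]


def matches_wild_type(sequence, ref, pos):
    """Check that residue `ref` sits at 1-based position `pos` (assumed convention)."""
    return 0 < pos <= len(sequence) and sequence[pos - 1] == ref
```

For example, `parse_aa_change("p.V600E")` yields `("V", 600, "E")`, and `split_multi_variant("G;V", "V;F", "446;471")` yields one tuple per substitution in the order given.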
# Models & Figures in our manuscript
## Pretrained Models
Here is the list of models in our manuscript:
`scripts/PreMode/` PreMode. Training takes 250 GB of RAM and 4 NVIDIA A40 GPUs, and finishes in ~50 h.
`scripts/ESM.SLR/` Baseline Model, ESM2 (650M) + Single Layer Perceptron
`scripts/PreMode.large.window/` PreMode, window size set to 1251 AA.
`scripts/PreMode.noESM/` PreMode, with the ESM2 embeddings replaced by one-hot encodings of the 20 AAs.
`scripts/PreMode.noMSA/` PreMode, remove the MSA input.
`scripts/PreMode.noPretrain/` PreMode, without pretraining on ClinVar/HGMD.
`scripts/PreMode.noStructure/` PreMode, remove the AF2 predicted structure input.
`scripts/PreMode.ptm/` PreMode, with one-hot encodings of post-translational modification sites added as input.
`scripts/PreMode.mean.var/` PreMode, it will output both predicted value (mean) and confidence (var), used in adaptive learning tasks.
## Predicted mode-of-action
| gene | file |
| :-: | :-: |
| BRAF | `analysis/5genes.all.mut/PreMode/P15056.logits.csv` |
| RET | `analysis/5genes.all.mut/PreMode/P07949.logits.csv` |
| TP53 | `analysis/5genes.all.mut/PreMode/P04637.logits.csv` |
| KCNJ11 | `analysis/5genes.all.mut/PreMode/Q14654.logits.csv` |
| CACNA1A | `analysis/5genes.all.mut/PreMode/O00555.logits.csv` |
| SCN5A | `analysis/5genes.all.mut/PreMode/Q14524.logits.csv` |
| SCN2A | `analysis/5genes.all.mut/PreMode/Q99250.logits.csv` |
| ABCC8 | `analysis/5genes.all.mut/PreMode/Q09428.logits.csv` |
| PTEN | `analysis/5genes.all.mut/PreMode/P60484.logits.csv` |
For each file, column `logits.0` is the predicted pathogenicity, column `logits.1` is the predicted LoF probability, and column `logits.2` is the predicted GoF probability.
For PTEN, column `logits.1` is the predicted stability (0 is loss, 1 is neutral) and column `logits.2` is the predicted enzyme activity (0 is loss, 1 is neutral).
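The column meanings above can be made explicit when loading a file, as in the following sketch. The descriptive names and the helper `label_columns` are hypothetical, the example values are made up, and it assumes that `logits.0` keeps its pathogenicity meaning for PTEN.

```python
import csv
import io

# Column meanings as described above; the descriptive names are our own.
DEFAULT_COLUMNS = {
    "logits.0": "pathogenicity",
    "logits.1": "lof_probability",
    "logits.2": "gof_probability",
}
# PTEN (P60484): logits.1 and logits.2 carry different meanings.
# Assumption: logits.0 is still the predicted pathogenicity.
PTEN_COLUMNS = {
    "logits.0": "pathogenicity",
    "logits.1": "stability",
    "logits.2": "enzyme_activity",
}


def label_columns(row, uniprot_id):
    """Return one row's predictions keyed by descriptive names."""
    mapping = PTEN_COLUMNS if uniprot_id == "P60484" else DEFAULT_COLUMNS
    return {name: float(row[col]) for col, name in mapping.items()}


# Made-up values standing in for one line of a `*.logits.csv` file.
example_csv = io.StringIO("logits.0,logits.1,logits.2\n0.9,0.8,0.1\n")
row = next(csv.DictReader(example_csv))
braf_like = label_columns(row, "P15056")
pten_like = label_columns(row, "P60484")
```

Keeping the mapping in one place avoids accidentally reading the PTEN columns as LoF/GoF probabilities.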
## Figures
Please go to `analysis/` folder and run the corresponding R scripts.