adamyhe committed
Commit 34a896b
1 Parent(s): a606189

Update README.md
Files changed (1): README.md (+133, -3)

---
license: mit
language:
- en
---

# CLIPNET

CLIPNET (Convolutionally Learned, Initiation-Predicting NETwork) is an ensembled convolutional neural network that predicts transcription initiation from DNA sequence at single nucleotide resolution. We describe CLIPNET in our [preprint](https://www.biorxiv.org/content/10.1101/2024.03.13.583868) on bioRxiv. This repository contains code for working with CLIPNET, namely for generating predictions and feature interpretations and performing *in silico* mutagenesis scans. To reproduce the figures in our paper, please see the [clipnet_paper GitHub repo](https://github.com/Danko-Lab/clipnet_paper/).

## Installation

To install CLIPNET, first clone the GitHub repository:

```bash
git clone https://github.com/Danko-Lab/clipnet.git
cd clipnet
```

Then, install dependencies using pip. We recommend creating an isolated environment for working with CLIPNET. For example, with conda/mamba:

```bash
mamba create -n clipnet -c conda-forge gcc~=12.1 python=3.9
mamba activate clipnet
pip install -r requirements.txt # requirements_cpu.txt if no GPU
```

You may need to configure your CUDA/cudatoolkit/cudnn paths to get GPU support working. See the [tensorflow documentation](https://www.tensorflow.org/install/gpu) for more information.

## Download models

Pretrained CLIPNET models are available on [Zenodo](https://zenodo.org/doi/10.5281/zenodo.10408622). Download the models into the `ensemble_models` directory:

```bash
for fold in {1..9};
do wget https://zenodo.org/records/10408623/files/fold_${fold}.h5 -P ensemble_models/;
done
```

Alternatively, they can be accessed via [HuggingFace](https://huggingface.co/adamyhe/clipnet).
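
If you prefer to fetch the weights programmatically, something along these lines should work with `huggingface_hub` (a sketch: we assume the HuggingFace repo stores the same `fold_*.h5` filenames as the Zenodo record):

```python
# Sketch: download ensemble weights from HuggingFace.
# Assumes the repo mirrors the Zenodo fold_*.h5 filenames.
from huggingface_hub import hf_hub_download

for fold in range(1, 10):
    hf_hub_download(
        repo_id="adamyhe/clipnet",
        filename=f"fold_{fold}.h5",
        local_dir="ensemble_models",
    )
```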

## Usage

### Input data

CLIPNET was trained on a [population-scale PRO-cap dataset](http://dx.doi.org/10.1038/s41467-020-19829-z) derived from human lymphoblastoid cell lines, matched with individualized genome sequences (1kGP). CLIPNET accepts 1000 bp sequences as input and imputes PRO-cap coverage (RPM) in the center 500 bp.

CLIPNET can work on either haploid reference sequences (e.g. hg38) or individualized sequences (e.g. 1kGP). When constructing individualized sequences, we made two major simplifications: (1) we considered only SNPs and (2) we used unphased SNP genotypes.

We encode sequences using a "two-hot" encoding. That is, we encode each individual nucleotide at a given position using a one-hot encoding scheme, then represent the unphased diploid sequence as the sum of the two one-hot encoded nucleotides at each position. The sequence "AYCR", for example, would be encoded as: `[[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]]`.
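
As a concrete illustration, a minimal two-hot encoder might look like the sketch below (this is not the repository's own implementation; the helper `utils.twohot_fasta` described under API usage does this for fasta files):

```python
import numpy as np

# Minimal two-hot encoder for illustration only. IUPAC ambiguity codes map to
# their two underlying alleles; nucleotide order is A, C, G, T.
ALLELES = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def twohot(seq):
    arr = np.zeros((len(seq), 4), dtype=int)
    for i, nt in enumerate(seq.upper()):
        for allele in ALLELES[nt]:
            arr[i, BASE_INDEX[allele]] += 1
    return arr

print(twohot("AYCR"))  # [[2 0 0 0] [0 1 0 1] [0 2 0 0] [1 0 1 0]]
```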

### Command line interface

#### Predictions

To generate predictions using the ensembled model, use the `predict_ensemble.py` script (the `predict_individual_model.py` script can be used to generate predictions with individual model folds). This script takes a fasta file containing 1000 bp records and outputs an hdf5 file containing the predictions for each record. For example:

```bash
python predict_ensemble.py data/test.fa data/test_predictions.h5 --gpu
# Use the --gpu flag to run on GPU
```

To input individualized sequences, heterozygous positions should be represented using the IUPAC ambiguity codes R (A/G), Y (C/T), S (C/G), W (A/T), K (G/T), M (A/C).

The output hdf5 file will contain two datasets: "track" and "quantity". The track output of the model is a length-1000 vector (500 plus-strand positions concatenated with 500 minus-strand positions) representing the predicted base-resolution profile/shape of initiation. The quantity output represents the total PRO-cap quantity on both strands.

We note that the track node was not optimized for quantity prediction. As a result, the sum of the track node is not well correlated with the quantity prediction and is not a good predictor of the total quantity of initiation. We therefore recommend rescaling the track predictions to sum to the quantity prediction. For example:

```python
import h5py
import numpy as np

with h5py.File("data/test_predictions.h5", "r") as f:
    profile = f["track"][:]
    quantity = f["quantity"][:]
    profile_scaled = (profile / np.sum(profile, axis=1)[:, None]) * quantity
```

#### Feature interpretations

CLIPNET uses DeepSHAP to generate feature interpretations. To generate feature interpretations, use the `calculate_deepshap.py` script. This script takes a fasta file containing 1000 bp records and outputs two npz files containing: (1) feature interpretations for each record and (2) the one-hot encoded sequences. It supports two modes that can be set with `--mode`: "profile" and "quantity". The "profile" mode calculates interpretations for the profile node of the model (using the profile metric proposed in BPNet), while the "quantity" mode calculates interpretations for the quantity node of the model.

```bash
python calculate_deepshap.py \
    data/test.fa \
    data/test_deepshap_quantity.npz \
    data/test_onehot.npz \
    --mode quantity \
    --gpu

python calculate_deepshap.py \
    data/test.fa \
    data/test_deepshap_profile.npz \
    data/test_onehot.npz \
    --mode profile \
    --gpu
```

Note that CLIPNET generally accepts two-hot encoded sequences as input, with the array structured as (# sequences, 1000, 4). However, feature interpretations are much easier to perform on a haploid/fully homozygous genome, so we recommend doing interpretations on the reference genome sequence. tfmodisco-lite also expects contribution scores and sequence arrays to be length-last, i.e., (# sequences, 4, 1000), with the sequence array being one-hot. To accommodate these requirements, `calculate_deepshap.py` will automatically convert the input sequence array to length-last and one-hot encoded, and will also write the output contribution scores as length-last. Also note that these are actual contribution scores, as opposed to hypothetical contribution scores; specifically, contributions at non-reference nucleotides are set to zero. The outputs of this script can be used as inputs to tfmodisco-lite to generate motif logos and motif tracks.

Both DeepSHAP and tfmodisco-lite computations are quite slow when performed on a large number of sequences, so we recommend (a) running DeepSHAP on a GPU using the `--gpu` flag and (b), if you have access to multiple GPUs, calculating DeepSHAP scores for the model folds in parallel using the `--model_fp` flag and then averaging them. We also provide precomputed DeepSHAP scores and TF-MoDISco results for a genome-wide set of PRO-cap peaks called in the LCL dataset on [Zenodo](https://zenodo.org/records/10597358).
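
For instance, per-fold scores could be averaged along the lines of the sketch below (the per-fold file names are hypothetical, and we assume each npz stores a single array under numpy's default `arr_0` key):

```python
import numpy as np

# Sketch: average DeepSHAP contribution scores computed separately per model fold.
# File names are hypothetical; each npz is assumed to hold one length-last
# (# sequences, 4, 1000) array under the default "arr_0" key.
per_fold = [
    np.load(f"data/test_deepshap_quantity_fold_{fold}.npz")["arr_0"]
    for fold in range(1, 10)
]
mean_scores = np.mean(per_fold, axis=0)
np.savez("data/test_deepshap_quantity_mean.npz", mean_scores)
```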

#### Genomic *in silico* mutagenesis scans

To generate genomic *in silico* mutagenesis scans, use the `calculate_ism_shuffle.py` script. This script takes a fasta file containing 1000 bp records and outputs an npz file containing the ISM shuffle results ("corr_ism_shuffle" and "log_quantity_ism_shuffle") for each record. For example:

```bash
python calculate_ism_shuffle.py data/test.fa data/test_ism.npz --gpu
```
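
The results can then be read back with numpy; a minimal sketch (key names taken from the description above, array shapes not asserted):

```python
import numpy as np

# Sketch: inspect ISM shuffle output; key names follow the description above.
ism = np.load("data/test_ism.npz")
corr = ism["corr_ism_shuffle"]
log_quantity = ism["log_quantity_ism_shuffle"]
print(corr.shape, log_quantity.shape)  # one entry per fasta record
```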

### API usage

CLIPNET models can also be loaded and used directly in Python. Individual model folds can simply be loaded with TensorFlow:

```python
import tensorflow as tf

nn = tf.keras.models.load_model("ensemble_models/fold_1.h5", compile=False)
```
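
Assuming you are working from the repository root (so the `utils` helper module is importable), a loaded fold can then be run on two-hot encoded sequences. This is only a sketch; the output order below is an assumption based on the track and quantity heads described in the next paragraph:

```python
import utils  # helper module from this repository

seqs = utils.twohot_fasta("data/test.fa")  # two-hot encoded (# sequences, 1000, 4) array
track, quantity = nn.predict(seqs)         # assumed output order: profile track, then quantity
```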

The model ensemble is constructed by averaging track and quantity outputs across all 9 model folds. To make this easy, we've provided a simple API in the `clipnet.CLIPNET` class for doing this. Moreover, to make reading fasta files into the correct format easier, we've provided the helper function `utils.twohot_fasta`. For example:

```python
import sys

sys.path.append(PATH_TO_THIS_DIRECTORY)  # path to the cloned clipnet repository

import clipnet
import utils

nn = clipnet.CLIPNET(n_gpus=0)  # by default, this will be 1 and will use CUDA
ensemble = nn.construct_ensemble()
seqs = utils.twohot_fasta("data/test.fa")

predictions = ensemble.predict(seqs)
```
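
The ensemble's outputs can then be handled the same way as the hdf5 output of `predict_ensemble.py`. A sketch, assuming `predict` returns the profile track and the quantity as a pair in that order:

```python
import numpy as np

# Sketch: rescale the ensemble's track predictions to sum to the predicted
# quantity, mirroring the hdf5 example above. Output order (track, then
# quantity) is an assumption.
track, quantity = predictions
profile_scaled = (track / np.sum(track, axis=1)[:, None]) * quantity
```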