PascalNotin
committed on
Commit
•
d1ef627
1
Parent(s):
fa87656
Streamlined HF Hub repo
- README.md +5 -119
- __init__.py +0 -1
- activations.py +0 -114
- config.py +0 -36
- model_pytorch.py +0 -917
- outputs.py +0 -48
- utils/.DS_Store +0 -0
- utils/__init__.py +0 -1
- utils/dms_utils.py +0 -26
- utils/msa_utils.py +0 -361
- utils/scoring_utils.py +0 -192
- utils/tokenizers/Basic_tokenizer +0 -1
README.md
CHANGED
@@ -1,133 +1,19 @@
|
|
1 |
-
# Tranception
|
2 |
|
3 |
-
This
|
4 |
|
5 |
## Abstract
|
6 |
The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
|
7 |
|
8 |
-
## Setup
|
9 |
-
You may download the Tranception repository and create a conda environment with the proper dependencies (as listed in `tranception_env.yml`) as follows:
|
10 |
-
```
|
11 |
-
git clone https://github.com/OATML-Markslab/Tranception.git
|
12 |
-
conda env create -f tranception_env.yml
|
13 |
-
```
|
14 |
-
|
15 |
-
## Tranception
|
16 |
-
Tranception is a novel autoregressive transformer architecture that was designed with two core principles in mind: 1) promoting specialization across attention heads, and 2) explicitly extracting patterns from contiguous subsequences.
|
17 |
-
|
18 |
-
To download the *Tranception Large* model checkpoint (~3.1GB unzipped):
|
19 |
-
```
|
20 |
-
curl -o Tranception_Large.zip https://marks.hms.harvard.edu/ProteinGym/Tranception_Large.zip
|
21 |
-
unzip Tranception_Large.zip
|
22 |
-
rm Tranception_Large.zip
|
23 |
-
```
|
24 |
-
|
25 |
-
Tranception is also made available through the [Hugging Face Hub](https://huggingface.co/OATML-Markslab/Tranception).
|
26 |
-
|
27 |
-
When scoring with retrieval, we compute weighted pseudocounts at each position using sequence weights as per the procedure described in [Hopf et al.](https://www.nature.com/articles/nbt.3769).
|
28 |
-
Weights for all proteins in the ProteinGym benchmarks may be downloaded as follows (~68M unzipped):
|
29 |
-
```
|
30 |
-
curl -o MSA_weights.zip https://marks.hms.harvard.edu/ProteinGym/MSA_weights.zip
|
31 |
-
unzip MSA_weights.zip
|
32 |
-
rm MSA_weights.zip
|
33 |
-
```
|
34 |
-
To compute sequence weights for new proteins, you may use the MSA_processing class under `tranception/utils/msa_utils.py`.
|
35 |
-
|
36 |
-
The `examples` folder provides several bash scripts that may be used for scoring and evaluating Tranception on the ProteinGym benchmarks. We also provide a colab notebook illustrating how to load Tranception from the Hugging Face hub and then score the mutated sequences in the ProteinGym benchmarks with it.
|
37 |
-
|
38 |
-
## ProteinGym
|
39 |
-
ProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays curated to enable thorough comparisons of various mutation effect predictors in different regimes. ProteinGym comprises two benchmarks: 1) a substitution benchmark, which consists of the experimental characterisation of ∼1.5M missense variants across 87 DMS assays, and 2) an indel benchmark, which includes ∼300k mutants across 7 DMS assays.
|
40 |
-
|
41 |
-
Each processed file in each benchmark corresponds to a single DMS assay, and contains the following three variables:
|
42 |
-
- mutant (str):
|
43 |
-
- for the substitution benchmark, it describes the set of substitutions to apply to the reference sequence to obtain the mutated sequence (e.g., A1P:D2N means that the amino acid 'A' at position 1 should be replaced by 'P', and 'D' at position 2 should be replaced by 'N')
|
44 |
-
- for the indel benchmark, it corresponds to the full mutated sequence
|
45 |
-
- DMS_score (float): corresponds to the experimental measurement in the DMS assay. Across all assays, the higher the DMS_score value, the higher the fitness of the mutated protein
|
46 |
-
- DMS_score_bin (int): indicates whether the DMS_score is above the fitness cutoff (1 is fit, 0 is not fit)
|
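The substitution encoding of the `mutant` field described above can be applied to a reference sequence in a few lines of Python. This is only an illustrative sketch: the helper name `apply_substitutions` and the toy reference sequence are assumptions and are not part of the Tranception codebase, and positions are taken to be 1-indexed, as in the `A1P:D2N` example.

```python
def apply_substitutions(reference: str, mutant: str) -> str:
    """Apply a colon-separated list of substitutions (e.g. "A1P:D2N") to a sequence.

    Hypothetical helper for illustration; positions are 1-indexed.
    """
    residues = list(reference)
    for substitution in mutant.split(":"):
        wild_type = substitution[0]
        position = int(substitution[1:-1])
        new_residue = substitution[-1]
        # Check that the stated wild-type residue matches the reference.
        if residues[position - 1] != wild_type:
            raise ValueError(
                f"{substitution}: expected {wild_type!r} at position {position}, "
                f"found {residues[position - 1]!r}"
            )
        residues[position - 1] = new_residue
    return "".join(residues)


mutated = apply_substitutions("ADKL", "A1P:D2N")
print(mutated)  # PNKL
```

Validating the wild-type residue before substituting catches off-by-one errors between the assay numbering and the reference sequence.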
47 |
-
|
48 |
-
Additionally, we provide reference files in the [ProteinGym folder](https://github.com/OATML-Markslab/Tranception/tree/main/ProteinGym) that give further details on each assay and contain in particular:
|
49 |
-
- The UniProt_ID of the corresponding protein, along with taxon and MSA depth category
|
50 |
-
- The target sequence (target_seq) used in the assay
|
51 |
-
- Details on how the DMS_score was created from the raw files and how it was binarized
|
52 |
-
|
53 |
-
To download the substitution benchmark (~224M unzipped):
|
54 |
-
```
|
55 |
-
curl -o ProteinGym_substitutions.zip https://marks.hms.harvard.edu/ProteinGym/ProteinGym_substitutions.zip
|
56 |
-
unzip ProteinGym_substitutions.zip
|
57 |
-
rm ProteinGym_substitutions.zip
|
58 |
-
```
|
59 |
-
|
60 |
-
Similarly, to download the indel benchmark (~86M unzipped):
|
61 |
-
```
|
62 |
-
curl -o ProteinGym_indels.zip https://marks.hms.harvard.edu/ProteinGym/ProteinGym_indels.zip
|
63 |
-
unzip ProteinGym_indels.zip
|
64 |
-
rm ProteinGym_indels.zip
|
65 |
-
```
|
66 |
-
|
67 |
-
## Fitness prediction performance
|
68 |
-
|
69 |
-
The [ProteinGym folder](https://github.com/OATML-Markslab/Tranception/tree/main/ProteinGym) provides detailed performance files for Tranception and baselines on the two ProteinGym benchmarks.
|
70 |
-
|
71 |
-
We recommend aggregating fitness prediction performance at the UniProt ID level to avoid biasing results towards proteins for which several DMS assays are available in ProteinGym. The corresponding aggregated files are suffixed with "_Uniprot_level", while the non-aggregated performance files are suffixed with "_DMS_level".
|
72 |
-
Furthermore, to enable fair comparison with models trained on multiple sequence alignments (e.g., EVE, DeepSequence, EVmutation), we only evaluate on the subset of mutations where position coverage is deemed high enough by these models to make a prediction. The corresponding files are prefixed with "All_models_". For comprehensiveness, we also provide performance files on all possible mutants available in ProteinGym, comparing only with the baselines that are able to score all mutants.
|
73 |
-
Note that for the ProteinGym indel benchmark, baselines that are able to score indels do not have the aforementioned coverage constraints (i.e., no distinction between "All_models_" and "All_mutants_") and there is at most one DMS assay per Uniprot_ID (i.e., no difference between "_Uniprot_level" and "_DMS_level"). We thus only provide one set of performance metrics for that benchmark.
|
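The UniProt-level aggregation recommended above amounts to averaging per-assay correlations within each protein first, and only then averaging across proteins. A minimal sketch, assuming the per-assay Spearman values have already been computed; the function name, the protein identifiers, and the numbers are made up for illustration and are not ProteinGym results:

```python
from collections import defaultdict
from statistics import mean


def aggregate_spearman(per_assay):
    """Return (DMS-level mean, UniProt-level mean) from (uniprot_id, spearman) pairs."""
    dms_level = mean(score for _, score in per_assay)
    by_protein = defaultdict(list)
    for uniprot_id, score in per_assay:
        by_protein[uniprot_id].append(score)
    # Average within each protein first so that proteins with many assays
    # do not dominate the final number.
    uniprot_level = mean(mean(scores) for scores in by_protein.values())
    return dms_level, uniprot_level


# Toy input: protein P1 has three assays, protein P2 has one.
toy = [("P1", 0.6), ("P1", 0.6), ("P1", 0.6), ("P2", 0.2)]
dms_level, uniprot_level = aggregate_spearman(toy)
print(round(dms_level, 3), round(uniprot_level, 3))  # 0.5 0.4
```

In the toy input the DMS-level mean is pulled towards P1 because it contributes three of the four assays, whereas the UniProt-level mean weights both proteins equally.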
74 |
-
|
75 |
-
### ProteinGym substitution benchmark - Leaderboard
|
76 |
-
The table below provides the average Spearman's rank correlation between DMS experimental fitness measurements and fitness predictions from Tranception or other baselines on the ProteinGym substitution benchmark. Following the terminology introduced above, we report the performance at the "Uniprot" level for "All models".
|
77 |
-
|
78 |
-
Rank | Model name | Spearman | Reference
|
79 |
-
--- | --- | --- | --- |
|
80 |
-
1 | Ensemble Tranception & EVE | 0.476 | [Notin et al.](https://arxiv.org/abs/2205.13760)
|
81 |
-
2 | Tranception (w/ retrieval) | 0.451 | [Notin et al.](https://arxiv.org/abs/2205.13760)
|
82 |
-
3 | EVE | 0.448 | [Frazer et al.](https://www.nature.com/articles/s41586-021-04043-8)
|
83 |
-
4 | EVmutation | 0.427 | [Hopf et al.](https://www.nature.com/articles/nbt.3769)
|
84 |
-
5 | MSA Transformer | 0.422 | [Rao et al.](https://proceedings.mlr.press/v139/rao21a.html)
|
85 |
-
6 | DeepSequence | 0.415 | [Riesselman et al.](https://www.nature.com/articles/s41592-018-0138-4)
|
86 |
-
7 | Tranception (no retrieval) | 0.406 | [Notin et al.](https://arxiv.org/abs/2205.13760)
|
87 |
-
8 | Wavenet | 0.398 | [Shin et al.](https://www.nature.com/articles/s41467-021-22732-w)
|
88 |
-
9 | Site Independent | 0.397 | [Hopf et al.](https://www.nature.com/articles/nbt.3769)
|
89 |
-
10 | ESM-1v | 0.371 | [Meier et al.](https://proceedings.neurips.cc/paper/2021/hash/f51338d736f95dd42427296047067694-Abstract.html)
|
90 |
-
|
91 |
-
### ProteinGym indel benchmark - Leaderboard
|
92 |
-
The table below provides the average Spearman's rank correlation between DMS experimental fitness measurements and fitness predictions from Tranception or other baselines on the ProteinGym indel benchmark.
|
93 |
-
|
94 |
-
Rank | Model name | Spearman | Reference
|
95 |
-
--- | --- | --- | --- |
|
96 |
-
[Notin et al.](https://arxiv.org/abs/2205.13760)
|
97 |
-
[Notin et al.](https://arxiv.org/abs/2205.13760)
|
98 |
-
[Shin et al.](https://www.nature.com/articles/s41467-021-22732-w)
|
99 |
-
|
100 |
-
## Aggregated model scoring files
|
101 |
-
The scores for all DMS assays in the ProteinGym substitution benchmark for Tranception and other baselines (e.g., EVE, Wavenet, ESM-1v, MSA Transformer) may be downloaded as follows:
|
102 |
-
```
|
103 |
-
curl -o scores_all_models_proteingym_substitutions.zip https://marks.hms.harvard.edu/ProteinGym/scores_all_models_proteingym_substitutions.zip
|
104 |
-
unzip scores_all_models_proteingym_substitutions.zip
|
105 |
-
rm scores_all_models_proteingym_substitutions.zip
|
106 |
-
```
|
107 |
-
Similarly for the indel benchmark, all scoring files may be downloaded as follows:
|
108 |
-
```
|
109 |
-
curl -o scores_all_models_proteingym_indels.zip https://marks.hms.harvard.edu/ProteinGym/scores_all_models_proteingym_indels.zip
|
110 |
-
unzip scores_all_models_proteingym_indels.zip
|
111 |
-
rm scores_all_models_proteingym_indels.zip
|
112 |
-
```
|
113 |
-
|
114 |
-
## Multiple Sequence Alignments (MSAs)
|
115 |
-
|
116 |
-
The MSAs used to train alignment-based methods or used at inference in Tranception with retrieval and MSA Transformer may be downloaded as follows (~2.2GB unzipped):
|
117 |
-
```
|
118 |
-
curl -o MSA_ProteinGym.zip https://marks.hms.harvard.edu/ProteinGym/MSA_ProteinGym.zip
|
119 |
-
unzip MSA_ProteinGym.zip
|
120 |
-
rm MSA_ProteinGym.zip
|
121 |
-
```
|
122 |
-
|
123 |
## License
|
124 |
This project is available under the MIT license.
|
125 |
|
126 |
## Reference
|
127 |
-
If you use Tranception
|
128 |
```
|
129 |
Notin, P., Dias, M., Frazer, J., Marchena-Hurtado, J., Gomez, A., Marks, D.S., Gal, Y. (2022). Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-time Retrieval. ICML.
|
130 |
```
|
131 |
|
132 |
## Links
|
133 |
-
Pre-print: https://arxiv.org/abs/2205.13760
|
|
|
|
1 |
+
# Tranception model
|
2 |
|
3 |
+
This Hugging Face Hub repo contains the model checkpoint for the Tranception model as described in our paper ["Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval"](https://arxiv.org/abs/2205.13760). The official GitHub repository can be accessed [here](https://github.com/OATML-Markslab/Tranception). This project is a joint collaboration between the [Marks lab](https://www.deboramarkslab.com/) and the [OATML group](https://oatml.cs.ox.ac.uk/).
|
4 |
|
5 |
## Abstract
|
6 |
The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
|
7 |
|
8 |
## License
|
9 |
This project is available under the MIT license.
|
10 |
|
11 |
## Reference
|
12 |
+
If you use Tranception or other files provided through our GitHub repository, please cite the following paper:
|
13 |
```
|
14 |
Notin, P., Dias, M., Frazer, J., Marchena-Hurtado, J., Gomez, A., Marks, D.S., Gal, Y. (2022). Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-time Retrieval. ICML.
|
15 |
```
|
16 |
|
17 |
## Links
|
18 |
+
Pre-print: https://arxiv.org/abs/2205.13760
|
19 |
+
GitHub: https://github.com/OATML-Markslab/Tranception
|
__init__.py
DELETED
@@ -1 +0,0 @@
-from . import config
activations.py
DELETED
@@ -1,114 +0,0 @@
-import math
-
-import torch
-from packaging import version
-from torch import nn
-
-from transformers.utils import logging
-
-
-logger = logging.get_logger(__name__)
-
-
-def _gelu_python(x):
-    """
-    Original Implementation of the GELU activation function in Google BERT repo when initially created. For
-    information: OpenAI GPT's GELU is slightly different (and gives slightly different results): 0.5 * x * (1 +
-    torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) This is now written in C in nn.functional
-    Also see the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
-    """
-    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
-
-
-def gelu_new(x):
-    """
-    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see
-    the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
-    """
-    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
-
-
-if version.parse(torch.__version__) < version.parse("1.4"):
-    gelu = _gelu_python
-else:
-    gelu = nn.functional.gelu
-
-
-def gelu_fast(x):
-    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))
-
-
-def quick_gelu(x):
-    return x * torch.sigmoid(1.702 * x)
-
-
-def _silu_python(x):
-    """
-    See Gaussian Error Linear Units (Hendrycks et al., https://arxiv.org/abs/1606.08415) where the SiLU (Sigmoid Linear
-    Unit) was originally introduced and coined, and see Sigmoid-Weighted Linear Units for Neural Network Function
-    Approximation in Reinforcement Learning (Elfwing et al., https://arxiv.org/abs/1702.03118) and Swish: a Self-Gated
-    Activation Function (Ramachandran et al., https://arxiv.org/abs/1710.05941v1) where the SiLU was experimented with
-    later.
-    """
-    return x * torch.sigmoid(x)
-
-
-if version.parse(torch.__version__) < version.parse("1.7"):
-    silu = _silu_python
-else:
-    silu = nn.functional.silu
-
-
-def _mish_python(x):
-    """
-    See Mish: A Self-Regularized Non-Monotonic Activation Function (Misra., https://arxiv.org/abs/1908.08681). Also
-    visit the official repository for the paper: https://github.com/digantamisra98/Mish
-    """
-    return x * torch.tanh(nn.functional.softplus(x))
-
-
-if version.parse(torch.__version__) < version.parse("1.9"):
-    mish = _mish_python
-else:
-    mish = nn.functional.mish
-
-
-def linear_act(x):
-    return x
-
-def squared_relu(x):
-    """
-    Squared ReLU variant that is fastest with Pytorch.
-    """
-    x = nn.functional.relu(x)
-    return x*x
-
-def squared_relu_xla(x):
-    """
-    Squared ReLU variant that is fastest with JAX.
-    """
-    x = nn.functional.relu(x)
-    return x**2
-
-tranception_ACT2FN = {
-    "relu": nn.functional.relu,
-    "silu": silu,
-    "swish": silu,
-    "gelu": gelu,
-    "tanh": torch.tanh,
-    "gelu_new": gelu_new,
-    "gelu_fast": gelu_fast,
-    "quick_gelu": quick_gelu,
-    "mish": mish,
-    "linear": linear_act,
-    "sigmoid": torch.sigmoid,
-    "squared_relu": squared_relu,
-    "squared_relu_xla": squared_relu_xla,
-}
-
-
-def get_activation(activation_string):
-    if activation_string in tranception_ACT2FN:
-        return tranception_ACT2FN[activation_string]
-    else:
-        raise KeyError(f"function {activation_string} not found in ACT2FN mapping {list(tranception_ACT2FN.keys())}")
|
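For reference, the GELU variants defined in the deleted `activations.py` differ only slightly in value. The sketch below re-implements three of them on plain floats with the standard `math` module (rather than `torch`) so that the difference can be inspected without any dependencies. The function names with the `_scalar` suffix are introduced here for illustration and do not exist in the repository.

```python
import math


def gelu_exact_scalar(x: float) -> float:
    # Exact GELU, x * Phi(x), matching the erf-based `_gelu_python`.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh_scalar(x: float) -> float:
    # Tanh approximation, matching `gelu_new`.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


def quick_gelu_scalar(x: float) -> float:
    # Sigmoid approximation, matching `quick_gelu`.
    return x / (1.0 + math.exp(-1.702 * x))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(
        f"{x:+.1f}  exact={gelu_exact_scalar(x):+.4f}  "
        f"tanh={gelu_tanh_scalar(x):+.4f}  quick={quick_gelu_scalar(x):+.4f}"
    )
```

`gelu_fast` is algebraically the same tanh form as `gelu_new`, with sqrt(2/pi) hard-coded as 0.7978845608, so it is not repeated in the sketch.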
config.py
DELETED
@@ -1,36 +0,0 @@
-from transformers import GPT2Config
-
-class TranceptionConfig(GPT2Config):
-    """
-    Config subclass for Tranception model architecture.
-    """
-    def __init__(
-        self,
-        attention_mode="tranception",
-        position_embedding="grouped_alibi",
-        tokenizer=None,
-        retrieval_aggregation_mode=None,
-        retrieval_inference_weight=0.6,
-        MSA_filename=None,
-        MSA_weight_file_name=None,
-        MSA_start=None,
-        MSA_end=None,
-        full_protein_length=None,
-        clustal_omega_location=None,
-        scoring_window=None,
-        **kwargs
-    ):
-        super().__init__(**kwargs)
-        self.model_type="tranception"
-        self.attention_mode=attention_mode
-        self.position_embedding=position_embedding
-        self.tokenizer = tokenizer
-        self.retrieval_aggregation_mode = retrieval_aggregation_mode
-        self.retrieval_inference_weight = retrieval_inference_weight
-        self.MSA_filename = MSA_filename
-        self.MSA_weight_file_name = MSA_weight_file_name
-        self.MSA_start=MSA_start
-        self.MSA_end=MSA_end
-        self.full_protein_length = full_protein_length
-        self.clustal_omega_location = clustal_omega_location
-        self.scoring_window=scoring_window
model_pytorch.py
DELETED
@@ -1,917 +0,0 @@
-from dataclasses import dataclass
-from typing import Optional, Tuple
-import math
-import os
-import pandas as pd
-
-import torch
-from torch import nn
-from torch.nn import CrossEntropyLoss, NLLLoss
-import torch.nn.functional as F
-from transformers import GPT2PreTrainedModel
-
-from transformers.modeling_utils import (
-    Conv1D,
-    PreTrainedModel,
-    SequenceSummary,
-    find_pruneable_heads_and_indices,
-    prune_conv1d_layer,
-)
-from transformers.file_utils import (
-    ModelOutput,
-    add_code_sample_docstrings,
-    add_start_docstrings,
-    add_start_docstrings_to_model_forward,
-    replace_return_docstrings
-)
-from transformers.modeling_outputs import (
-    BaseModelOutputWithPastAndCrossAttentions,
-    CausalLMOutputWithCrossAttentions,
-    SequenceClassifierOutputWithPast,
-    TokenClassifierOutput
-)
-from transformers.utils.model_parallel_utils import assert_device_map, get_device_map
-
-from tranception.activations import tranception_ACT2FN
-from tranception.config import TranceptionConfig
-from tranception.outputs import (
-    TranceptionCausalLMOutputWithCrossAttentions,
-)
-from tranception.utils import msa_utils
-from tranception.utils import scoring_utils
-
-def nanmean(v, *args, inplace=False, **kwargs):
-    if not inplace:
-        v = v.clone()
-    is_nan = torch.isnan(v)
-    v[is_nan] = 0
-    return v.sum(*args, **kwargs) / (~is_nan).float().sum(*args, **kwargs)
-
-def get_slopes(n, mode="standard_alibi", verbose=False):
-    """
-    Function to compute the m constant for each attention head. Code has been adapted from the official ALiBi codebase at:
-    https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py
-    """
-    def get_slopes_power_of_2(n):
-        start = (2**(-2**-(math.log2(n)-3)))
-        ratio = start
-        return [start*ratio**i for i in range(n)]
-    if mode=="grouped_alibi":
-        n = n // 4
-    if math.log2(n).is_integer():
-        result = get_slopes_power_of_2(n)
-    else:
-        #Workaround when the number of heads is not a power of 2
-        closest_power_of_2 = 2**math.floor(math.log2(n))
-        result = get_slopes_power_of_2(closest_power_of_2) + get_slopes(2*closest_power_of_2)[0::2][:n-closest_power_of_2]
-    if mode=="grouped_alibi":
-        result = result * 4
-    if verbose:
-        print("ALiBi slopes: {}".format(result))
-    return result
|
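As a worked example of the slope computation in `get_slopes`: for a power-of-two head count n, the slopes form a geometric sequence whose first term and ratio both equal 2^(-8/n). In `grouped_alibi` mode the sequence is computed for n // 4 heads and the resulting list is then repeated four times. The standalone sketch below reproduces only the power-of-two branch; the function name `alibi_slopes_pow2` is introduced here for illustration.

```python
import math


def alibi_slopes_pow2(n_heads: int, grouped: bool = False) -> list:
    """Power-of-two branch of the ALiBi slope computation (illustrative re-implementation)."""
    n = n_heads // 4 if grouped else n_heads
    if not math.log2(n).is_integer():
        raise ValueError("this sketch only covers power-of-two head counts")
    start = 2 ** (-(2 ** -(math.log2(n) - 3)))  # equals 2 ** (-8 / n)
    slopes = [start * start**i for i in range(n)]
    # In grouped mode the list is repeated, so four heads share each slope.
    return slopes * 4 if grouped else slopes


print(alibi_slopes_pow2(8))
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
print(alibi_slopes_pow2(16, grouped=True)[:4])
# [0.25, 0.0625, 0.015625, 0.00390625]
```

Because the grouped result is built by list repetition, heads 0, 4, 8 and 12 of a 16-head model share the slope 0.25, heads 1, 5, 9 and 13 share 0.0625, and so on.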
-
-class SpatialDepthWiseConvolution(nn.Module):
-    def __init__(self, head_dim: int, kernel_size: int = 3):
-        super().__init__()
-        self.kernel_size = kernel_size
-        self.conv = nn.Conv1d(in_channels=head_dim, out_channels=head_dim, kernel_size=(kernel_size,), padding=(kernel_size - 1,), groups=head_dim)
-
-    def forward(self, x: torch.Tensor):
-        batch_size, heads, seq_len, head_dim = x.shape
-        x = x.permute(0, 1, 3, 2).contiguous()
-        x = x.view(batch_size * heads, head_dim, seq_len)
-        x = self.conv(x)
-        if self.kernel_size>1:
-            x = x[:, :, :-(self.kernel_size - 1)]
-        x = x.view(batch_size, heads, head_dim, seq_len)
-        x = x.permute(0, 1, 3, 2)
-        return x
|
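The pad-then-trim pattern in `SpatialDepthWiseConvolution` makes the convolution causal: `nn.Conv1d` pads `kernel_size - 1` zeros on both sides, and dropping the last `kernel_size - 1` outputs leaves one output per input position that depends only on that position and the ones before it. The sketch below shows the same arithmetic for a single channel on plain Python lists. It is an illustration rather than the module's code, and `causal_conv1d` is a name introduced here.

```python
def causal_conv1d(x, kernel):
    """Single-channel cross-correlation with zero padding of len(kernel) - 1 on both
    sides, followed by trimming the last len(kernel) - 1 outputs."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x) + [0.0] * (k - 1)
    # The untrimmed output has len(x) + k - 1 entries, as with Conv1d(padding=k - 1).
    full = [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(padded) - k + 1)
    ]
    return full[: len(x)]


x = [1.0, 2.0, 3.0, 4.0]
print(causal_conv1d(x, [1.0, 1.0, 1.0]))  # [1.0, 3.0, 6.0, 9.0]
```

With an all-ones kernel of size 3, each output is the sum of the current input and the two before it, so changing a later input never alters an earlier output.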
-
-class TranceptionBlockAttention(nn.Module):
-    def __init__(self, config, is_cross_attention=False, SDWC_kernel_size=None):
-        super().__init__()
-
-        max_positions = config.max_position_embeddings
-        self.register_buffer(
-            "bias",
-            torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(
-                1, 1, max_positions, max_positions
-            ),
-        )
-        self.register_buffer("masked_bias", torch.tensor(-1e4))
-
-        self.embed_dim = config.hidden_size
-        self.num_heads = config.num_attention_heads
-        self.head_dim = self.embed_dim // self.num_heads
-        self.split_size = self.embed_dim
-        if self.head_dim * self.num_heads != self.embed_dim:
-            raise ValueError(
-                f"`embed_dim` must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`: {self.num_heads})."
-            )
-
-        self.scale_attn_weights = config.scale_attn_weights
-        self.is_cross_attention = is_cross_attention
-
-        if self.is_cross_attention:
-            self.c_attn = Conv1D(2 * self.embed_dim, self.embed_dim)
-            self.q_attn = Conv1D(self.embed_dim, self.embed_dim)
-        else:
-            self.c_attn = Conv1D(3 * self.embed_dim, self.embed_dim)
-        self.c_proj = Conv1D(self.embed_dim, self.embed_dim)
-
-        self.attn_dropout = nn.Dropout(config.attn_pdrop)
-        self.resid_dropout = nn.Dropout(config.resid_pdrop)
-
-        self.pruned_heads = set()
-
-        self.attention_mode=config.attention_mode
-
-        if self.attention_mode=="tranception":
-            assert self.num_heads%4==0, "Invalid number of heads. Tranception requires the number of heads to be a multiple of 4."
-            self.num_heads_per_kernel_size = self.num_heads // 4
-            self.query_depthwiseconv = nn.ModuleDict()
-            self.key_depthwiseconv = nn.ModuleDict()
-            self.value_depthwiseconv = nn.ModuleDict()
-            for kernel_idx, kernel in enumerate([3,5,7]):
-                self.query_depthwiseconv[str(kernel_idx)] = SpatialDepthWiseConvolution(self.head_dim,kernel)
-                self.key_depthwiseconv[str(kernel_idx)] = SpatialDepthWiseConvolution(self.head_dim,kernel)
-                self.value_depthwiseconv[str(kernel_idx)] = SpatialDepthWiseConvolution(self.head_dim,kernel)
-
-    def prune_heads(self, heads):
-        if len(heads) == 0:
-            return
-        heads, index = find_pruneable_heads_and_indices(heads, self.num_heads, self.head_dim, self.pruned_heads)
-        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])
-
-        # Prune conv1d layers
-        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)
-        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)
-
-        # Update hyper params
-        self.split_size = (self.split_size // self.num_heads) * (self.num_heads - len(heads))
-        self.num_heads = self.num_heads - len(heads)
-        self.pruned_heads = self.pruned_heads.union(heads)
-
-    def _attn(self, query, key, value, attention_mask=None, head_mask=None, alibi_bias=None):
-        attn_weights = torch.matmul(query, key.transpose(-1, -2))
-
-        if self.scale_attn_weights:
-            attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)
-
-        if not self.is_cross_attention:
-            # if only "normal" attention layer implements causal mask
-            query_length, key_length = query.size(-2), key.size(-2)
-            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
-            attn_weights = torch.where(causal_mask, attn_weights, self.masked_bias.to(attn_weights.dtype))
-
-        if alibi_bias is not None:
-            attn_weights = attn_weights + alibi_bias[:,:,:attn_weights.size(-1)]
-
-        if attention_mask is not None:
-            # Apply the attention mask
-            attn_weights = attn_weights + attention_mask
-
-        attn_weights = nn.Softmax(dim=-1)(attn_weights)
-        attn_weights = self.attn_dropout(attn_weights)
-
-        # Mask heads if we want to
-        if head_mask is not None:
-            attn_weights = attn_weights * head_mask
-
-        attn_output = torch.matmul(attn_weights, value)
-
-        return attn_output, attn_weights
-
-    def _split_heads(self, tensor, num_heads, attn_head_size):
-        """
-        Splits hidden_size dim into attn_head_size and num_heads
-        """
-        new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
-        tensor = tensor.view(*new_shape)
-        return tensor.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)
-
-    def _merge_heads(self, tensor, num_heads, attn_head_size):
-        """
-        Merges attn_head_size dim and num_attn_heads dim into hidden_size
-        """
-        tensor = tensor.permute(0, 2, 1, 3).contiguous()
-        new_shape = tensor.size()[:-2] + (num_heads * attn_head_size,)
-        return tensor.view(new_shape)
-
-    def forward(
-        self,
-        hidden_states,
-        layer_past=None,
-        attention_mask=None,
-        head_mask=None,
-        encoder_hidden_states=None,
-        encoder_attention_mask=None,
-        use_cache=False,
-        output_attentions=False,
-        alibi_bias=None,
-    ):
-        if encoder_hidden_states is not None:
-            if not hasattr(self, "q_attn"):
-                raise ValueError(
-                    "If class is used as cross attention, the weights `q_attn` have to be defined. "
-                    "Please make sure to instantiate class with `GPT2Attention(..., is_cross_attention=True)`."
-                )
-
-            query = self.q_attn(hidden_states)
-            key, value = self.c_attn(encoder_hidden_states).split(self.split_size, dim=2)
-            attention_mask = encoder_attention_mask
-        else:
-            query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
-
-        query = self._split_heads(query, self.num_heads, self.head_dim)
-        key = self._split_heads(key, self.num_heads, self.head_dim)
-        value = self._split_heads(value, self.num_heads, self.head_dim)
-
-        if layer_past is not None:
-            past_key, past_value = layer_past
-            key = torch.cat((past_key, key), dim=-2)
-            value = torch.cat((past_value, value), dim=-2)
-
-        if use_cache is True:
-            present = (key, value)
-        else:
-            present = None
-
-        if self.attention_mode=="tranception":
-            # We do not do anything on the first self.num_heads_per_kernel_size heads (kernel =1)
-            query_list=[query[:,:self.num_heads_per_kernel_size,:,:]]
-            key_list=[key[:,:self.num_heads_per_kernel_size,:,:]]
-            value_list=[value[:,:self.num_heads_per_kernel_size,:,:]]
-            for kernel_idx in range(3):
-                query_list.append(self.query_depthwiseconv[str(kernel_idx)](query[:,(kernel_idx+1)*self.num_heads_per_kernel_size:(kernel_idx+2)*self.num_heads_per_kernel_size,:,:]))
-                key_list.append(self.key_depthwiseconv[str(kernel_idx)](key[:,(kernel_idx+1)*self.num_heads_per_kernel_size:(kernel_idx+2)*self.num_heads_per_kernel_size,:,:]))
-                value_list.append(self.value_depthwiseconv[str(kernel_idx)](value[:,(kernel_idx+1)*self.num_heads_per_kernel_size:(kernel_idx+2)*self.num_heads_per_kernel_size,:,:]))
-            query=torch.cat(query_list, dim=1)
-            key=torch.cat(key_list, dim=1)
-            value=torch.cat(value_list, dim=1)
-
-        attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask, alibi_bias=alibi_bias)
-
-        attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
-        attn_output = self.c_proj(attn_output)
-        attn_output = self.resid_dropout(attn_output)
-
-        outputs = (attn_output, present)
-        if output_attentions:
-            outputs += (attn_weights,)
-
-        return outputs  # a, present, (attentions)
-
-class TranceptionBlockMLP(nn.Module):
-    def __init__(self, intermediate_size, config):
-        super().__init__()
-        embed_dim = config.hidden_size
-        self.c_fc = Conv1D(intermediate_size, embed_dim)
-        self.c_proj = Conv1D(embed_dim, intermediate_size)
-        self.act = tranception_ACT2FN[config.activation_function]
-        self.dropout = nn.Dropout(config.resid_pdrop)
-
-    def forward(self, hidden_states):
-        hidden_states = self.c_fc(hidden_states)
-        hidden_states = self.act(hidden_states)
-        hidden_states = self.c_proj(hidden_states)
-        hidden_states = self.dropout(hidden_states)
-        return hidden_states
-
-class TranceptionBlock(nn.Module):
-    def __init__(self, config, SDWC_kernel_size=None):
-        super().__init__()
-        hidden_size = config.hidden_size
-        inner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_size
-
|
287 |
-
self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
|
288 |
-
self.attn = TranceptionBlockAttention(config, SDWC_kernel_size=SDWC_kernel_size)
|
289 |
-
self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
|
290 |
-
|
291 |
-
if config.add_cross_attention:
|
292 |
-
self.crossattention = TranceptionBlockAttention(config, is_cross_attention=True, SDWC_kernel_size=SDWC_kernel_size)
|
293 |
-
self.ln_cross_attn = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
|
294 |
-
|
295 |
-
self.mlp = TranceptionBlockMLP(inner_dim, config)
|
296 |
-
|
297 |
-
def forward(
|
298 |
-
self,
|
299 |
-
hidden_states,
|
300 |
-
layer_past=None,
|
301 |
-
attention_mask=None,
|
302 |
-
head_mask=None,
|
303 |
-
encoder_hidden_states=None,
|
304 |
-
encoder_attention_mask=None,
|
305 |
-
use_cache=False,
|
306 |
-
output_attentions=False,
|
307 |
-
alibi_bias=None,
|
308 |
-
):
|
309 |
-
residual = hidden_states
|
310 |
-
hidden_states = self.ln_1(hidden_states)
|
311 |
-
attn_outputs = self.attn(
|
312 |
-
hidden_states,
|
313 |
-
layer_past=layer_past,
|
314 |
-
attention_mask=attention_mask,
|
315 |
-
head_mask=head_mask,
|
316 |
-
use_cache=use_cache,
|
317 |
-
output_attentions=output_attentions,
|
318 |
-
alibi_bias=alibi_bias,
|
319 |
-
)
|
320 |
-
attn_output = attn_outputs[0] # output_attn: a, present, (attentions)
|
321 |
-
outputs = attn_outputs[1:]
|
322 |
-
# residual connection
|
323 |
-
hidden_states = attn_output + residual
|
324 |
-
|
325 |
-
if encoder_hidden_states is not None:
|
326 |
-
# add one self-attention block for cross-attention
|
327 |
-
if not hasattr(self, "crossattention"):
|
328 |
-
raise ValueError(
|
329 |
-
f"If `encoder_hidden_states` are passed, {self} has to be instantiated with "
|
330 |
-
"cross-attention layers by setting `config.add_cross_attention=True`"
|
331 |
-
)
|
332 |
-
residual = hidden_states
|
333 |
-
hidden_states = self.ln_cross_attn(hidden_states)
|
334 |
-
cross_attn_outputs = self.crossattention(
|
335 |
-
hidden_states,
|
336 |
-
attention_mask=attention_mask,
|
337 |
-
head_mask=head_mask,
|
338 |
-
encoder_hidden_states=encoder_hidden_states,
|
339 |
-
encoder_attention_mask=encoder_attention_mask,
|
340 |
-
output_attentions=output_attentions,
|
341 |
-
)
|
342 |
-
attn_output = cross_attn_outputs[0]
|
343 |
-
# residual connection
|
344 |
-
hidden_states = residual + attn_output
|
345 |
-
outputs = outputs + cross_attn_outputs[2:] # add cross attentions if we output attention weights
|
346 |
-
|
347 |
-
residual = hidden_states
|
348 |
-
hidden_states = self.ln_2(hidden_states)
|
349 |
-
|
350 |
-
feed_forward_hidden_states = self.mlp(hidden_states)
|
351 |
-
|
352 |
-
# residual connection
|
353 |
-
hidden_states = residual + feed_forward_hidden_states
|
354 |
-
|
355 |
-
if use_cache:
|
356 |
-
outputs = (hidden_states,) + outputs
|
357 |
-
else:
|
358 |
-
outputs = (hidden_states,) + outputs[1:]
|
359 |
-
|
360 |
-
return outputs # hidden_states, present, (attentions, cross_attentions)
|
361 |
-
|
362 |
-
class TranceptionModel(GPT2PreTrainedModel):
|
363 |
-
_keys_to_ignore_on_load_missing = ["attn.masked_bias"]
|
364 |
-
def __init__(self, config):
|
365 |
-
super().__init__(config)
|
366 |
-
|
367 |
-
self.embed_dim = config.hidden_size
|
368 |
-
self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
|
369 |
-
self.position_embedding = config.position_embedding if hasattr(config, "position_embedding") else "learned"
|
370 |
-
if self.position_embedding=="learned":
|
371 |
-
self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)
|
372 |
-
self.alibi = None
|
373 |
-
elif self.position_embedding=="grouped_alibi":
|
374 |
-
maxpos = config.n_positions
|
375 |
-
attn_heads = config.n_head
|
376 |
-
self.slopes = torch.Tensor(get_slopes(attn_heads, mode=self.position_embedding))
|
377 |
-
#The softmax operation is invariant to translation, and bias functions used are always linear.
|
378 |
-
alibi = self.slopes.unsqueeze(1).unsqueeze(1) * torch.arange(maxpos).unsqueeze(0).unsqueeze(0).expand(attn_heads, -1, -1)
|
379 |
-
alibi = alibi.view(attn_heads, 1, maxpos)
|
380 |
-
self.register_buffer('alibi',alibi)
|
381 |
-
|
382 |
-
self.drop = nn.Dropout(config.embd_pdrop)
|
383 |
-
self.h = nn.ModuleList([TranceptionBlock(config) for _ in range(config.num_hidden_layers)])
|
384 |
-
self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)
|
385 |
-
|
386 |
-
self.init_weights()
|
387 |
-
|
388 |
-
# Model parallel
|
389 |
-
self.model_parallel = False
|
390 |
-
self.device_map = None
|
391 |
-
self.gradient_checkpointing = False
|
392 |
-
|
393 |
-
def parallelize(self, device_map=None, num_cores=None):
|
394 |
-
self.device_map = (
|
395 |
-
get_device_map(len(self.h), range(torch.cuda.device_count())) if device_map is None else device_map
|
396 |
-
)
|
397 |
-
device_prefix="cuda:"
|
398 |
-
assert_device_map(self.device_map, len(self.h))
|
399 |
-
self.model_parallel = True
|
400 |
-
self.first_device = "cpu" if "cpu" in self.device_map.keys() else device_prefix + str(min(self.device_map.keys()))
|
401 |
-
self.last_device = device_prefix + str(max(self.device_map.keys()))
|
402 |
-
self.wte = self.wte.to(self.first_device)
|
403 |
-
if self.position_embedding=="learned":
|
404 |
-
self.wpe = self.wpe.to(self.first_device)
|
405 |
-
for k, v in self.device_map.items():
|
406 |
-
print("k,v :"+str(k)+","+str(v))
|
407 |
-
for block in v:
|
408 |
-
cuda_device = device_prefix + str(k)
|
409 |
-
self.h[block] = self.h[block].to(cuda_device)
|
410 |
-
self.ln_f = self.ln_f.to(self.last_device)
|
411 |
-
|
412 |
-
def deparallelize(self):
|
413 |
-
self.model_parallel = False
|
414 |
-
self.device_map = None
|
415 |
-
self.first_device = "cpu"
|
416 |
-
self.last_device = "cpu"
|
417 |
-
self.wte = self.wte.to("cpu")
|
418 |
-
if self.position_embedding=="learned":
|
419 |
-
self.wpe = self.wpe.to("cpu")
|
420 |
-
for index in range(len(self.h)):
|
421 |
-
self.h[index] = self.h[index].to("cpu")
|
422 |
-
self.ln_f = self.ln_f.to("cpu")
|
423 |
-
torch.cuda.empty_cache()
|
424 |
-
|
425 |
-
def get_input_embeddings(self):
|
426 |
-
return self.wte
|
427 |
-
|
428 |
-
def set_input_embeddings(self, new_embeddings):
|
429 |
-
self.wte = new_embeddings
|
430 |
-
|
431 |
-
def _prune_heads(self, heads_to_prune):
|
432 |
-
"""
|
433 |
-
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
|
434 |
-
"""
|
435 |
-
for layer, heads in heads_to_prune.items():
|
436 |
-
self.h[layer].attn.prune_heads(heads)
|
437 |
-
|
438 |
-
def forward(
|
439 |
-
self,
|
440 |
-
input_ids=None,
|
441 |
-
past_key_values=None,
|
442 |
-
attention_mask=None,
|
443 |
-
token_type_ids=None,
|
444 |
-
position_ids=None,
|
445 |
-
head_mask=None,
|
446 |
-
inputs_embeds=None,
|
447 |
-
encoder_hidden_states=None,
|
448 |
-
encoder_attention_mask=None,
|
449 |
-
use_cache=None,
|
450 |
-
output_attentions=None,
|
451 |
-
output_hidden_states=None,
|
452 |
-
return_dict=None,
|
453 |
-
):
|
454 |
-
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
|
455 |
-
output_hidden_states = (
|
456 |
-
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
457 |
-
)
|
458 |
-
use_cache = use_cache if use_cache is not None else self.config.use_cache
|
459 |
-
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
460 |
-
|
461 |
-
if input_ids is not None and inputs_embeds is not None:
|
462 |
-
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
|
463 |
-
elif input_ids is not None:
|
464 |
-
input_shape = input_ids.size()
|
465 |
-
input_ids = input_ids.view(-1, input_shape[-1])
|
466 |
-
batch_size = input_ids.shape[0]
|
467 |
-
elif inputs_embeds is not None:
|
468 |
-
input_shape = inputs_embeds.size()[:-1]
|
469 |
-
batch_size = inputs_embeds.shape[0]
|
470 |
-
else:
|
471 |
-
raise ValueError("You have to specify either input_ids or inputs_embeds")
|
472 |
-
|
473 |
-
device = input_ids.device if input_ids is not None else inputs_embeds.device
|
474 |
-
|
475 |
-
if token_type_ids is not None:
|
476 |
-
token_type_ids = token_type_ids.view(-1, input_shape[-1])
|
477 |
-
if position_ids is not None:
|
478 |
-
position_ids = position_ids.view(-1, input_shape[-1])
|
479 |
-
|
480 |
-
if past_key_values is None:
|
481 |
-
past_length = 0
|
482 |
-
past_key_values = tuple([None] * len(self.h))
|
483 |
-
else:
|
484 |
-
past_length = past_key_values[0][0].size(-2)
|
485 |
-
if position_ids is None:
|
486 |
-
position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
|
487 |
-
position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
|
488 |
-
|
489 |
-
# GPT2Attention mask.
|
490 |
-
if attention_mask is not None:
|
491 |
-
if batch_size <= 0:
|
492 |
-
raise ValueError("batch_size has to be defined and > 0")
|
493 |
-
attention_mask = attention_mask.view(batch_size, -1)
|
494 |
-
# We create a 3D attention mask from a 2D tensor mask.
|
495 |
-
# Sizes are [batch_size, 1, 1, to_seq_length]
|
496 |
-
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
|
497 |
-
# this attention mask is more simple than the triangular masking of causal attention
|
498 |
-
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
|
499 |
-
attention_mask = attention_mask[:, None, None, :]
|
500 |
-
|
501 |
-
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
502 |
-
# masked positions, this operation will create a tensor which is 0.0 for
|
503 |
-
# positions we want to attend and -10000.0 for masked positions.
|
504 |
-
# Since we are adding it to the raw scores before the softmax, this is
|
505 |
-
# effectively the same as removing these entirely.
|
506 |
-
attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility
|
507 |
-
attention_mask = (1.0 - attention_mask) * -10000.0
|
508 |
-
|
509 |
-
# If a 2D or 3D attention mask is provided for the cross-attention
|
510 |
-
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
|
511 |
-
if self.config.add_cross_attention and encoder_hidden_states is not None:
|
512 |
-
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
|
513 |
-
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
|
514 |
-
if encoder_attention_mask is None:
|
515 |
-
encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
|
516 |
-
encoder_attention_mask = self.invert_attention_mask(encoder_attention_mask)
|
517 |
-
else:
|
518 |
-
encoder_attention_mask = None
|
519 |
-
|
520 |
-
# Prepare head mask if needed
|
521 |
-
# 1.0 in head_mask indicate we keep the head
|
522 |
-
# attention_probs has shape bsz x n_heads x N x N
|
523 |
-
# head_mask has shape n_layer x batch x n_heads x N x N
|
524 |
-
head_mask = self.get_head_mask(head_mask, self.config.n_layer)
|
525 |
-
|
526 |
-
if inputs_embeds is None:
|
527 |
-
inputs_embeds = self.wte(input_ids)
|
528 |
-
if self.position_embedding=="learned":
|
529 |
-
position_embeds = self.wpe(position_ids)
|
530 |
-
hidden_states = inputs_embeds + position_embeds
|
531 |
-
else:
|
532 |
-
hidden_states = inputs_embeds
|
533 |
-
|
534 |
-
if token_type_ids is not None:
|
535 |
-
token_type_embeds = self.wte(token_type_ids)
|
536 |
-
hidden_states = hidden_states + token_type_embeds
|
537 |
-
|
538 |
-
hidden_states = self.drop(hidden_states)
|
539 |
-
|
540 |
-
output_shape = input_shape + (hidden_states.size(-1),)
|
541 |
-
|
542 |
-
presents = () if use_cache else None
|
543 |
-
all_self_attentions = () if output_attentions else None
|
544 |
-
all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
|
545 |
-
all_hidden_states = () if output_hidden_states else None
|
546 |
-
|
547 |
-
for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
|
548 |
-
# Model parallel
|
549 |
-
if self.model_parallel:
|
550 |
-
torch.cuda.set_device(hidden_states.device)
|
551 |
-
# Ensure layer_past is on same device as hidden_states (might not be correct)
|
552 |
-
if layer_past is not None:
|
553 |
-
layer_past = tuple(past_state.to(hidden_states.device) for past_state in layer_past)
|
554 |
-
# Ensure that attention_mask is always on the same device as hidden_states
|
555 |
-
if attention_mask is not None:
|
556 |
-
attention_mask = attention_mask.to(hidden_states.device)
|
557 |
-
if isinstance(head_mask, torch.Tensor):
|
558 |
-
head_mask = head_mask.to(hidden_states.device)
|
559 |
-
if output_hidden_states:
|
560 |
-
all_hidden_states = all_hidden_states + (hidden_states,)
|
561 |
-
|
562 |
-
if self.gradient_checkpointing and self.training:
|
563 |
-
if use_cache:
|
564 |
-
logger.warning(
|
565 |
-
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
|
566 |
-
)
|
567 |
-
use_cache = False
|
568 |
-
|
569 |
-
def create_custom_forward(module):
|
570 |
-
def custom_forward(*inputs):
|
571 |
-
# None for past_key_value
|
572 |
-
return module(*inputs, use_cache, output_attentions)
|
573 |
-
|
574 |
-
return custom_forward
|
575 |
-
|
576 |
-
outputs = torch.utils.checkpoint.checkpoint(
|
577 |
-
create_custom_forward(block),
|
578 |
-
hidden_states,
|
579 |
-
None,
|
580 |
-
attention_mask,
|
581 |
-
head_mask[i],
|
582 |
-
encoder_hidden_states,
|
583 |
-
encoder_attention_mask,
|
584 |
-
)
|
585 |
-
else:
|
586 |
-
outputs = block(
|
587 |
-
hidden_states,
|
588 |
-
layer_past=layer_past,
|
589 |
-
attention_mask=attention_mask,
|
590 |
-
head_mask=head_mask[i],
|
591 |
-
encoder_hidden_states=encoder_hidden_states,
|
592 |
-
encoder_attention_mask=encoder_attention_mask,
|
593 |
-
use_cache=use_cache,
|
594 |
-
output_attentions=output_attentions,
|
595 |
-
alibi_bias=self.alibi if hasattr(self, "alibi") else None
|
596 |
-
)
|
597 |
-
|
598 |
-
hidden_states = outputs[0]
|
599 |
-
|
600 |
-
if use_cache is True:
|
601 |
-
presents = presents + (outputs[1],)
|
602 |
-
|
603 |
-
if output_attentions:
|
604 |
-
all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)
|
605 |
-
if self.config.add_cross_attention:
|
606 |
-
all_cross_attentions = all_cross_attentions + (outputs[3 if use_cache else 2],)
|
607 |
-
|
608 |
-
if self.model_parallel:
|
609 |
-
device_prefix="cuda:"
|
610 |
-
for k, v in self.device_map.items():
|
611 |
-
if i == v[-1] and device_prefix + str(k) != self.last_device:
|
612 |
-
hidden_states = hidden_states.to(device_prefix + str(k + 1))
|
613 |
-
|
614 |
-
hidden_states = self.ln_f(hidden_states)
|
615 |
-
|
616 |
-
hidden_states = hidden_states.view(*output_shape)
|
617 |
-
# Add last hidden state
|
618 |
-
if output_hidden_states:
|
619 |
-
all_hidden_states = all_hidden_states + (hidden_states,)
|
620 |
-
|
621 |
-
if not return_dict:
|
622 |
-
return tuple(
|
623 |
-
v
|
624 |
-
for v in [hidden_states, presents, all_hidden_states, all_self_attentions, all_cross_attentions]
|
625 |
-
if v is not None
|
626 |
-
)
|
627 |
-
|
628 |
-
return BaseModelOutputWithPastAndCrossAttentions(
|
629 |
-
last_hidden_state=hidden_states,
|
630 |
-
past_key_values=presents,
|
631 |
-
hidden_states=all_hidden_states,
|
632 |
-
attentions=all_self_attentions,
|
633 |
-
cross_attentions=all_cross_attentions,
|
634 |
-
)
|
635 |
-
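The comments in `TranceptionModel.forward` above describe turning a 0/1 padding mask into an additive bias via `(1.0 - attention_mask) * -10000.0`, which is then added to the raw attention scores before the softmax. The snippet below is a minimal, dependency-free sketch of that idea on a single row of scores. The helper names `additive_mask` and `softmax` are hypothetical and chosen for illustration only; the removed file does this with `torch` tensors.

```python
import math


def additive_mask(attention_mask, neg=-10000.0):
    """Map 1 (attend) -> 0.0 and 0 (masked) -> neg, as in (1.0 - mask) * -10000.0."""
    return [(1.0 - m) * neg for m in attention_mask]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One query attending over three key positions; the last one is padding.
raw_scores = [1.0, 2.0, 3.0]
mask = [1, 1, 0]

bias = additive_mask(mask)
probs = softmax([s + b for s, b in zip(raw_scores, bias)])
```

Because the bias is added before the softmax, the masked position receives a weight that is zero for practical purposes, while the remaining weights still sum to one.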
|
636 |
-
class TranceptionLMHeadModel(GPT2PreTrainedModel):
|
637 |
-
_keys_to_ignore_on_load_missing = [r"attn.masked_bias", r"attn.bias", r"lm_head.weight"]
|
638 |
-
def __init__(self, config):
|
639 |
-
super().__init__(config)
|
640 |
-
self.transformer = TranceptionModel(config)
|
641 |
-
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
|
642 |
-
self.config = config
|
643 |
-
|
644 |
-
self.init_weights()
|
645 |
-
|
646 |
-
self.default_model_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
|
647 |
-
# Model parallel
|
648 |
-
self.model_parallel = False
|
649 |
-
self.device_map = None
|
650 |
-
|
651 |
-
self.retrieval_aggregation_mode = config.retrieval_aggregation_mode if hasattr(config, "retrieval_aggregation_mode") else None
|
652 |
-
if self.retrieval_aggregation_mode is not None:
|
653 |
-
print("Model leverages both autoregressive and retrieval inference")
|
654 |
-
self.MSA_filename = config.MSA_filename if hasattr(config, "MSA_filename") else False
|
655 |
-
self.MSA_folder = '/'.join(self.MSA_filename.split(os.sep)[:-1])
|
656 |
-
self.MSA_name = self.MSA_filename.split(os.sep)[-1]
|
657 |
-
self.retrieval_inference_weight_LR = config.retrieval_inference_weight if hasattr(config, "retrieval_inference_weight") else 0.6
|
658 |
-
self.retrieval_inference_weight_RL = config.retrieval_inference_weight if hasattr(config, "retrieval_inference_weight") else 0.6
|
659 |
-
self.MSA_start=config.MSA_start
|
660 |
-
self.MSA_end=config.MSA_end
|
661 |
-
self.full_protein_length = config.full_protein_length if hasattr(config, "full_protein_length") else -1
|
662 |
-
|
663 |
-
self.MSA_log_prior = torch.log(torch.tensor(
|
664 |
-
msa_utils.get_msa_prior(
|
665 |
-
MSA_data_file=self.MSA_filename,
|
666 |
-
MSA_weight_file_name=config.MSA_weight_file_name,
|
667 |
-
retrieval_aggregation_mode=self.retrieval_aggregation_mode,
|
668 |
-
MSA_start=self.MSA_start,
|
669 |
-
MSA_end=self.MSA_end,
|
670 |
-
len_target_seq=self.full_protein_length,
|
671 |
-
vocab=config.tokenizer.get_vocab(),
|
672 |
-
verbose=False
|
673 |
-
)
|
674 |
-
).float().to(self.default_model_device))
|
675 |
-
else:
|
676 |
-
print("Model only uses autoregressive inference")
|
677 |
-
|
678 |
-
def parallelize(self, device_map=None, num_cores=None, num_pipelines=1):
|
679 |
-
self.num_pipelines=num_pipelines
|
680 |
-
self.device_map = (
|
681 |
-
get_device_map(len(self.transformer.h), range(torch.cuda.device_count()))
|
682 |
-
if device_map is None
|
683 |
-
else device_map
|
684 |
-
)
|
685 |
-
assert_device_map(self.device_map, len(self.transformer.h))
|
686 |
-
self.transformer.parallelize(self.device_map, num_cores=num_cores)
|
687 |
-
self.lm_head = self.lm_head.to(self.transformer.first_device)
|
688 |
-
self.model_parallel = True
|
689 |
-
|
690 |
-
def deparallelize(self):
|
691 |
-
self.transformer.deparallelize()
|
692 |
-
self.transformer = self.transformer.to("cpu")
|
693 |
-
self.lm_head = self.lm_head.to("cpu")
|
694 |
-
self.model_parallel = False
|
695 |
-
torch.cuda.empty_cache()
|
696 |
-
|
697 |
-
def get_output_embeddings(self):
|
698 |
-
return self.lm_head
|
699 |
-
|
700 |
-
def set_output_embeddings(self, new_embeddings):
|
701 |
-
self.lm_head = new_embeddings
|
702 |
-
|
703 |
-
def prepare_inputs_for_generation(self, input_ids, past=None, **kwargs):
|
704 |
-
token_type_ids = kwargs.get("token_type_ids", None)
|
705 |
-
# only last token for inputs_ids if past is defined in kwargs
|
706 |
-
if past:
|
707 |
-
input_ids = input_ids[:, -1].unsqueeze(-1)
|
708 |
-
if token_type_ids is not None:
|
709 |
-
token_type_ids = token_type_ids[:, -1].unsqueeze(-1)
|
710 |
-
|
711 |
-
attention_mask = kwargs.get("attention_mask", None)
|
712 |
-
position_ids = kwargs.get("position_ids", None)
|
713 |
-
|
714 |
-
if attention_mask is not None and position_ids is None:
|
715 |
-
# create position_ids on the fly for batch generation
|
716 |
-
position_ids = attention_mask.long().cumsum(-1) - 1
|
717 |
-
position_ids.masked_fill_(attention_mask == 0, 1)
|
718 |
-
if past:
|
719 |
-
position_ids = position_ids[:, -1].unsqueeze(-1)
|
720 |
-
else:
|
721 |
-
position_ids = None
|
722 |
-
|
723 |
-
return {
|
724 |
-
"input_ids": input_ids,
|
725 |
-
"past_key_values": past,
|
726 |
-
"use_cache": kwargs.get("use_cache"),
|
727 |
-
"position_ids": position_ids,
|
728 |
-
"attention_mask": attention_mask,
|
729 |
-
"token_type_ids": token_type_ids,
|
730 |
-
"flip": kwargs.get("flip", None),
|
731 |
-
}
|
732 |
-
|
733 |
-
def forward(
|
734 |
-
self,
|
735 |
-
input_ids=None,
|
736 |
-
past_key_values=None,
|
737 |
-
attention_mask=None,
|
738 |
-
token_type_ids=None,
|
739 |
-
position_ids=None,
|
740 |
-
head_mask=None,
|
741 |
-
inputs_embeds=None,
|
742 |
-
encoder_hidden_states=None,
|
743 |
-
encoder_attention_mask=None,
|
744 |
-
labels=None,
|
745 |
-
use_cache=None,
|
746 |
-
output_attentions=None,
|
747 |
-
output_hidden_states=None,
|
748 |
-
return_dict=None,
|
749 |
-
flip=None,
|
750 |
-
start_slice=None,
|
751 |
-
end_slice=None,
|
752 |
-
full_raw_sequence=None,
|
753 |
-
):
|
754 |
-
r"""
|
755 |
-
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
|
756 |
-
Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
|
757 |
-
``labels = input_ids`` Indices are selected in ``[-100, 0, ..., config.vocab_size]`` All labels set to
|
758 |
-
``-100`` are ignored (masked), the loss is only computed for labels in ``[0, ..., config.vocab_size]``
|
759 |
-
"""
|
760 |
-
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
761 |
-
|
762 |
-
transformer_outputs = self.transformer(
|
763 |
-
input_ids,
|
764 |
-
past_key_values=past_key_values,
|
765 |
-
attention_mask=attention_mask,
|
766 |
-
token_type_ids=token_type_ids,
|
767 |
-
position_ids=position_ids,
|
768 |
-
head_mask=head_mask,
|
769 |
-
inputs_embeds=inputs_embeds,
|
770 |
-
encoder_hidden_states=encoder_hidden_states,
|
771 |
-
encoder_attention_mask=encoder_attention_mask,
|
772 |
-
use_cache=use_cache,
|
773 |
-
output_attentions=output_attentions,
|
774 |
-
output_hidden_states=output_hidden_states,
|
775 |
-
return_dict=return_dict
|
776 |
-
)
|
777 |
-
hidden_states = transformer_outputs[0]
|
778 |
-
|
779 |
-
# Set device for model parallelism
|
780 |
-
if self.model_parallel:
|
781 |
-
torch.cuda.set_device(self.transformer.first_device)
|
782 |
-
hidden_states = hidden_states.to(self.lm_head.weight.device)
|
783 |
-
self.MSA_log_prior = self.MSA_log_prior.to(self.lm_head.weight.device)
|
784 |
-
|
785 |
-
lm_logits = self.lm_head(hidden_states)
|
786 |
-
|
787 |
-
loss = None
|
788 |
-
if labels is not None:
|
789 |
-
# Shift so that tokens < n predict n
|
790 |
-
shift_logits = lm_logits[..., :-1, :].contiguous()
|
791 |
-
shift_labels = labels[..., 1:].contiguous()
|
792 |
-
|
793 |
-
if self.retrieval_aggregation_mode is not None:
|
794 |
-
batch_size = input_ids.size(0)
|
795 |
-
|
796 |
-
if self.retrieval_aggregation_mode=="aggregate_indel":
|
797 |
-
assert batch_size==1, "Aggregate indel is only supported for batch size of 1"
|
798 |
-
truncated_sequence_text = full_raw_sequence[0][start_slice[0]:end_slice[0]]
|
799 |
-
if len(truncated_sequence_text)!=shift_logits.shape[1]-1: # shift_logits only has one extra token compared to truncated_sequence_text (the BOS token)
|
800 |
-
print("Tokenization error -- seq length: {} and shift_logits length - 1 : {}".format(len(full_raw_sequence),shift_logits.shape[1]-1))
|
801 |
-
MSA_log_prior, MSA_start, MSA_end = msa_utils.update_retrieved_MSA_log_prior_indel(self, self.MSA_log_prior, self.MSA_start, self.MSA_end, full_raw_sequence[0])
|
802 |
-
|
803 |
-
elif self.retrieval_aggregation_mode=="aggregate_substitution":
|
804 |
-
MSA_log_prior=self.MSA_log_prior
|
805 |
-
MSA_start=self.MSA_start
|
806 |
-
MSA_end=self.MSA_end
|
807 |
-
|
808 |
-
shift_log_probas = torch.log_softmax(shift_logits,dim=-1)
|
809 |
-
fused_shift_log_probas = shift_log_probas.clone()
|
810 |
-
if flip is None:
|
811 |
-
flip = torch.zeros(batch_size).to(fused_shift_log_probas.device)
|
812 |
-
flip = flip > 0
|
813 |
-
|
814 |
-
for seq_index in range(batch_size):
|
815 |
-
min_prior_slice = max(start_slice[seq_index], MSA_start)
|
816 |
-
max_prior_slice = min(end_slice[seq_index], MSA_end)
|
817 |
-
|
818 |
-
if max_prior_slice <= min_prior_slice:
|
819 |
-
print("Non overlapping region detected: min_prior_slice {} and max_prior_slice {}".format(min_prior_slice,max_prior_slice))
|
820 |
-
continue
|
821 |
-
|
822 |
-
slice_prior = MSA_log_prior[min_prior_slice:max_prior_slice,:].to(fused_shift_log_probas.device)
|
823 |
-
if flip[seq_index]:
|
824 |
-
slice_prior = torch.flip(slice_prior,dims=(0,))
|
825 |
-
min_logits_slice = max(0,end_slice[seq_index]-MSA_end)
|
826 |
-
max_logits_slice = min_logits_slice + (max_prior_slice-min_prior_slice)
|
827 |
-
fused_shift_log_probas[seq_index,min_logits_slice:max_logits_slice,:] = (1-self.retrieval_inference_weight_RL)*shift_log_probas[seq_index,min_logits_slice:max_logits_slice,:] + self.retrieval_inference_weight_RL*slice_prior
|
828 |
-
else:
|
829 |
-
min_logits_slice = max(0, MSA_start-start_slice[seq_index])
|
830 |
-
max_logits_slice = min_logits_slice + (max_prior_slice-min_prior_slice)
|
831 |
-
fused_shift_log_probas[seq_index,min_logits_slice:max_logits_slice,:] = (1-self.retrieval_inference_weight_LR)*shift_log_probas[seq_index,min_logits_slice:max_logits_slice,:] + self.retrieval_inference_weight_LR*slice_prior
|
832 |
-
|
833 |
-
if self.retrieval_aggregation_mode=="aggregate_indel":
|
834 |
-
try:
|
835 |
-
# If a given residue column is an added zero-column, then we overwrite prior fusion and only predict based on the autoregressive transformer inference mode.
|
836 |
-
inserted_retrieval_positions = [True if slice_prior[i].sum()==0 else False for i in range(len(slice_prior))]+[True] #Last True is for the end of sentence token
|
837 |
-
fused_shift_log_probas[:,inserted_retrieval_positions,:]=shift_log_probas[:,inserted_retrieval_positions,:]
|
838 |
-
except:
|
839 |
-
print("Error when adding zero column(s) to account for insertion mutations.")
|
840 |
-
|
841 |
-
loss_fct = NLLLoss(reduction='none')
|
842 |
-
loss = loss_fct(input=fused_shift_log_probas.view(-1, fused_shift_log_probas.size(-1)), target=shift_labels.view(-1)).view(fused_shift_log_probas.shape[0],fused_shift_log_probas.shape[1])
|
843 |
-
mask = attention_mask[..., 1:].float()
|
844 |
-
mask[mask==0]=float('nan')
|
845 |
-
loss *= mask
|
846 |
-
loss = nanmean(loss, dim=1).mean()
|
847 |
-
else:
|
848 |
-
loss_fct = CrossEntropyLoss()
|
849 |
-
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
|
850 |
-
fused_shift_log_probas = None
|
851 |
-
|
852 |
-
if not return_dict:
|
853 |
-
output = (lm_logits,) + transformer_outputs[1:]
|
854 |
-
return ((loss,) + output) if loss is not None else output
|
855 |
-
|
856 |
-
return TranceptionCausalLMOutputWithCrossAttentions(
|
857 |
-
loss=loss,
|
858 |
-
logits=lm_logits,
|
859 |
-
past_key_values=transformer_outputs.past_key_values,
|
860 |
-
hidden_states=transformer_outputs.hidden_states,
|
861 |
-
attentions=transformer_outputs.attentions,
|
862 |
-
cross_attentions=transformer_outputs.cross_attentions,
|
863 |
-
fused_shift_log_probas=fused_shift_log_probas
|
864 |
-
)
|
865 |
-
|
866 |
-
|
867 |
-
@staticmethod
|
868 |
-
def _reorder_cache(past: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]:
|
869 |
-
"""
|
870 |
-
This function is used to re-order the :obj:`past_key_values` cache if
|
871 |
-
:meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is
|
872 |
-
called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
|
873 |
-
"""
|
874 |
-
return tuple(
|
875 |
-
tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
|
876 |
-
for layer_past in past
|
877 |
-
)
|
878 |
-
|
879 |
-
def score_mutants(self, DMS_data, target_seq, scoring_mirror=True, batch_size_inference=10, num_workers=10, indel_mode=False):
|
880 |
-
"""
|
881 |
-
Method to score mutants in an input DMS file.
|
882 |
-
DMS_data: (dataframe) Dataframe containing the list of mutant triplets (substitutions) or full mutated sequences (indels) for scoring.
|
883 |
-
target_seq: (string) Full reference sequence (wild type) that is mutated in the DMS assay.
|
884 |
-
scoring_mirror: (bool) Whether to score mutated sequences from both directions (Left->Right and Right->Left).
|
885 |
-
batch_size_inference: (int) Batch size for scoring.
|
886 |
-
num_workers: (int) Number of workers to be used in the data loader.
|
887 |
-
indel_mode: (bool) Flag to be used when scoring insertions and deletions. Otherwise assumes substitutions.
|
888 |
-
"""
|
889 |
-
df = DMS_data.copy()
|
890 |
-
df['mutated_sequence'] = df['mutant'].apply(lambda x: scoring_utils.get_mutated_sequence(target_seq, x)) if not indel_mode else df['mutant']
|
891 |
-
if 'DMS_score' in df: del df['DMS_score']
|
892 |
-
if 'DMS_score_bin' in df: del df['DMS_score_bin']
|
893 |
-
df_left_to_right_slices = scoring_utils.get_sequence_slices(df, target_seq=target_seq, model_context_len = self.config.n_ctx - 2, indel_mode=indel_mode, scoring_window=self.config.scoring_window)
|
894 |
-
print("Scoring sequences from left to right")
|
895 |
-
scores_L_to_R = scoring_utils.get_tranception_scores_mutated_sequences(model=self, mutated_sequence_df=df_left_to_right_slices, batch_size_inference=batch_size_inference, score_var_name='avg_score_L_to_R', len_target_seq=len(target_seq), num_workers=num_workers, indel_mode=indel_mode)
|
896 |
-
if scoring_mirror:
|
897 |
-
print("Scoring sequences from right to left")
|
898 |
-
df_right_to_left_slices = df_left_to_right_slices.copy()
|
899 |
-
df_right_to_left_slices['mutated_sequence'] = df_right_to_left_slices['mutated_sequence'].apply(lambda x: x[::-1])
|
900 |
-
scores_R_to_L = scoring_utils.get_tranception_scores_mutated_sequences(model=self, mutated_sequence_df=df_right_to_left_slices, batch_size_inference=batch_size_inference, score_var_name='avg_score_R_to_L', len_target_seq=len(target_seq), num_workers=num_workers, reverse=True, indel_mode=indel_mode)
|
901 |
-
all_scores = pd.merge(scores_L_to_R, scores_R_to_L, on='mutant', how='left',suffixes=('','_R_to_L'))
|
902 |
-
all_scores['avg_score'] = (all_scores['avg_score_L_to_R'] + all_scores['avg_score_R_to_L']) / 2.0
|
903 |
-
else:
|
904 |
-
all_scores = scores_L_to_R
|
905 |
-
all_scores['avg_score'] = all_scores['avg_score_L_to_R']
|
906 |
-
return all_scores
|
907 |
-
|
908 |
-
def encode_batch(self, protein_sequence, sequence_name="mutated_sequence"):
|
909 |
-
"""
|
910 |
-
Method to process a batch of input AA sequences (protein_sequence) and return the tokenized sequences (via the tokenizer associated with the model).
|
911 |
-
"""
|
912 |
-
protein_sequence[sequence_name] = scoring_utils.sequence_replace(sequences=protein_sequence[sequence_name], char_to_replace='X', char_replacements='ACDEFGHIKLMNPQRSTVWY')
|
913 |
-
protein_sequence[sequence_name] = scoring_utils.sequence_replace(sequences=protein_sequence[sequence_name], char_to_replace='B', char_replacements='DN')
|
914 |
-
protein_sequence[sequence_name] = scoring_utils.sequence_replace(sequences=protein_sequence[sequence_name], char_to_replace='J', char_replacements='IL')
|
915 |
-
protein_sequence[sequence_name] = scoring_utils.sequence_replace(sequences=protein_sequence[sequence_name], char_to_replace='Z', char_replacements='EQ')
|
916 |
-
return self.config.tokenizer(list(protein_sequence[sequence_name]), add_special_tokens=True, truncation=True, padding=True, max_length=self.config.n_ctx)
|
917 |
-
|
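The mirrored scoring branch of `score_mutants` above merges the left-to-right and right-to-left scores on the `mutant` column and averages them. The sketch below isolates that step with made-up scores. The helper name `average_mirrored_scores` and the toy data are assumptions for illustration and are not part of the deleted file.

```python
import pandas as pd


def average_mirrored_scores(scores_l_to_r, scores_r_to_l):
    """Merge per-mutant scores from both reading directions and average them.

    Hypothetical helper: mirrors the merge-and-average step of score_mutants,
    using the same column names as the diff above.
    """
    merged = pd.merge(
        scores_l_to_r,
        scores_r_to_l,
        on="mutant",
        how="left",  # keep every mutant scored left-to-right
        suffixes=("", "_R_to_L"),
    )
    merged["avg_score"] = (
        merged["avg_score_L_to_R"] + merged["avg_score_R_to_L"]
    ) / 2.0
    return merged


if __name__ == "__main__":
    # Toy scores, not real model outputs.
    left = pd.DataFrame({"mutant": ["A1C", "D2E"], "avg_score_L_to_R": [-1.0, -2.0]})
    right = pd.DataFrame({"mutant": ["D2E", "A1C"], "avg_score_R_to_L": [-4.0, -3.0]})
    print(average_mirrored_scores(left, right))
```

Because the merge is a left join, a mutant missing from the right-to-left table would end up with a NaN `avg_score` rather than being dropped.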
outputs.py
DELETED
@@ -1,48 +0,0 @@
|
|
1 |
-
from dataclasses import dataclass
|
2 |
-
from typing import Optional, Tuple
|
3 |
-
|
4 |
-
import torch
|
5 |
-
|
6 |
-
from transformers.file_utils import ModelOutput
|
7 |
-
|
8 |
-
@dataclass
|
9 |
-
class TranceptionCausalLMOutputWithCrossAttentions(ModelOutput):
|
10 |
-
"""
|
11 |
-
Class for Tranception causal language model (or autoregressive) outputs.
|
12 |
-
Args:
|
13 |
-
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
|
14 |
-
Language modeling loss (for next-token prediction).
|
15 |
-
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
|
16 |
-
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
17 |
-
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
|
18 |
-
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
|
19 |
-
shape `(batch_size, sequence_length, hidden_size)`.
|
20 |
-
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
21 |
-
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
|
22 |
-
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
|
23 |
-
sequence_length)`.
|
24 |
-
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
|
25 |
-
heads.
|
26 |
-
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
|
27 |
-
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
|
28 |
-
sequence_length)`.
|
29 |
-
Cross-attention weights after the attention softmax, used to compute the weighted average in the
|
30 |
-
cross-attention heads.
|
31 |
-
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
|
32 |
-
Tuple of `torch.FloatTensor` tuples of length `config.n_layers`, with each tuple containing the cached key,
|
33 |
-
value states of the self-attention and the cross-attention layers if model is used in encoder-decoder
|
34 |
-
setting. Only relevant if `config.is_decoder = True`.
|
35 |
-
Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
|
36 |
-
`past_key_values` input) to speed up sequential decoding.
|
37 |
-
fused_shift_log_probas (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`, *optional*, returned when `config.retrieval_aggregation_mode` is not None):
|
38 |
-
Log-probabilities for each residue position after aggregating the autoregressive logits and the retrieval logits.
|
39 |
-
|
40 |
-
"""
|
41 |
-
|
42 |
-
loss: Optional[torch.FloatTensor] = None
|
43 |
-
logits: torch.FloatTensor = None
|
44 |
-
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
|
45 |
-
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
|
46 |
-
attentions: Optional[Tuple[torch.FloatTensor]] = None
|
47 |
-
cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
|
48 |
-
fused_shift_log_probas: Optional[torch.FloatTensor] = None
|
utils/.DS_Store
DELETED
Binary file (6.15 kB)
|
|
utils/__init__.py
DELETED
@@ -1 +0,0 @@
|
|
1 |
-
from . import scoring_utils, msa_utils
|
|
|
|
utils/dms_utils.py
DELETED
@@ -1,26 +0,0 @@
|
|
1 |
-
import pandas as pd
|
2 |
-
import numpy as np
|
3 |
-
|
4 |
-
def DMS_file_cleanup(DMS_filename, target_seq, start_idx=1, end_idx=None, DMS_mutant_column='mutant', DMS_phenotype_name='score', DMS_directionality=1, AA_vocab = "ACDEFGHIKLMNPQRSTVWY"):
|
5 |
-
"""
|
6 |
-
Function to process the raw DMS assay data (e.g., removing invalid mutants, aggregating silent mutations)
|
7 |
-
"""
|
8 |
-
DMS_data = pd.read_csv(DMS_filename, low_memory=False)
|
9 |
-
end_idx = start_idx + len(target_seq) - 1 if end_idx is None else end_idx
|
10 |
-
DMS_data['mutant'] = DMS_data[DMS_mutant_column]
|
11 |
-
|
12 |
-
DMS_data=DMS_data[DMS_data['mutant'].notnull()].copy()
|
13 |
-
DMS_data=DMS_data[DMS_data['mutant'].apply(lambda x: all([len(y)>=3 for y in x.split(":")]))].copy() # Each mutant triplet must have at least 3 characters
|
14 |
-
DMS_data=DMS_data[DMS_data['mutant'].apply(lambda x: all([(y[0] in AA_vocab) and (y[1:-1].isnumeric()) and (y[-1] in AA_vocab) for y in x.split(":")]))].copy()
|
15 |
-
DMS_data=DMS_data[DMS_data['mutant'].apply(lambda x: all([int(y[1:-1])-start_idx >=0 and int(y[1:-1]) <= end_idx for y in x.split(":")]))].copy()
|
16 |
-
DMS_data=DMS_data[DMS_data['mutant'].apply(lambda x: all([y[0]==target_seq[int(y[1:-1])-start_idx] for y in x.split(":")]))].copy()
|
17 |
-
|
18 |
-
DMS_data[DMS_phenotype_name]=pd.to_numeric(DMS_data[DMS_phenotype_name],errors='coerce')
|
19 |
-
DMS_data=DMS_data[np.isfinite(DMS_data[DMS_phenotype_name])]
|
20 |
-
DMS_data.dropna(subset = [DMS_phenotype_name], inplace=True)
|
21 |
-
DMS_data['DMS_score'] = DMS_data[DMS_phenotype_name] * DMS_directionality
|
22 |
-
DMS_data=DMS_data[['mutant','DMS_score']]
|
23 |
-
DMS_data = DMS_data.groupby('mutant').mean().reset_index()
|
24 |
-
|
25 |
-
return DMS_data
|
26 |
-
|
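The filters in `DMS_file_cleanup` above keep a mutant only if every colon-separated triplet is well formed, falls within the target sequence, and names the correct wild-type residue. The sketch below restates those checks as one standalone predicate. The function name `is_valid_mutant` is an assumption for illustration. It uses `str.isdecimal` where the original uses `str.isnumeric`, so that the later `int()` call cannot fail on exotic numeric characters.

```python
AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_mutant(mutant, target_seq, start_idx=1, aa_vocab=AA_VOCAB):
    """Return True if every triplet in `mutant` (e.g. "M1A:V3L") is valid.

    Hypothetical helper restating the row filters of DMS_file_cleanup.
    """
    end_idx = start_idx + len(target_seq) - 1
    for triplet in mutant.split(":"):
        # A triplet needs a wild-type AA, at least one digit, and a mutated AA.
        if len(triplet) < 3:
            return False
        wt_aa, position, mutated_aa = triplet[0], triplet[1:-1], triplet[-1]
        if wt_aa not in aa_vocab or mutated_aa not in aa_vocab:
            return False
        if not position.isdecimal():
            return False
        pos = int(position)
        # The position must fall inside the (start_idx-indexed) target sequence.
        if pos < start_idx or pos > end_idx:
            return False
        # The stated wild-type residue must match the target sequence.
        if target_seq[pos - start_idx] != wt_aa:
            return False
    return True


if __name__ == "__main__":
    for candidate in ["M1A", "M1A:V3L", "K1A", "M4A", "M1B"]:
        print(candidate, is_valid_mutant(candidate, "MKV"))
```

Applied row by row (for instance through `DataFrame.apply`), a predicate like this would replace the four chained filters with a single pass over the data.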
utils/msa_utils.py
DELETED
@@ -1,361 +0,0 @@
|
|
1 |
-
import numpy as np
|
2 |
-
import pandas as pd
|
3 |
-
from collections import defaultdict
|
4 |
-
import random
|
5 |
-
import os
|
6 |
-
import torch
|
7 |
-
from Bio.Align.Applications import ClustalOmegaCommandline
|
8 |
-
|
9 |
-
def filter_msa(msa_data, num_sequences_kept=3):
|
10 |
-
"""
|
11 |
-
Helper function to filter an input MSA msa_data (obtained via process_msa_data) and keep only num_sequences_kept aligned sequences.
|
12 |
-
If the MSA already has fewer sequences than num_sequences_kept, we keep the MSA as is.
|
13 |
-
If filtering, we always keep the first sequence of the MSA (i.e., the wild type) by default.
|
14 |
-
Sampling is done without replacement.
|
15 |
-
"""
|
16 |
-
if len(list(msa_data.keys())) <= num_sequences_kept:
|
17 |
-
return msa_data
|
18 |
-
filtered_msa = {}
|
19 |
-
wt_name = next(iter(msa_data))
|
20 |
-
filtered_msa[wt_name] = msa_data[wt_name]
|
21 |
-
del msa_data[wt_name]
|
22 |
-
sequence_names = list(msa_data.keys())
|
23 |
-
sequence_names_sampled = random.sample(sequence_names,k=num_sequences_kept-1)
|
24 |
-
for seq in sequence_names_sampled:
|
25 |
-
filtered_msa[seq] = msa_data[seq]
|
26 |
-
return filtered_msa
|
27 |
-
|
28 |
-
def process_msa_data(MSA_data_file):
|
29 |
-
"""
|
30 |
-
Helper function that takes as input a path to an MSA file (expects a2m format) and returns a dict mapping sequence ID to the corresponding AA sequence.
|
31 |
-
"""
|
32 |
-
msa_data = defaultdict(str)
|
33 |
-
sequence_name = ""
|
34 |
-
with open(MSA_data_file, "r") as msa_file:
|
35 |
-
for i, line in enumerate(msa_file):
|
36 |
-
line = line.rstrip()
|
37 |
-
if line.startswith(">"):
|
38 |
-
sequence_name = line
|
39 |
-
else:
|
40 |
-
msa_data[sequence_name] += line.upper()
|
41 |
-
return msa_data
|
42 |
-
|
43 |
-
def get_one_hot_sequences_dict(msa_data,MSA_start,MSA_end,vocab):
|
44 |
-
vocab_size = len(vocab.keys())
|
45 |
-
num_sequences_msa = len(msa_data.keys())
|
46 |
-
one_hots = np.zeros((num_sequences_msa,MSA_end-MSA_start,vocab_size))
|
47 |
-
for i,seq_name in enumerate(msa_data.keys()):
|
48 |
-
sequence = msa_data[seq_name]
|
49 |
-
for j,letter in enumerate(sequence):
|
50 |
-
if letter in vocab:
|
51 |
-
k = vocab[letter]
|
52 |
-
one_hots[i,j,k] = 1.0
|
53 |
-
return one_hots
|
54 |
-
|
55 |
-
def one_hot(sequence_string,vocab):
|
56 |
-
one_hots = np.zeros((len(sequence_string),len(vocab.keys())))
|
57 |
-
for j,letter in enumerate(sequence_string):
|
58 |
-
if letter in vocab:
|
59 |
-
k = vocab[letter]
|
60 |
-
one_hots[j,k] = 1.0
|
61 |
-
return one_hots.flatten()
|
62 |
-
|
63 |
-
def get_msa_prior(MSA_data_file, MSA_weight_file_name, MSA_start, MSA_end, len_target_seq, vocab, retrieval_aggregation_mode="aggregate_substitution", filter_MSA=True, verbose=False):
|
64 |
-
"""
|
65 |
-
Function to enable retrieval inference mode, via computation of (weighted) pseudocounts of AAs at each position of the retrieved MSA.
|
66 |
-
MSA_data_file: (string) path to MSA file (expects a2m format).
|
67 |
-
MSA_weight_file_name: (string) path to sequence weights in MSA.
|
68 |
-
MSA_start: (int) Sequence position that the MSA starts at (1-indexing).
|
69 |
-
MSA_end: (int) Sequence position that the MSA ends at (1-indexing).
|
70 |
-
len_target_seq: (int) Full length of sequence to be scored.
|
71 |
-
vocab: (dict) Vocabulary of the tokenizer.
|
72 |
-
retrieval_aggregation_mode: (string) Mode for retrieval inference (aggregate_substitution vs. aggregate_indel). If None, places a uniform prior over each token.
|
73 |
-
filter_MSA: (bool) Whether to filter out sequences with very low Hamming similarity (< 0.2) to the reference sequence in the MSA (first sequence).
|
74 |
-
verbose: (bool) Whether to print to the console processing details along the way.
|
75 |
-
"""
|
76 |
-
msa_data = process_msa_data(MSA_data_file)
|
77 |
-
vocab_size = len(vocab.keys())
|
78 |
-
if verbose: print("Target seq len is {}, MSA length is {}, start position is {}, end position is {} and vocab size is {}".format(len_target_seq,MSA_end-MSA_start,MSA_start,MSA_end,vocab_size))
|
79 |
-
|
80 |
-
if filter_MSA:
|
81 |
-
if verbose: print("Num sequences in MSA pre filtering: {}".format(len(msa_data.keys())))
|
82 |
-
list_sequence_names = list(msa_data.keys())
|
83 |
-
focus_sequence_name = list(msa_data.keys())[0]
|
84 |
-
ref_sequence_hot = one_hot(msa_data[focus_sequence_name],vocab)
|
85 |
-
for sequence_name in list_sequence_names:
|
86 |
-
seq_hot = one_hot(msa_data[sequence_name],vocab)
|
87 |
-
hamming_similarity_seq_ref = np.dot(ref_sequence_hot,seq_hot) / np.dot(ref_sequence_hot,ref_sequence_hot)
|
88 |
-
if hamming_similarity_seq_ref < 0.2:
|
89 |
-
del msa_data[sequence_name]
|
90 |
-
if verbose: print("Num sequences in MSA post filtering: {}".format(len(msa_data.keys())))
|
91 |
-
|
92 |
-
if MSA_weight_file_name is not None:
|
93 |
-
if verbose: print("Using weights in {} for sequences in MSA.".format(MSA_weight_file_name))
|
94 |
-
assert os.path.exists(MSA_weight_file_name), "Weights file not located on disk."
|
95 |
-
MSA_EVE = MSA_processing(
|
96 |
-
MSA_location=MSA_data_file,
|
97 |
-
use_weights=True,
|
98 |
-
weights_location=MSA_weight_file_name
|
99 |
-
)
|
100 |
-
#We scan through all sequences to see if we have a weight for them as per EVE pre-processing. We drop them otherwise.
|
101 |
-
dropped_sequences=0
|
102 |
-
list_sequence_names = list(msa_data.keys())
|
103 |
-
MSA_weight=[]
|
104 |
-
for sequence_name in list_sequence_names:
|
105 |
-
if sequence_name not in MSA_EVE.seq_name_to_sequence:
|
106 |
-
dropped_sequences +=1
|
107 |
-
del msa_data[sequence_name]
|
108 |
-
else:
|
109 |
-
MSA_weight.append(MSA_EVE.seq_name_to_weight[sequence_name])
|
110 |
-
if verbose: print("Dropped {} sequences from MSA due to absent sequence weights".format(dropped_sequences))
|
111 |
-
else:
|
112 |
-
MSA_weight = [1] * len(list(msa_data.keys()))
|
113 |
-
|
114 |
-
if retrieval_aggregation_mode=="aggregate_substitution" or retrieval_aggregation_mode=="aggregate_indel":
|
115 |
-
one_hots = get_one_hot_sequences_dict(msa_data,MSA_start,MSA_end,vocab)
|
116 |
-
MSA_weight = np.expand_dims(np.array(MSA_weight),axis=(1,2))
|
117 |
-
base_rate = 1e-5
|
118 |
-
base_rates = np.ones_like(one_hots) * base_rate
|
119 |
-
weighted_one_hots = (one_hots + base_rates) * MSA_weight
|
120 |
-
MSA_weight_norm_counts = weighted_one_hots.sum(axis=-1).sum(axis=0)
|
121 |
-
MSA_weight_norm_counts = np.tile(MSA_weight_norm_counts.reshape(-1,1), (1,vocab_size))
|
122 |
-
one_hots_avg = weighted_one_hots.sum(axis=0) / MSA_weight_norm_counts
|
123 |
-
msa_prior = np.zeros((len_target_seq,vocab_size))
|
124 |
-
msa_prior[MSA_start:MSA_end,:]=one_hots_avg
|
125 |
-
else:
|
126 |
-
msa_prior = np.ones((len_target_seq,vocab_size)) / vocab_size
|
127 |
-
|
128 |
-
if verbose:
|
129 |
-
for idx, position in enumerate(msa_prior):
|
130 |
-
if len(position)!=25:
|
131 |
-
print("Size error")
|
132 |
-
if not round(position.sum(),2)==1.0:
|
133 |
-
print("Position at index {} does not add up to 1: {}".format(idx, position.sum()))
|
134 |
-
|
135 |
-
return msa_prior
|
136 |
-
|
137 |
-
|
138 |
-
def update_retrieved_MSA_log_prior_indel(model, MSA_log_prior, MSA_start, MSA_end, full_raw_sequence):
|
139 |
-
"""
|
140 |
-
Function to process MSA when scoring indels.
|
141 |
-
To identify positions to add / remove in the retrieved MSA, we append and align the sequence to be scored to the original MSA for that protein family with Clustal Omega.
|
142 |
-
If the original MSA is relatively deep (over 100k sequences), we sample (by default) 100k rows at random from that MSA to speed up computations.
|
143 |
-
MSA sampling is performed only once (for the first sequence to be scored). Subsequent scoring runs reuse the same MSA sample.
|
144 |
-
"""
|
145 |
-
if not os.path.isdir(model.MSA_folder + os.sep + "Sampled"):
|
146 |
-
os.mkdir(model.MSA_folder + os.sep + "Sampled")
|
147 |
-
sampled_MSA_location = model.MSA_folder + os.sep + "Sampled" + os.sep + "Sampled_" + model.MSA_filename.split(os.sep)[-1]
|
148 |
-
|
149 |
-
if not os.path.exists(sampled_MSA_location):
|
150 |
-
msa_data = process_msa_data(model.MSA_filename)
|
151 |
-
msa_data_sampled = filter_msa(msa_data, num_sequences_kept=100000) # If the MSA has at most 100k sequences, the sample is identical to the original MSA
|
152 |
-
with open(sampled_MSA_location, 'w') as sampled_write_location:
|
153 |
-
for index, key in enumerate(msa_data_sampled):
|
154 |
-
key_name = ">REFERENCE_SEQUENCE" if index==0 else key
|
155 |
-
msa_data_sampled[key] = msa_data_sampled[key].upper()
|
156 |
-
msa_data_sampled[key] = msa_data_sampled[key].replace(".","-")
|
157 |
-
sampled_write_location.write(key_name+"\n"+"\n".join([msa_data_sampled[key][i:i+80] for i in range(0, len(msa_data_sampled[key]), 80)])+"\n")
|
158 |
-
|
159 |
-
seq_to_align_location = model.MSA_folder + os.sep + "Sampled" + os.sep + "Seq_to_align_" + model.MSA_filename.split(os.sep)[-1]
|
160 |
-
sequence_text_split = [full_raw_sequence[i:i+80] for i in range(0, len(full_raw_sequence), 80)]
|
161 |
-
sequence_text_split_split_join = "\n".join([">SEQ_TO_SCORE"]+sequence_text_split)
|
162 |
-
with open(seq_to_align_location, "w") as seq_file: seq_file.write(sequence_text_split_split_join + "\n") # write directly instead of shelling out to echo
|
163 |
-
|
164 |
-
expanded_MSA_location = model.MSA_folder + os.sep + "Sampled" + os.sep + "Expanded_" + model.MSA_filename.split(os.sep)[-1]
|
165 |
-
clustalw_cline = ClustalOmegaCommandline(cmd=model.config.clustal_omega_location,
|
166 |
-
profile1=sampled_MSA_location,
|
167 |
-
profile2=seq_to_align_location,
|
168 |
-
outfile=expanded_MSA_location,
|
169 |
-
force=True)
|
170 |
-
stdout, stderr = clustalw_cline()
|
171 |
-
msa_data = process_msa_data(expanded_MSA_location)
|
172 |
-
aligned_seqA, aligned_seqB = msa_data[">SEQ_TO_SCORE"], msa_data[">REFERENCE_SEQUENCE"]
|
173 |
-
try:
|
174 |
-
keep_column=[]
|
175 |
-
for column_index_pairwise_alignment in range(len(aligned_seqA)):
|
176 |
-
if aligned_seqA[column_index_pairwise_alignment]=="-" and aligned_seqB[column_index_pairwise_alignment]=="-":
|
177 |
-
continue
|
178 |
-
elif aligned_seqA[column_index_pairwise_alignment]=="-":
|
179 |
-
keep_column.append(False)
|
180 |
-
elif aligned_seqB[column_index_pairwise_alignment]=="-":
|
181 |
-
MSA_log_prior=torch.cat((MSA_log_prior[:column_index_pairwise_alignment], torch.zeros(1, MSA_log_prior.shape[1], dtype=MSA_log_prior.dtype, device=MSA_log_prior.device), MSA_log_prior[column_index_pairwise_alignment:]),dim=0) # zero row created on the same device as the prior
|
182 |
-
keep_column.append(True) #keep the zero column we just added
|
183 |
-
else:
|
184 |
-
keep_column.append(True)
|
185 |
-
MSA_log_prior = MSA_log_prior[keep_column]
|
186 |
-
MSA_end = MSA_start + len(MSA_log_prior)
|
187 |
-
except Exception:
|
188 |
-
print("Error when processing the following alignment: {}".format(expanded_MSA_location))
|
189 |
-
return MSA_log_prior, MSA_start, MSA_end
|
190 |
-
|
191 |
-
class MSA_processing:
|
192 |
-
def __init__(self,
|
193 |
-
MSA_location="",
|
194 |
-
theta=0.2,
|
195 |
-
use_weights=True,
|
196 |
-
weights_location="./data/weights",
|
197 |
-
preprocess_MSA=True,
|
198 |
-
threshold_sequence_frac_gaps=0.5,
|
199 |
-
threshold_focus_cols_frac_gaps=0.3,
|
200 |
-
remove_sequences_with_indeterminate_AA_in_focus_cols=True
|
201 |
-
):
|
202 |
-
|
203 |
-
"""
|
204 |
-
This MSA_processing class is directly borrowed from the EVE codebase: https://github.com/OATML-Markslab/EVE
|
205 |
-
|
206 |
-
Parameters:
|
207 |
-
- MSA_location: (path) Location of the MSA data. Constraints on input MSA format:
|
208 |
-
- focus_sequence is the first one in the MSA data
|
209 |
-
- first line is structured as follows: ">focus_seq_name/start_pos-end_pos" (e.g., >SPIKE_SARS2/310-550)
|
210 |
-
- corresponding sequence data located on following line(s)
|
211 |
-
- then all other sequences follow with ">name" on first line, corresponding data on subsequent lines
|
212 |
-
- theta: (float) Sequence weighting hyperparameter. Generally: Prokaryotic and eukaryotic families = 0.2; Viruses = 0.01
|
213 |
-
- use_weights: (bool) If False, sets all sequence weights to 1. If True, checks weights_location -- if non empty uses that;
|
214 |
-
otherwise compute weights from scratch and store them at weights_location
|
215 |
-
- weights_location: (path) Location to load from/save to the sequence weights
|
216 |
-
- preprocess_MSA: (bool) performs pre-processing of MSA to remove short fragments and positions that are not well covered.
|
217 |
-
- threshold_sequence_frac_gaps: (float, between 0 and 1) Threshold value to define fragments
|
218 |
-
- sequences with a fraction of gap characters above threshold_sequence_frac_gaps are removed
|
219 |
-
- default is set to 0.5 (i.e., fragments with 50% or more gaps are removed)
|
220 |
-
- threshold_focus_cols_frac_gaps: (float, between 0 and 1) Threshold value to define focus columns
|
221 |
-
- positions with a fraction of gap characters above threshold_focus_cols_frac_gaps will be set to lower case (and not included in the focus_cols)
|
222 |
-
- default is set to 0.3 (i.e., focus positions are the ones with 30% of gaps or less, i.e., 70% or more residue occupancy)
|
223 |
-
- remove_sequences_with_indeterminate_AA_in_focus_cols: (bool) Remove all sequences that have indeterminate AA (e.g., B, J, X, Z) at focus positions of the wild type
|
224 |
-
"""
|
225 |
-
np.random.seed(2021)
|
226 |
-
self.MSA_location = MSA_location
|
227 |
-
self.weights_location = weights_location
|
228 |
-
self.theta = theta
|
229 |
-
self.alphabet = "ACDEFGHIKLMNPQRSTVWY"
|
230 |
-
self.use_weights = use_weights
|
231 |
-
self.preprocess_MSA = preprocess_MSA
|
232 |
-
self.threshold_sequence_frac_gaps = threshold_sequence_frac_gaps
|
233 |
-
self.threshold_focus_cols_frac_gaps = threshold_focus_cols_frac_gaps
|
234 |
-
self.remove_sequences_with_indeterminate_AA_in_focus_cols = remove_sequences_with_indeterminate_AA_in_focus_cols
|
235 |
-
|
236 |
-
self.gen_alignment()
|
237 |
-
|
238 |
-
def gen_alignment(self, verbose=False):
|
239 |
-
""" Read training alignment and store basics in class instance """
|
240 |
-
self.aa_dict = {}
|
241 |
-
for i,aa in enumerate(self.alphabet):
|
242 |
-
self.aa_dict[aa] = i
|
243 |
-
|
244 |
-
self.seq_name_to_sequence = defaultdict(str)
|
245 |
-
name = ""
|
246 |
-
with open(self.MSA_location, "r") as msa_data:
|
247 |
-
for i, line in enumerate(msa_data):
|
248 |
-
line = line.rstrip()
|
249 |
-
if line.startswith(">"):
|
250 |
-
name = line
|
251 |
-
if i==0:
|
252 |
-
self.focus_seq_name = name
|
253 |
-
else:
|
254 |
-
self.seq_name_to_sequence[name] += line
|
255 |
-
|
256 |
-
|
257 |
-
## MSA pre-processing to remove inadequate columns and sequences
|
258 |
-
if self.preprocess_MSA:
|
259 |
-
msa_df = pd.DataFrame.from_dict(self.seq_name_to_sequence, orient='index', columns=['sequence'])
|
260 |
-
# Data clean up
|
261 |
-
msa_df.sequence = msa_df.sequence.apply(lambda x: x.replace(".","-")).apply(lambda x: ''.join([aa.upper() for aa in x]))
|
262 |
-
# Remove columns that would be gaps in the wild type
|
263 |
-
non_gap_wt_cols = [aa!='-' for aa in msa_df.sequence[self.focus_seq_name]]
|
264 |
-
msa_df['sequence'] = msa_df['sequence'].apply(lambda x: ''.join([aa for aa,non_gap_ind in zip(x, non_gap_wt_cols) if non_gap_ind]))
|
265 |
-
assert 0.0 <= self.threshold_sequence_frac_gaps <= 1.0,"Invalid fragment filtering parameter"
|
266 |
-
assert 0.0 <= self.threshold_focus_cols_frac_gaps <= 1.0,"Invalid focus position filtering parameter"
|
267 |
-
msa_array = np.array([list(seq) for seq in msa_df.sequence])
|
268 |
-
gaps_array = np.array(list(map(lambda seq: [aa=='-' for aa in seq], msa_array)))
|
269 |
-
# Identify fragments with too many gaps
|
270 |
-
seq_gaps_frac = gaps_array.mean(axis=1)
|
271 |
-
seq_below_threshold = seq_gaps_frac <= self.threshold_sequence_frac_gaps
|
272 |
-
if verbose: print("Proportion of sequences dropped due to fraction of gaps: "+str(round(float(1 - seq_below_threshold.sum()/seq_below_threshold.shape[0])*100,2))+"%")
|
273 |
-
# Identify focus columns
|
274 |
-
columns_gaps_frac = gaps_array[seq_below_threshold].mean(axis=0)
|
275 |
-
index_cols_below_threshold = columns_gaps_frac <= self.threshold_focus_cols_frac_gaps
|
276 |
-
if verbose: print("Proportion of non-focus columns removed: "+str(round(float(1 - index_cols_below_threshold.sum()/index_cols_below_threshold.shape[0])*100,2))+"%")
|
277 |
-
# Lower case non focus cols and filter fragment sequences
|
278 |
-
msa_df['sequence'] = msa_df['sequence'].apply(lambda x: ''.join([aa.upper() if upper_case_ind else aa.lower() for aa, upper_case_ind in zip(x, index_cols_below_threshold)]))
|
279 |
-
msa_df = msa_df[seq_below_threshold]
|
280 |
-
# Overwrite seq_name_to_sequence with clean version
|
281 |
-
self.seq_name_to_sequence = defaultdict(str)
|
282 |
-
for seq_idx in range(len(msa_df['sequence'])):
|
283 |
-
self.seq_name_to_sequence[msa_df.index[seq_idx]] = msa_df.sequence[seq_idx]
|
284 |
-
|
285 |
-
self.focus_seq = self.seq_name_to_sequence[self.focus_seq_name]
|
286 |
-
self.focus_cols = [ix for ix, s in enumerate(self.focus_seq) if s == s.upper() and s!='-']
|
287 |
-
self.focus_seq_trimmed = [self.focus_seq[ix] for ix in self.focus_cols]
|
288 |
-
self.seq_len = len(self.focus_cols)
|
289 |
-
self.alphabet_size = len(self.alphabet)
|
290 |
-
|
291 |
-
# Connect local sequence index with uniprot index (index shift inferred from 1st row of MSA)
|
292 |
-
focus_loc = self.focus_seq_name.split("/")[-1]
|
293 |
-
start,stop = focus_loc.split("-")
|
294 |
-
self.focus_start_loc = int(start)
|
295 |
-
self.focus_stop_loc = int(stop)
|
296 |
-
self.uniprot_focus_col_to_wt_aa_dict \
|
297 |
-
= {idx_col+int(start):self.focus_seq[idx_col] for idx_col in self.focus_cols}
|
298 |
-
self.uniprot_focus_col_to_focus_idx \
|
299 |
-
= {idx_col+int(start):idx_col for idx_col in self.focus_cols}
|
300 |
-
|
301 |
-
# Move all letters to CAPS; keeps focus columns only
|
302 |
-
self.raw_seq_name_to_sequence = self.seq_name_to_sequence.copy()
|
303 |
-
for seq_name,sequence in self.seq_name_to_sequence.items():
|
304 |
-
sequence = sequence.replace(".","-")
|
305 |
-
self.seq_name_to_sequence[seq_name] = [sequence[ix].upper() for ix in self.focus_cols]
|
306 |
-
|
307 |
-
# Remove sequences that have indeterminate AA (e.g., B, J, X, Z) in the focus columns
|
308 |
-
if self.remove_sequences_with_indeterminate_AA_in_focus_cols:
|
309 |
-
alphabet_set = set(list(self.alphabet))
|
310 |
-
seq_names_to_remove = []
|
311 |
-
for seq_name,sequence in self.seq_name_to_sequence.items():
|
312 |
-
for letter in sequence:
|
313 |
-
if letter not in alphabet_set and letter != "-":
|
314 |
-
seq_names_to_remove.append(seq_name)
|
315 |
-
break
|
316 |
-
seq_names_to_remove = list(set(seq_names_to_remove))
|
317 |
-
for seq_name in seq_names_to_remove:
|
318 |
-
del self.seq_name_to_sequence[seq_name]
|
319 |
-
|
320 |
-
# Encode the sequences
|
321 |
-
self.one_hot_encoding = np.zeros((len(self.seq_name_to_sequence.keys()),len(self.focus_cols),len(self.alphabet)))
|
322 |
-
if verbose: print("One-hot encoded sequences shape:" + str(self.one_hot_encoding.shape))
|
323 |
-
for i,seq_name in enumerate(self.seq_name_to_sequence.keys()):
|
324 |
-
sequence = self.seq_name_to_sequence[seq_name]
|
325 |
-
for j,letter in enumerate(sequence):
|
326 |
-
if letter in self.aa_dict:
|
327 |
-
k = self.aa_dict[letter]
|
328 |
-
self.one_hot_encoding[i,j,k] = 1.0
|
329 |
-
|
330 |
-
if self.use_weights:
|
331 |
-
try:
|
332 |
-
self.weights = np.load(file=self.weights_location)
|
333 |
-
if verbose: print("Loaded sequence weights from disk")
|
334 |
-
except Exception:
|
335 |
-
if verbose: print ("Computing sequence weights")
|
336 |
-
list_seq = self.one_hot_encoding
|
337 |
-
list_seq = list_seq.reshape((list_seq.shape[0], list_seq.shape[1] * list_seq.shape[2]))
|
338 |
-
def compute_weight(seq):
|
339 |
-
number_non_empty_positions = np.dot(seq,seq)
|
340 |
-
if number_non_empty_positions>0:
|
341 |
-
denom = np.dot(list_seq,seq) / np.dot(seq,seq)
|
342 |
-
denom = np.sum(denom > 1 - self.theta)
|
343 |
-
return 1/denom
|
344 |
-
else:
|
345 |
-
return 0.0 #return 0 weight if sequence is fully empty
|
346 |
-
self.weights = np.array(list(map(compute_weight,list_seq)))
|
347 |
-
np.save(file=self.weights_location, arr=self.weights)
|
348 |
-
else:
|
349 |
-
# If not using weights, give every sequence the same weight of 1
|
350 |
-
if verbose: print("Not weighting sequence data")
|
351 |
-
self.weights = np.ones(self.one_hot_encoding.shape[0])
|
352 |
-
|
353 |
-
self.Neff = np.sum(self.weights)
|
354 |
-
self.num_sequences = self.one_hot_encoding.shape[0]
|
355 |
-
self.seq_name_to_weight={}
|
356 |
-
for i,seq_name in enumerate(self.seq_name_to_sequence.keys()):
|
357 |
-
self.seq_name_to_weight[seq_name]=self.weights[i]
|
358 |
-
|
359 |
-
if verbose:
|
360 |
-
print ("Neff =",str(self.Neff))
|
361 |
-
print ("Data Shape =",self.one_hot_encoding.shape)
|
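The sequence reweighting inside `MSA_processing.gen_alignment` above gives each sequence a weight of one over the number of sequences whose normalized overlap with it exceeds `1 - theta`. Fully empty sequences get a weight of zero. The sketch below reproduces that rule on a toy one-hot array. The standalone function name `compute_sequence_weights` and the toy alignment are assumptions for illustration.

```python
import numpy as np


def compute_sequence_weights(one_hot, theta=0.2):
    """Reweight sequences by the size of their similarity neighborhood.

    one_hot: array of shape (num_sequences, seq_len, alphabet_size).
    Hypothetical standalone version of the nested compute_weight function.
    """
    num_sequences = one_hot.shape[0]
    flat = one_hot.reshape(num_sequences, -1)
    weights = np.zeros(num_sequences)
    for i, seq in enumerate(flat):
        number_non_empty_positions = np.dot(seq, seq)
        if number_non_empty_positions > 0:
            # Fraction of this sequence's residues shared with every sequence.
            similarity = flat @ seq / number_non_empty_positions
            weights[i] = 1.0 / np.sum(similarity > 1 - theta)
        # Fully empty (all-gap) sequences keep a weight of 0.
    return weights


if __name__ == "__main__":
    # Toy alignment over a 2-letter alphabet: two identical sequences,
    # one distinct sequence, and one all-gap sequence.
    toy = np.array(
        [
            [[1, 0], [1, 0]],
            [[1, 0], [1, 0]],
            [[0, 1], [0, 1]],
            [[0, 0], [0, 0]],
        ],
        dtype=float,
    )
    weights = compute_sequence_weights(toy)
    print(weights, "Neff =", weights.sum())
```

The sum of the weights is the effective number of sequences (`Neff` in the class above), so near-duplicate sequences count roughly once in total.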
utils/scoring_utils.py
DELETED
@@ -1,192 +0,0 @@
-import os
-import tqdm
-import re
-import numpy as np
-import pandas as pd
-
-import torch
-from torch.nn import CrossEntropyLoss, NLLLoss
-from torch.utils.data.sampler import Sampler, SequentialSampler
-
-from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast
-from datasets import Dataset
-
-AA_vocab = "ACDEFGHIKLMNPQRSTVWY"
-
-def get_mutated_sequence(focus_seq, mutant, start_idx=1, AA_vocab=AA_vocab):
-    """
-    Helper function that mutates an input sequence (focus_seq) via an input mutation triplet (substitutions only).
-    Mutation triplet are typically based on 1-indexing: start_idx is used for switching to 0-indexing.
-    """
-    mutated_seq = list(focus_seq)
-    for mutation in mutant.split(":"):
-        try:
-            from_AA, position, to_AA = mutation[0], int(mutation[1:-1]), mutation[-1]
-        except:
-            print("Issue with mutant: "+str(mutation))
-        relative_position = position - start_idx
-        assert (from_AA==focus_seq[relative_position]), "Invalid from_AA or mutant position: "+str(mutation)+" from_AA: "+str(from_AA) + " relative pos: "+str(relative_position) + " focus_seq: "+str(focus_seq)
-        assert (to_AA in AA_vocab) , "Mutant to_AA is invalid: "+str(mutation)
-        mutated_seq[relative_position] = to_AA
-    return "".join(mutated_seq)
-
-def nanmean(v, *args, inplace=False, **kwargs):
-    if not inplace:
-        v = v.clone()
-    is_nan = torch.isnan(v)
-    v[is_nan] = 0
-    return v.sum(*args, **kwargs) / (~is_nan).float().sum(*args, **kwargs)
-
-def nansum(v, *args, inplace=False, **kwargs):
-    if not inplace:
-        v = v.clone()
-    is_nan = torch.isnan(v)
-    v[is_nan] = 0
-    return v.sum(*args, **kwargs)
-
-def get_optimal_window(mutation_position_relative, seq_len_wo_special, model_window):
-    """
-    Helper function that selects an optimal sequence window that fits the maximum model context size.
-    If the sequence length is less than the maximum context size, the full sequence is returned.
-    """
-    half_model_window = model_window // 2
-    if seq_len_wo_special <= model_window:
-        return [0,seq_len_wo_special]
-    elif mutation_position_relative < half_model_window:
-        return [0,model_window]
-    elif mutation_position_relative >= seq_len_wo_special - half_model_window:
-        return [seq_len_wo_special - model_window, seq_len_wo_special]
-    else:
-        return [max(0,mutation_position_relative-half_model_window), min(seq_len_wo_special,mutation_position_relative+half_model_window)]
-
-def sequence_replace_single(sequence, char_to_replace, char_replacements):
-    char_replacements = list(char_replacements)
-    positions = [m.start() for m in re.finditer(char_to_replace, sequence)]
-    replacements = np.random.choice(a=char_replacements, size=len(positions), replace=True)
-    sequence=list(sequence)
-    for idx, position in enumerate(positions):
-        sequence[position]=replacements[idx]
-    return ''.join(sequence)
-
-def sequence_replace(sequences, char_to_replace, char_replacements):
-    """
-    Helper function that replaces all Amino Acids passsed in via char_to_replace (as a string of AAs) with Amino Acids sampled from char_replacements (also a string of eligible AAs).
-    """
-    return [sequence_replace_single(sequence, char_to_replace, char_replacements) for sequence in sequences]
-
-def get_tranception_scores_mutated_sequences(model, mutated_sequence_df, batch_size_inference, score_var_name, len_target_seq, num_workers=10, reverse=False, indel_mode=False):
-    """
-    Helper function that takes as input a set of mutated sequences (in a pandas dataframe) and returns scores for each mutation (delta log likelihood wrt wild type sequence).
-    """
-    scores = {}
-    scores['mutant']=[]
-    scores['window_start']=[]
-    scores['window_end']=[]
-    scores['score']=[]
-    with torch.no_grad():
-        ds = Dataset.from_pandas(mutated_sequence_df)
-        ds.set_transform(model.encode_batch)
-        data_collator = DataCollatorForLanguageModeling(
-                        tokenizer=model.config.tokenizer,
-                        mlm=False)
-        sampler = SequentialSampler(ds)
-        ds_loader = torch.utils.data.DataLoader(ds, batch_size=batch_size_inference, sampler=sampler, collate_fn=data_collator, num_workers=num_workers, pin_memory=True, drop_last=False)
-        mutant_index=0
-        for encoded_batch in tqdm.tqdm(ds_loader):
-            full_batch_length = len(encoded_batch['input_ids'])
-            scores['mutant'] += list(mutated_sequence_df['mutant'][mutant_index:mutant_index+full_batch_length])
-            window_start = np.array(mutated_sequence_df['window_start'][mutant_index:mutant_index+full_batch_length])
-            scores['window_start'] += list(window_start)
-            window_end = np.array(mutated_sequence_df['window_end'][mutant_index:mutant_index+full_batch_length])
-            scores['window_end'] += list(window_end)
-            full_raw_sequence = np.array(mutated_sequence_df['full_raw_sequence'][mutant_index:mutant_index+full_batch_length])
-            for k, v in encoded_batch.items():
-                if isinstance(v, torch.Tensor):
-                    encoded_batch[k] = v.to(model.device)
-            shift_labels = encoded_batch['labels'][..., 1:].contiguous()
-            if (hasattr(model.config,"retrieval_aggregation_mode")) and (model.config.retrieval_aggregation_mode is not None):
-                if reverse:
-                    encoded_batch['flip']=torch.tensor([1]*full_batch_length)
-                encoded_batch['start_slice']=window_start
-                encoded_batch['end_slice']=window_end
-                encoded_batch['full_raw_sequence'] = full_raw_sequence #only mutated_sequence is flipped if the scoring_mirror branch of score_mutants. No need to flip full_raw_sequence for MSA re-aligning
-                fused_shift_log_probas=model(**encoded_batch,return_dict=True).fused_shift_log_probas
-                loss_fct = NLLLoss(reduction='none')
-                loss = - loss_fct(input=fused_shift_log_probas.view(-1, fused_shift_log_probas.size(-1)), target=shift_labels.view(-1)).view(fused_shift_log_probas.shape[0],fused_shift_log_probas.shape[1])
-            else:
-                lm_logits=model(**encoded_batch,return_dict=True).logits
-                shift_logits = lm_logits[..., :-1, :].contiguous()
-                loss_fct = CrossEntropyLoss(reduction='none')
-                loss = - loss_fct(input=shift_logits.view(-1, shift_logits.size(-1)), target=shift_labels.view(-1)).view(shift_logits.shape[0],shift_logits.shape[1])
-            mask = encoded_batch['attention_mask'][..., 1:].float()
-            mask[mask==0]=float('nan')
-            loss *= mask
-            loss = nanmean(loss, dim=1)
-            scores_batch = list(loss.cpu().numpy())
-            full_batch_length = len(encoded_batch['input_ids'])
-            scores['score'] += scores_batch
-            mutant_index+=full_batch_length
-    scores = pd.DataFrame(scores)
-    scores_mutated_seq = scores[scores.mutant != 'wt']
-    scores_wt = scores[scores.mutant == 'wt']
-    delta_scores = pd.merge(scores_mutated_seq,scores_wt,how='left',on=['window_start'],suffixes=('','_wt'))
-    delta_scores[score_var_name] = delta_scores['score'] - delta_scores['score_wt']
-    delta_scores=delta_scores[['mutant',score_var_name]].groupby('mutant').mean().reset_index()
-    return delta_scores
-
-def get_sequence_slices(df, target_seq, model_context_len, start_idx=1, scoring_window="optimal", indel_mode=False):
-    """
-    Helper function that takes as input a (pandas) dataframe df that contains a list of mutant triplets (substitutions) or full mutated sequences (indels) for scoring.
-    It returns a processed DMS in which sequences have been sliced to satisfy the maximum context window of the model.
-    df: (dataframe) Input dataframe to be processed
-    target_seq: (string) Full reference sequence (wild type) that is mutated in the DMS assay.
-    model_context_len: (int) Maximum context size for the model.
-    start_idx: (int) Integer to move to 0-indexing of positions (mutation triplet are typically based on 1-indexing).
-    scoring_window: (string) Method to slice sequences longer than maximum context size:
-        - optimal selects a single window as large as possible via the get_optimal_window function (this is the default)
-        - sliding splits the full sequence in contiguous (non-overlapping) chunks that are of size equal to the max context (except the last chunk which may be shorter)
-    indel_mode: (bool) Flag to be used when scoring insertions and deletions. Otherwise assumes substitutions.
-    Note: when scoring indels for sequences that would be longer than the model max context length, it is preferable to use the "sliding" scoring_window. Use "optimal" otherwise.
-    """
-    len_target_seq = len(target_seq)
-    num_mutants = len(df['mutant'])
-    df=df.reset_index(drop=True)
-    if scoring_window=="optimal":
-        df['mutation_barycenter'] = df['mutant'].apply(lambda x: int(np.array([int(mutation[1:-1]) - start_idx for mutation in x.split(':')]).mean())) if not indel_mode else df['mutant'].apply(lambda x: len(x)//2)
-        df['scoring_optimal_window'] = df['mutation_barycenter'].apply(lambda x: get_optimal_window(x, len_target_seq, model_context_len)) if not indel_mode else df['mutant'].apply(lambda x: (0,len(x)))
-        df['full_raw_sequence'] = df['mutated_sequence']
-        df['mutated_sequence'] = [df['mutated_sequence'][index][df['scoring_optimal_window'][index][0]:df['scoring_optimal_window'][index][1]] for index in range(num_mutants)]
-        df['window_start'] = df['scoring_optimal_window'].map(lambda x: x[0])
-        df['window_end'] = df['scoring_optimal_window'].map(lambda x: x[1])
-        del df['scoring_optimal_window']
-        df_wt=df.copy()
-        df_wt['mutant'] = ['wt'] * num_mutants
-        df_wt['full_raw_sequence'] = [target_seq] * num_mutants
-        if indel_mode: # For indels, we set the wild type reference to be always the same (full length) sequence. We assume here that the length is lower than model context size (otherwise use "Sliding")
-            df_wt['mutation_barycenter'] = [len_target_seq // 2] * num_mutants
-            df_wt['window_end'] = df_wt['full_raw_sequence'].map(lambda x:len(x))
-        df_wt['mutated_sequence'] = [target_seq[df_wt['window_start'][index]:df_wt['window_end'][index]] for index in range(num_mutants)]
-        df = pd.concat([df,df_wt], axis=0)
-        df = df.drop_duplicates()
-    elif scoring_window=="sliding":
-        len_target_seq = len(target_seq)
-        num_windows = 1 + int( len_target_seq / model_context_len)
-        df_list=[]
-        start=0
-        for window_index in range(1, num_windows+1):
-            df_sliced = df.copy()
-            df_sliced['full_raw_sequence'] = df_sliced['mutated_sequence']
-            df_sliced['mutated_sequence'] = df_sliced['mutated_sequence'].map(lambda x: x[start:start+model_context_len])
-            df_sliced['window_start'] = [start] * num_mutants
-            df_sliced['window_end'] = df_sliced['full_raw_sequence'].map(lambda x: min(len(x), start+model_context_len))
-            df_sliced_wt = df_sliced.copy()
-            df_sliced_wt['mutant'] = ['wt'] * num_mutants
-            df_sliced_wt['full_raw_sequence'] = [target_seq] * num_mutants
-            df_sliced_wt['mutated_sequence'] = df_sliced_wt['full_raw_sequence'].map(lambda x: x[start:start+model_context_len])
-            df_sliced_wt['window_end'] = df_sliced_wt['full_raw_sequence'].map(lambda x: min(len(x), start+model_context_len)) #Need to adjust end index if WT and sequence are not same full length
-            df_list.append(df_sliced)
-            df_list.append(df_sliced_wt)
-            start += model_context_len
-        df_final = pd.concat(df_list,axis=0)
-        df = df_final.drop_duplicates()
-    return df.reset_index(drop=True)
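The two windowing strategies described in the deleted `utils/scoring_utils.py` ("optimal" and "sliding") can be tried out in isolation. The sketch below restates `get_optimal_window` from the diff without any dependencies; `sliding_windows` is a hypothetical helper added here for illustration, and it is a simplified variant of the sliding branch that never emits an empty trailing window. The sequence lengths and context sizes are toy values, not settings from the repository.

```python
def get_optimal_window(mutation_position_relative, seq_len_wo_special, model_window):
    """Pick one [start, end) window of at most model_window residues around a mutation."""
    half_model_window = model_window // 2
    if seq_len_wo_special <= model_window:
        # Whole sequence fits in the context
        return [0, seq_len_wo_special]
    elif mutation_position_relative < half_model_window:
        # Mutation near the start: anchor the window at the left edge
        return [0, model_window]
    elif mutation_position_relative >= seq_len_wo_special - half_model_window:
        # Mutation near the end: anchor the window at the right edge
        return [seq_len_wo_special - model_window, seq_len_wo_special]
    else:
        # Otherwise centre the window on the mutation
        return [max(0, mutation_position_relative - half_model_window),
                min(seq_len_wo_special, mutation_position_relative + half_model_window)]

def sliding_windows(seq_len, model_window):
    """Hypothetical helper: contiguous, non-overlapping (start, end) chunks."""
    return [(start, min(seq_len, start + model_window))
            for start in range(0, seq_len, model_window)]

# Toy values: a 100-residue sequence and a 10-residue context
for position in (2, 50, 97):
    print(position, get_optimal_window(position, 100, 10))
print(sliding_windows(25, 10))
```

With "optimal", each mutant is scored in a single window chosen around its mutation barycenter; with "sliding", every chunk of the sequence is scored and the per-window deltas are later averaged per mutant by the `groupby('mutant').mean()` step in the scoring function.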
utils/tokenizers/Basic_tokenizer
DELETED
@@ -1 +0,0 @@
-{"version":"1.0","truncation":null,"padding":null,"added_tokens":[{"id":0,"special":true,"content":"[UNK]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":1,"special":true,"content":"[CLS]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":2,"special":true,"content":"[SEP]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":3,"special":true,"content":"[PAD]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":4,"special":true,"content":"[MASK]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false}],"normalizer":null,"pre_tokenizer":{"type":"Whitespace"},"post_processor":{"type":"TemplateProcessing","single":[{"SpecialToken":{"id":"[CLS]","type_id":0}},{"Sequence":{"id":"A","type_id":0}},{"SpecialToken":{"id":"[SEP]","type_id":0}}],"pair":[{"SpecialToken":{"id":"[CLS]","type_id":0}},{"Sequence":{"id":"A","type_id":0}},{"SpecialToken":{"id":"[SEP]","type_id":0}},{"Sequence":{"id":"B","type_id":1}},{"SpecialToken":{"id":"[SEP]","type_id":1}}],"special_tokens":{"[CLS]":{"id":"[CLS]","ids":[1],"tokens":["[CLS]"]},"[SEP]":{"id":"[SEP]","ids":[2],"tokens":["[SEP]"]}}},"decoder":null,"model":{"type":"BPE","dropout":null,"unk_token":"[UNK]","continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"vocab":{"[UNK]":0,"[CLS]":1,"[SEP]":2,"[PAD]":3,"[MASK]":4,"A":5,"C":6,"D":7,"E":8,"F":9,"G":10,"H":11,"I":12,"K":13,"L":14,"M":15,"N":16,"P":17,"Q":18,"R":19,"S":20,"T":21,"V":22,"W":23,"Y":24},"merges":[]}}
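The tokenizer definition above amounts to a per-residue vocabulary (five special tokens followed by the 20 amino acids, with no BPE merges) wrapped in a `[CLS] ... [SEP]` template. The sketch below mimics that mapping in plain Python so the resulting ids are easy to inspect. It is an approximation written for illustration, not a call into the `tokenizers` library; the `encode` function and the handling of unknown characters via `[UNK]` are assumptions.

```python
# Vocabulary rebuilt from the JSON above: special tokens take ids 0-4,
# amino acids follow in the order of AA_vocab from scoring_utils.py.
SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"

vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
vocab.update({aa: idx + len(SPECIAL_TOKENS) for idx, aa in enumerate(AA_VOCAB)})

def encode(sequence):
    """Hypothetical helper: map a protein string to ids with [CLS]/[SEP] added."""
    ids = [vocab["[CLS]"]]
    # Residues outside the 20-letter alphabet are mapped to [UNK] (assumption)
    ids += [vocab.get(residue, vocab["[UNK]"]) for residue in sequence]
    ids.append(vocab["[SEP]"])
    return ids

print(encode("ACY"))
```

Because there are no merges, the encoded length is always the sequence length plus two, which is why the scoring code slices sequences to the model context size before tokenization.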