nferruz Bienenwolf655 committed on
Commit 8fada0b
1 Parent(s): 0a66cf9

Update README.md (#2)


- Update README.md (0c9bb0a56d3e4a6b4912c76a2ffc5a53ea36e09a)


Co-authored-by: Sebastian Lindner <Bienenwolf655@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +140 -38
README.md CHANGED
@@ -4,70 +4,172 @@ pipeline_tag: translation
  tags:
  - chemistry
  - biology
  ---

  # **Contributors**

- - Sebastian Lindner (GitHub [@Bienenwolf655](https://www.google.com); Twitter @)
- - Michael Heinzinger (GitHub @mheinzinger; Twitter @)
- - Noelia Ferruz (GitHub @noeliaferruz; Twitter @ferruz_noelia; Webpage: www.aiproteindesign.com )

- # **REXyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
  **Work in Progress**

- REXyme (Reaction to Enzyme) (manuscript in preparation) is a translation machine for the generation of enzyme that catalize user-defined reactions.

  It is possible to provide fine-grained input at the substrate level.
  Akin to how translation machines have learned to translate between complex language pairs with great success,
  often diverging in their representation at the character level (e.g., Japanese - English), we posit that an advanced architecture will
- be able to translate between the chemical and sequence spaces. REXyme was trained on a set of xx reactions and yy enzyme pairs and it produces
  sequences that putatively perform their intended reactions.

- To run it, you will need to provide a reaction in the SMILE format (Simplified molecular-input line-entry system),
- which you can do online here: xxxx

- We are still working in the analysis of the model for different tasks, including experimental testing.
- See below for information about the models' performance in different in-silico tasks and how to generate your own enzymes.

- ## **Model description**
- REXyme is based on the [Efficient T5 Transformer](xx) architecture (which in turn is very similar to the current version of Google Translator)
- and contains xx layers
- with a model dimensionality of xx, totaling xx million parameters.

- REXyme is a translation machine trained on the xx database containing xx reaction-enzyme pairs.
- The pre-training was done on pairs of smiles and ... (fasta headers?),

- ZymCTRL was trained with an autoregressive objective (this is not right, check it ??) i.e., the model learns to predict a missing
- token in the encoder's input. Hence,
- the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.

- Sebastian check if this applies?? There are stark differences in the number of members among EC classes, and for this reason, we also tokenized the EC numbers.
- In this manner, EC numbers '2.7.1.1' and '2.7.1.2' share the first three tokens (six, including separators), and hence the model can infer that
- there are relationships between the two classes.

- The figure below summarizes the process of training: (add figure)

- ## **Model Performance**

- - explain dataset curation
- - general descriptors (esmfold, iuored.. )
- - second pgp
- - mmseqs (Average?)

- ## **How to generate from REXyme**
- REXyme can be used with the HuggingFace transformer python package.
- Detailed installation instructions can be found here: https://huggingface.co/docs/transformers/installation

- Since REXyme has been trained on the objective of machine translation, users have to specify a chemical reaction, specified in the format of SMILES.

- [please seb include snippet to generate sequences]

- ## **A word of caution**

- - We have not yet fully tested the ability of the model for the generation of new-to-nature enzymes, i.e.,
- with chemical reactions that do not appear in Nature (and hence neither in the training set). While this is the intended objective of our work,
- it is very much work in progress. We'll uptadate the model and documentation shortly.
-
  tags:
  - chemistry
  - biology
+ widget:
+ - text: "r2sNC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O.*N[C@@H](CO)C(*)=O>>NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O.[H+].*N[C@@H](COP(=O)([O-])[O-])C(*)=O</s>"
+ inference:
+   parameters:
+     top_k: 15
+     top_p: 0.92
+     repetition_penalty: 1.2
  ---

  # **Contributors**

+ - Sebastian Lindner (GitHub [@Bienenwolf655](https://github.com/Bienenwolf655); Twitter [@lindner_seb](https://twitter.com/lindner_seb))
+ - Michael Heinzinger (GitHub [@mheinzinger](https://github.com/mheinzinger); Twitter [@HeinzingerM](https://twitter.com/HeinzingerM))
+ - Noelia Ferruz (GitHub [@noeliaferruz](https://github.com/noeliaferruz); Twitter [@ferruz_noelia](https://twitter.com/ferruz_noelia); Webpage: [www.aiproteindesign.com](https://www.aiproteindesign.com))

+
+ # **REXzyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
  **Work in Progress**

+ REXzyme (Reaction to Enzyme) (manuscript in preparation) is a translation machine for the generation of enzymes that catalyze user-defined reactions.
+
+ ![Inference of REXzyme](./figures__.004.jpeg)
+
  It is possible to provide fine-grained input at the substrate level.
  Akin to how translation machines have learned to translate between complex language pairs with great success,
  often diverging in their representation at the character level (e.g., Japanese - English), we posit that an advanced architecture will
+ be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of xx reactions and yy enzyme pairs, and it produces
  sequences that putatively perform their intended reactions.

+ To run it, you will need to provide a reaction in the SMILES format (simplified molecular-input line-entry system), which you can do online here: https://cactus.nci.nih.gov/chemical/structure.

+ After converting each of the reaction components, combine them in the following scheme: ReactantA.ReactantB>AgentA>ProductA.ProductB
+ Additionally, prepend the task prefix "r2s" and append the EOS token "</s>",
+ e.g., for carbonic anhydrase (CO2 + H2O >> carbonic acid + H+): "r2sO=C=O.O>>OC(=O)O.[H+]</s>"

+ or build the input via this simple Python script:

+ ```python
+ # Left of the '=': reactants (separated by '+'); right: products (separated by '+').
+ reactions = "CO2 + H2O = carbonic acid + H+"
+ # Agents (separated by '+'); empty for this reaction.
+ agent = ""
+
+ # Name-to-SMILES conversion via the CACTUS resolver, adapted from:
+ # https://stackoverflow.com/questions/54930121/converting-molecule-name-to-smiles
+ from urllib.request import urlopen
+ from urllib.parse import quote
+
+ def CIRconvert(ids):
+     try:
+         url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
+         return urlopen(url).read().decode('utf8')
+     except Exception:
+         return 'Did not work'
+
+ # Split on '=' and '+' first, then strip spaces, so multi-word names survive.
+ reagent = [CIRconvert(i.strip()) for i in reactions.split('=')[0].split('+') if i.strip()]
+ agent = [CIRconvert(i.strip()) for i in agent.split('+') if i.strip()]
+ product = [CIRconvert(i.strip()) for i in reactions.split('=')[1].split('+') if i.strip()]
+ print(f"r2s{'.'.join(reagent)}>{'.'.join(agent)}>{'.'.join(product)}</s>")
+ ```
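+
+ For the carbonic anhydrase example above, this prints the assembled model input (of the form "r2s<reactants>><products></s>" when there are no agents); the exact SMILES strings depend on what the CACTUS resolver returns for each name.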

+ We are still working on the analysis of the model for different tasks, including experimental testing.
+ See below for information about the model's performance on different in-silico tasks and how to generate your own enzymes.

+ ## **Model description**

+ REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/google/t5-efficient-large) architecture (which in turn is very similar to the current version of Google Translate)
+ and contains 48 layers (24 encoder / 24 decoder) with a model dimensionality of 1024, totaling 737.72 million parameters.

+ REXzyme is a translation machine trained on a portion of the RHEA database containing 31,970,152 reaction-enzyme pairs.
+ The pre-training was done on pairs of SMILES and amino acid sequences, tokenized with a character-level
+ SentencePiece tokenizer. Note that two separate tokenizers were used for the inputs (reaction SMILES) and the labels (protein sequences).
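+
+ As a minimal sketch of this two-tokenizer setup (the paths and the short enzyme fragment below are placeholders, not shipped artifacts), the reaction input and the target sequence are encoded independently:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Hypothetical local paths to the two SentencePiece tokenizers.
+ tokenizer_smiles = AutoTokenizer.from_pretrained("/path/to/tokenizer_smiles")  # inputs: reaction SMILES
+ tokenizer_aa = AutoTokenizer.from_pretrained("/path/to/tokenizer_aa")          # labels: amino acids
+
+ reaction = "r2sO=C=O.O>>OC(=O)O.[H+]</s>"  # task prefix + reaction + EOS
+ sequence = "MSHHWGYGKHNGPEHWHKDF"          # placeholder enzyme fragment
+
+ input_ids = tokenizer_smiles(reaction, return_tensors="pt").input_ids
+ labels = tokenizer_aa(sequence, return_tensors="pt").input_ids
+ ```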

+ REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to use the continuous representation of the reaction from the encoder to autoregressively (causal language modeling) produce an output matching the shifted-right target enzyme sequence. Hence, the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.
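+
+ A minimal sketch of this objective (reusing the tokenized `input_ids`/`labels` pair from the sketch above; when `labels` are passed, `T5ForConditionalGeneration` builds the shifted-right decoder input internally):
+
+ ```python
+ from transformers import T5ForConditionalGeneration
+
+ # Hypothetical checkpoint path, as in the generation snippet below.
+ model = T5ForConditionalGeneration.from_pretrained("/path/to/REXzyme")
+
+ # Cross-entropy between next-token predictions and the target enzyme sequence.
+ outputs = model(input_ids=input_ids, labels=labels)
+ outputs.loss.backward()  # one supervised translation step
+ ```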

+ There are stark differences in the number of members among reaction classes. Since we tokenize the reaction SMILES at the character level, classes with few reactions can profit from the knowledge gained on well-populated classes that catalyze similar reactions.

+ The figure below summarizes the process of training: (add figure) [STILL MISSING!]

+ ## **Model Performance**

+ - **Dataset curation**
+ <br/><br/>
+ - **General descriptors**
+
+ | Method | Natural | Generated |
+ | :--- | :---: | :---: |
+ | **IUPRED3 (ordered)** | 99.9% | 99.9% |
+ | **ESMFold** | 85.03 | 71.59 (selected: 79.82) |
+ | **FlDPnn** | missing | missing |
+ | **PSIpred** | missing | missing |
+ <br/><br/>
+
+ - **PGP pipeline**
+
+ | Method | Natural | Generated |
+ | :--- | :--- | :--- |
+ | **Disorder** | 11.473 | 11.467 |
+ | **pggp3** (3-state secondary structure) | L: 42%, H: 41%, E: 18% | L: 45%, H: 39%, E: 16% |
+ | **pggp8** (8-state secondary structure) | C: 25%, H: 38%, T: 10%, S: 5%, I: 0%, E: 19%, G: 2%, B: 0% | C: 29%, H: 38%, T: 10%, S: 4%, I: 0%, E: 17%, G: 3%, B: 0% |
+ | **CATH Classes** | Mainly Beta: 6%, Alpha Beta: 78%, Mainly Alpha: 16%, Special: 0%, Few Secondary Structures: 0% | Mainly Beta: 4%, Alpha Beta: 87%, Mainly Alpha: 9%, Special: 0%, Few Secondary Structures: 0% |
+ | **Transmembrane Prediction** | Membrane: 9%, Soluble: 91% | Membrane: 9%, Soluble: 91% |
+ | **Conservation** | High: 37%, Low: 33% | High: 38%, Low: 33% |
+ | **Localization** | Cytop.: 66%, Nucleus: 4%, Extracellular: 6%, PM: 4%, ER: 11%, Lysosome/Vacuole: 1%, Mito.: 6%, Plastid: 1%, Golgi: 1%, Perox.: 1% | Cytop.: 85%, Nucleus: 2%, Extracellular: 6%, PM: 1%, ER: 6%, Lysosome/Vacuole: 0%, Mito.: 4%, Plastid: 0%, Golgi: 0%, Perox.: 0% |
+ <br/><br/>
+
+ - **Sequence similarity to the natural space**
+
+ | Set | Identity | Alignment length |
+ | :--- | :---: | :---: |
+ | **Generated** | 74.29% | 406.0 |
+ | **Selection (<70%)** | 57.20% | 338.1 |
+ <br/><br/>
+
+ ## **How to generate from REXzyme**
+ REXzyme can be used with the HuggingFace transformers Python package.
+ Detailed installation instructions can be found [here](https://huggingface.co/docs/transformers/installation).
+
+ Since REXzyme has been trained with a machine-translation objective, users have to specify a chemical reaction in SMILES format.
+
+ Disclaimer: Although perplexity is computed here, it is not the best selection criterion. The BLEU score is usually used to evaluate translations, but it would enforce high sequence similarity and thus work against *de novo* design. We recommend generating many sequences and selecting them by pLDDT as well as low identity to natural sequences.
+
+ ```python
+ import math
+ import torch
+ from tqdm import tqdm
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ tokenizer_aa = AutoTokenizer.from_pretrained('/path/to/tokenizer_aa')
+ tokenizer_smiles = AutoTokenizer.from_pretrained('/path/to/tokenizer_smiles')
+
+ model = T5ForConditionalGeneration.from_pretrained("/path/to/REXzyme").cuda()
+ print(model.generation_config)
+
+ reactions = ["NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O.*N[C@@H](CO)C(*)=O>>NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O.[H+].*N[C@@H](COP(=O)([O-])[O-])C(*)=O"]
+
+ def calculatePerplexity(inputs, model):
+     '''Compute the perplexity of a generated sequence (scored against itself).'''
+     # Decode and re-encode the generated tokens, then drop padding.
+     a = tokenizer_aa.decode(inputs)
+     b = tokenizer_aa(a, return_tensors="pt").input_ids.to(device='cuda')
+     b = b[b != tokenizer_aa.pad_token_id].unsqueeze(0)
+     with torch.no_grad():
+         outputs = model(b, labels=b)
+     return math.exp(outputs.loss)
+
+ for reaction in tqdm(reactions):
+     # Prepend the task prefix and append the EOS token, exactly as during training.
+     input_ids = tokenizer_smiles(f"r2s{reaction}</s>", return_tensors="pt").input_ids.to(device='cuda')
+     print(f'Generating for {reaction}')
+     ppls_total = []
+     for _ in range(4):
+         outputs = model.generate(input_ids,
+                                  top_k=15,
+                                  top_p=0.92,
+                                  repetition_penalty=1.2,
+                                  max_length=1024,
+                                  do_sample=True,
+                                  num_return_sequences=25)
+         # (sequence, perplexity, decoded length) for each sample
+         ppls = [(tokenizer_aa.decode(output, skip_special_tokens=True),
+                  calculatePerplexity(output, model),
+                  len(tokenizer_aa.decode(output))) for output in tqdm(outputs)]
+         ppls_total.extend(ppls)
+ ```
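+
+ Following the disclaimer above, a minimal sketch of a first filtering pass over `ppls_total` (perplexity ranking only; pLDDT and identity screens, e.g. with ESMFold and MMseqs2, would follow):
+
+ ```python
+ # Sort the (sequence, perplexity, length) tuples by ascending perplexity
+ # and keep the best candidates for downstream structural screening.
+ ranked = sorted(ppls_total, key=lambda t: t[1])
+ for seq, ppl, length in ranked[:5]:
+     print(f"ppl={ppl:.2f}  len={length}  {seq[:60]}...")
+ ```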
+
+ ## **A word of caution**

+ - We have not yet fully tested the ability of the model to generate new-to-nature enzymes, i.e.,
+ with chemical reactions that do not appear in nature (and hence not in the training set). While this is the intended objective of our work, it is very much work in progress. We'll update the model and documentation shortly.