Files changed (1)
README.md CHANGED (+8 −7)
@@ -30,14 +30,14 @@ REXzyme (Reaction to Enzyme) (manuscript in preparation) is a translation machin
 It is possible to provide fine-grained input at the substrate level.
 Akin to how translation machines have learned to translate between complex language pairs with great success,
 often diverging in their representation at the character level (Japanese - English), we posit that an advanced architecture will
- be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of xx reactions and yy enzyme pairs and it produces
+ be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of 2480 reactions and ~32M reaction-enzyme pairs, and it produces
 sequences that putatively perform their intended reactions.

 To run it, you will need to provide a reaction in SMILES format (Simplified Molecular-Input Line-Entry System), which you can generate online here: https://cactus.nci.nih.gov/chemical/structure.

- After converting each of the reaction components you should combine them in the following scheme : ReactantA.ReactantB>AgentA>ProductA.ProductB
- Additionally prepend the task suffix "r2s" and append the eos token "</s>"
- e.g. for the carbonic anhydrase "r2sO.COO>>HCOOO.[H+]</s>"
+ After converting each of the reaction components, combine them in the following scheme: ```ReactantA.ReactantB>AgentA>ProductA.ProductB```<br/>
+ Additionally, prepend the task prefix ```r2s``` and append the eos token ```</s>```,<br/>
+ e.g., for carbonic anhydrase: ```r2sO.COO>>HCOOO.[H+]</s>```

 or via this simple python script:

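A minimal sketch of the input formatting this hunk describes (the README's own script falls outside the diff context); the `format_reaction` helper is illustrative, not part of the repository:

```python
# Build a REXzyme input string following the scheme described above:
# "r2s" + Reactants>Agents>Products + "</s>"
def format_reaction(reactants, products, agents=()):
    left = ".".join(reactants)
    middle = ".".join(agents)
    right = ".".join(products)
    return f"r2s{left}>{middle}>{right}</s>"

# Carbonic anhydrase example from the README (no agents, hence the empty middle field):
print(format_reaction(["O", "COO"], ["HCOOO", "[H+]"]))
# -> r2sO.COO>>HCOOO.[H+]</s>
```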
@@ -75,9 +75,9 @@ and contains 48 (24 el/ 24 dl) layers with a model dimensionality of 1024, total

 REXzyme is a translation machine trained on a portion of the RHEA database containing 31,970,152 reaction-enzyme pairs.
 The pre-training was done on pairs of SMILES and amino acid sequences, tokenized with a char-level
- Sentencepiece tokenizer. Note that two seperate tokenizers were used for input and labels.
+ SentencePiece tokenizer. Note that two separate tokenizers were used for the input (smiles_tokenizer) and the labels (aa_tokenizer).

- REXzyme was pre-trained with a supervised translation objective i.e., the model learned to use the continous representation of the reaction from the encoder to autoregressivly (causual language modeling) produce the output trying to match the shifted right target enzyme sequence. Hence, the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.
+ REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to use the continuous encoder representation of the reaction to autoregressively (causal language modeling) produce the output one token (amino acid) at a time, trying to match the shifted-right target enzyme sequence. Hence, the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.

 There are stark differences in the number of members among reaction classes. Since we are tokenizing the reaction SMILES at the char level, classes with few reactions can profit from the knowledge gained from classes catalyzing similar reactions that have many members.

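A hedged sketch of this training objective, assuming a T5-style encoder-decoder from Hugging Face transformers; the checkpoint and tokenizer paths are placeholders, and the amino-acid fragment is merely illustrative:

```python
# One supervised translation training step (teacher forcing).
# Paths and names below are placeholders, not released REXzyme artifacts.
from transformers import PreTrainedTokenizerFast, T5ForConditionalGeneration

smiles_tokenizer = PreTrainedTokenizerFast.from_pretrained("path/to/smiles_tokenizer")
aa_tokenizer = PreTrainedTokenizerFast.from_pretrained("path/to/aa_tokenizer")
model = T5ForConditionalGeneration.from_pretrained("path/to/rexzyme")

# The reaction (input) and the enzyme sequence (labels) use separate char-level tokenizers.
inputs = smiles_tokenizer("r2sO.COO>>HCOOO.[H+]</s>", return_tensors="pt")
labels = aa_tokenizer("MSHHWGYGKHNGPEHWHKDF", return_tensors="pt").input_ids

# Passing labels makes the model shift them right internally and compute the
# causal-LM cross-entropy between decoder outputs and the target sequence.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
```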
@@ -87,6 +87,7 @@ The figure below summarizes the process of training: (add figure) [STILL MISSING
 ## **Model Performance**

 - **Dataset curation**
+ We converted the reactions from RXN format to SMILES strings, including only left-to-right reactions. The enzyme sequences were truncated to 1024 residues. Enzymes catalyzing more than one reaction were given multiple enzyme-reaction entries.
 <br/><br/>
 - **General descriptors**

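A hedged sketch of the curation step added in this hunk. RDKit is an assumption here (the README does not name the conversion tool), and the file paths and sequence are placeholders:

```python
# Dataset curation: RXN record -> reaction SMILES, plus sequence truncation.
from rdkit.Chem import AllChem

def curate_pair(rxn_path: str, enzyme_seq: str) -> tuple[str, str]:
    rxn = AllChem.ReactionFromRxnFile(rxn_path)   # parse the RXN record
    smiles = AllChem.ReactionToSmiles(rxn)        # "reactants>agents>products"
    return smiles, enzyme_seq[:1024]              # truncate to 1024 residues

# An enzyme catalyzing two reactions yields two enzyme-reaction entries:
pairs = [curate_pair(p, "MSHHWGYGKHNGPEHWHKDF") for p in ("rxn/a.rxn", "rxn/b.rxn")]
```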
@@ -94,7 +95,7 @@ The figure below summarizes the process of training: (add figure) [STILL MISSING
 | :--- | :----: | ---: |
 | **IUPRED3 (ordered)** | 99.9% | 99.9% |
 | **ESMFold** | 85.03 | 71.59 (selected: 79.82) |
- | **FlDPnn** | missing | missing |
+ | **FlDPnn** | missing | 0.0929 |
 | **PSIpred** | missing | missing |
 <br/><br/>