BioMike committed on
Commit f9deec3 · verified · 1 Parent(s): a7a429e

Upload README.md

Files changed (1)
  1. README.md +126 -0
README.md CHANGED
@@ -1,3 +1,129 @@
  ---
  license: apache-2.0
+ metrics:
+ - accuracy
+ - bleu
+ pipeline_tag: text2text-generation
+ tags:
+ - chemistry
+ - biology
+ - medical
+ - smiles
+ - iupac
+ - text-generation-inference
+ widget:
+ - text: CCO
+   example_title: ethanol
  ---
+ # SMILES2IUPAC-canonical-small
+
+ SMILES2IUPAC-canonical-small is designed to translate SMILES representations of molecules into IUPAC names.
+
+ ## Model Details
+
+ ### Model Description
+
+ SMILES2IUPAC-canonical-small is based on the mT5 model, with optimizations that use different tokenizers for the encoder and the decoder.
+ - **Developed by:** Knowledgator Engineering
+ - **Model type:** Encoder-Decoder with attention mechanism
+ - **Language(s) (NLP):** SMILES, IUPAC (English)
+ - **License:** Apache License 2.0
+
+ ### Model Sources
+ - **Paper:** coming soon
+ - **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)
+
+ ## Quickstart
+ First, install the library:
+ ```commandline
+ pip install chemical-converters
+ ```
+ ### SMILES to IUPAC
+ #### Preferred IUPAC style
+ To choose the preferred IUPAC style, place a style token before
+ your SMILES sequence.
+
+ | Style Token | Description |
+ |-------------|-------------|
+ | `<BASE>` | The most commonly known name of the substance; sometimes a mixture of the traditional and systematic styles |
+ | `<SYST>` | The fully systematic style, without trivial names |
+ | `<TRAD>` | The style based on trivial names of the parts of the substance |
+
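As a minimal sketch of how the style tokens in the table can be attached to an input, the snippet below builds one prefixed string per style. The helper name `with_style` and the constant `STYLE_TOKENS` are hypothetical and are not part of the `chemical-converters` library; only the three token strings come from the table.

```python
# Style tokens copied from the table above.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str) -> str:
    """Return the SMILES string prefixed with a style token (hypothetical helper)."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {style!r}")
    return f"{style}{smiles}"


# One input per style for the same molecule (ethanol).
inputs = [with_style("CCO", token) for token in STYLE_TOKENS]
print(inputs)  # ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

The resulting list has the same shape as the list passed to `smiles_to_iupac` in the translation example.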
+ #### To perform a simple translation, follow this example:
+ ```python
+ from chemicalconverters import NamesConverter
+
+ converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
+ # A single SMILES string, without a style token
+ print(converter.smiles_to_iupac('CCO'))
+ # A list of SMILES strings, each prefixed with a style token
+ print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
+ ```
+ ```text
+ ['ethanol']
+ ['ethanol', 'ethanol', 'ethanol']
+ ```
+ #### Processing in batches:
+ ```python
+ from chemicalconverters import NamesConverter
+
+ converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
+ print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
+                                 process_in_batch=True, batch_size=1000))
+ ```
+ ```text
+ ['buta-1,3-diene', 'buta-1,3-diene'...]
+ ```
+ #### Validating SMILES to IUPAC translations
+ You can validate a translation by translating the predicted IUPAC name back into SMILES
+ and calculating the Tanimoto similarity between the fingerprints of the two molecules.
+ ```python
+ from chemicalconverters import NamesConverter
+
+ converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
+ print(converter.smiles_to_iupac('CCO', validate=True))
+ ```
+ ```text
+ ['ethanol'] 1.0
+ ```
+ The higher the Tanimoto similarity, the more likely it is that the prediction was correct.
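For intuition about the score, here is a small sketch of the Tanimoto similarity itself, T = |A ∩ B| / |A ∪ B|, computed on two sets of "on" fingerprint bits. It is plain Python for illustration only: the README does not say how the library generates fingerprints, and the bit positions below are made up rather than derived from real molecules.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints match
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Made-up bit positions, for illustration only (not real fingerprints).
original = {1, 5, 9, 12}
round_trip_same = {1, 5, 9, 12}
round_trip_diff = {1, 5, 20}

print(tanimoto(original, round_trip_same))  # 1.0
print(tanimoto(original, round_trip_diff))  # 2 shared / 5 in the union = 0.4
```

A score of 1.0 means the two fingerprints are identical, which is what the `validate=True` example reports for ethanol.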
+
+ You can also run the validation manually:
+ ```python
+ from chemicalconverters import NamesConverter
+
+ validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
+ print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='CCO', validation_model=validation_model))
+ ```
+ ```text
+ 1.0
+ ```
+
+ ## Bias, Risks, and Limitations
+
+ This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
+
+ ### Training Procedure
+
+ The model was trained on 100M SMILES-IUPAC pairs for 2 epochs with a learning rate of 0.0003 and a batch size of 1024.
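To put these numbers in perspective, the rough arithmetic below estimates the number of optimizer steps they imply. It rests on assumptions the README does not state: that "100M" means exactly 100,000,000 pairs seen per epoch, that 1024 is the global batch size with no gradient accumulation, and that the last partial batch is kept.

```python
import math

num_pairs = 100_000_000  # "100M" read as exactly 1e8 (assumption)
batch_size = 1024        # treated as the global batch size (assumption)
epochs = 2

# Round up so the final, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 97657
print(total_steps)      # 195314
```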
+
+ ## Evaluation
+
+ | Model                               | Accuracy | BLEU-4 score | Size (MB) |
+ |-------------------------------------|----------|--------------|-----------|
+ | SMILES2IUPAC-canonical-small        | 75%      | 0.93         | 23        |
+ | SMILES2IUPAC-canonical-base         | 86.9%    | 0.964        | 180       |
+ | STOUT V2.0\*                        | 66.65%   | 0.92         | 128       |
+ | STOUT V2.0 (according to our tests) |          | 0.89         | 128       |
+
+ \*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
+
+ ## Citation
+ Coming soon.
+
+ ## Model Card Authors
+
+ [Mykhailo Shtopko](https://huggingface.co/BioMike)
+
+ ## Model Card Contact
+
+ info@knowledgator.com