---
license: apache-2.0
metrics:
- accuracy
- bleu
pipeline_tag: text2text-generation
tags:
- chemistry
- biology
- medical
- smiles
- iupac
- text-generation-inference
widget:
- text: CCO
  example_title: ethanol
---
# SMILES2IUPAC-canonical-base
SMILES2IUPAC-canonical-base translates SMILES representations of molecules into IUPAC names.
## Model Details
### Model Description
SMILES2IUPAC-canonical-base is based on the mT5 model, adapted to use different tokenizers for the encoder and the decoder.
- **Developed by:** Knowledgator Engineering
- **Model type:** Encoder-Decoder with attention mechanism
- **Language(s) (NLP):** SMILES, IUPAC (English)
- **License:** Apache License 2.0
### Model Sources
- **Paper:** coming soon
- **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)
## Quickstart
First, install the library:
```commandline
pip install chemical-converters
```
### SMILES to IUPAC
#### ! Preferred IUPAC style
To choose the preferred IUPAC style, place a style token before your SMILES sequence.
| Style Token | Description |
|-------------|----------------------------------------------------------------------------------------------------|
| `<BASE>`    | The most common name of the substance; sometimes a mixture of traditional and systematic styles    |
| `<SYST>`    | A fully systematic name without trivial names                                                      |
| `<TRAD>`    | A name built from the trivial names of the substance's parts                                       |
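As the table suggests, a style token is simply prepended to the SMILES string. The sketch below shows one way to do that consistently; `STYLE_TOKENS` and `with_style` are hypothetical names used for illustration and are not part of `chemical-converters`.

```python
# Hypothetical helper, not part of chemical-converters.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str = "<BASE>") -> str:
    """Prefix a SMILES string with one of the style tokens from the table."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {style!r}")
    return f"{style}{smiles}"


# Build one styled input per token for the same molecule.
styled = [with_style("CCO", token) for token in STYLE_TOKENS]
print(styled)  # ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

A list built this way can be passed to `converter.smiles_to_iupac` in the same way as the hand-written list in the next example.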
#### To perform simple translation, follow the example:
```python
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac('CCO'))
print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
```
```text
['ethanol']
['ethanol', 'ethanol', 'ethanol']
```
#### Processing in batches:
```python
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
process_in_batch=True, batch_size=1000))
```
```text
['buta-1,3-diene', 'buta-1,3-diene'...]
```
#### Validating SMILES to IUPAC translations
You can validate a translation by translating the predicted IUPAC name back into SMILES
and calculating the Tanimoto similarity between the fingerprints of the original and the round-trip molecules.
````python
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac('CCO', validate=True))
````
````text
['ethanol'] 1.0
````
The higher the Tanimoto similarity, the more likely it is that the prediction is correct.
You can also run the validation manually:
```python
from chemicalconverters import NamesConverter
validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
# input_sequence is the original SMILES; predicted_sequence is the IUPAC name to check.
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='ethanol', validation_model=validation_model))
```
```text
1.0
```
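For intuition about the score, the Tanimoto similarity of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on plain Python sets of bit indices. The fingerprint values are made up for illustration, and this is not the library's own fingerprinting code.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Made-up bit sets, for illustration only.
original = {1, 5, 9, 12}
round_trip_same = {1, 5, 9, 12}
round_trip_diff = {1, 5, 20}

print(tanimoto(original, round_trip_same))  # 1.0
print(tanimoto(original, round_trip_diff))  # 0.4
```

A score of 1.0 means the round-trip molecule has the same fingerprint as the input; lower scores indicate that the two structures diverge.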
## Bias, Risks, and Limitations
This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
### Training Procedure
The model was trained on 100M SMILES-IUPAC pairs for 2 epochs, with a learning rate of 1e-5 and a batch size of 512.
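As a rough back-of-the-envelope check, these figures imply on the order of 390,000 optimizer steps. This assumes one optimizer step per batch of 512 (no gradient accumulation), which the card does not state.

```python
import math

num_examples = 100_000_000  # from the card
batch_size = 512            # from the card
epochs = 2                  # from the card

# Assumption: one optimizer step per batch, with the last partial batch kept.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 195313
print(total_steps)      # 390626
```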
## Evaluation
| Model                               | Accuracy | BLEU-4 score | Size (MB) |
|-------------------------------------|----------|--------------|-----------|
| SMILES2IUPAC-canonical-small        | 75%      | 0.93         | 23        |
| SMILES2IUPAC-canonical-base         | 86.9%    | 0.964        | 180       |
| STOUT V2.0\*                        | 66.65%   | 0.92         | 128       |
| STOUT V2.0 (according to our tests) |          | 0.89         | 128       |
\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
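The card does not define how accuracy is computed. One common reading for sequence translation is exact match between the predicted and reference names; the sketch below implements that reading as an assumption, using made-up predictions rather than the evaluation data. `exact_match_accuracy` is a hypothetical helper, not part of `chemical-converters`.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to their reference (assumed metric definition)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


# Made-up examples, for illustration only.
preds = ["ethanol", "buta-1,3-diene", "propan-2-ol", "benzene"]
refs = ["ethanol", "buta-1,3-diene", "propan-1-ol", "benzene"]

print(exact_match_accuracy(preds, refs))  # 0.75
```

Exact match is strict: a name that differs by a single locant counts as wrong, which is one reason BLEU-4 is reported alongside it.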
## Citation
Coming soon.
## Model Card Authors
[Mykhailo Shtopko](https://huggingface.co/BioMike)
## Model Card Contact
info@knowledgator.com