---
license: apache-2.0
metrics:
- accuracy
- bleu
pipeline_tag: text2text-generation
tags:
- chemistry
- biology
- medical
- smiles
- iupac
- text-generation-inference
widget:
- text: CCO
  example_title: ethanol
---

# SMILES2IUPAC-canonical-small

SMILES2IUPAC-canonical-small was designed to accurately translate SMILES strings into IUPAC names.

## Model Details

### Model Description

SMILES2IUPAC-canonical-small is based on the mT5 model, with optimizations that use different tokenizers for the encoder and the decoder.

- **Developed by:** Knowledgator Engineering
- **Model type:** Encoder-Decoder with attention mechanism
- **Language(s) (NLP):** SMILES, IUPAC (English)
- **License:** Apache License 2.0

### Model Sources

- **Paper:** coming soon
- **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)

## Quickstart

First, install the library:

```commandline
pip install chemical-converters
```

### SMILES to IUPAC

#### Preferred IUPAC style

To choose the preferred IUPAC style, place a style token before your SMILES sequence.

| Style Token | Description                                                                                      |
|-------------|--------------------------------------------------------------------------------------------------|
| `<BASE>`    | The most common name of the substance; sometimes a mixture of traditional and systematic styles |
| `<SYST>`    | Fully systematic style without trivial names                                                     |
| `<TRAD>`    | Style based on trivial names of the parts of the substance                                       |

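
Since a style token is just a string prefix, styled inputs can be built with a small helper. The sketch below is illustrative only: `STYLE_TOKENS` and `with_style` are hypothetical names and are not part of `chemical-converters`; only the three token strings come from the table above.

```python
# Hypothetical helper (not part of chemical-converters): prepend a style token.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str = "<BASE>") -> str:
    """Return the SMILES string prefixed with the requested style token."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {style!r}")
    return f"{style}{smiles}"


# One input per style for the same molecule (ethanol).
styled_inputs = [with_style("CCO", token) for token in STYLE_TOKENS]
print(styled_inputs)  # ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

The resulting list can be passed to `smiles_to_iupac` in the same way as the hand-written strings in the examples in this section.
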
#### To perform a simple translation, follow this example:

```python
from chemicalconverters import NamesConverter

# Load the SMILES-to-IUPAC model
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")

# Translate a single SMILES string
print(converter.smiles_to_iupac('CCO'))

# Translate a list of SMILES strings, each with its own style token
print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
```

```text
['ethanol']
['ethanol', 'ethanol', 'ethanol']
```

#### Processing in batches:

```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")

# Translate ten copies of buta-1,3-diene in batch mode
print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
                                process_in_batch=True, batch_size=1000))
```

```text
['buta-1,3-diene', 'buta-1,3-diene'...]
```

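
How the library splits its inputs when `process_in_batch=True` is not described here. As a general illustration of what batching means, the sketch below cuts a list of inputs into consecutive chunks of at most `batch_size` items. `chunked` is a hypothetical helper, not a `chemical-converters` function.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


smiles = ["<BASE>C=CC=C"] * 10
batches = list(chunked(smiles, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```
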
#### Validating SMILES to IUPAC translations

You can validate a translation by translating the predicted IUPAC name back into SMILES and calculating the Tanimoto similarity between the fingerprints of the two molecules.

```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")

# validate=True also returns the Tanimoto similarity of the round trip
print(converter.smiles_to_iupac('CCO', validate=True))
```

```text
['ethanol'] 1.0
```

The higher the Tanimoto similarity, the higher the probability that the prediction is correct.

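
For reference, the Tanimoto similarity of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below is a minimal pure-Python version that represents each fingerprint as a set of on-bit indices. The bit sets are toy values chosen for illustration, and `tanimoto` is a hypothetical helper: the fingerprint type the library uses internally is not stated in this card.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity: size of the intersection divided by size of the union."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(bits_a & bits_b) / len(union)


# Toy fingerprints given as sets of on-bit indices (illustrative values only).
fp_input = {1, 5, 9, 12}
fp_roundtrip_same = {1, 5, 9, 12}
fp_roundtrip_diff = {1, 5, 20}

print(tanimoto(fp_input, fp_roundtrip_same))  # 1.0
print(tanimoto(fp_input, fp_roundtrip_diff))  # 0.4 (2 shared bits out of 5 distinct bits)
```

A score of 1.0 means the two fingerprints are identical; lower scores mean the round-trip molecule differs from the input.
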
You can also run the validation manually:

```python
from chemicalconverters import NamesConverter

# Reverse (IUPAC to SMILES) model used for the round-trip check
validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='CCO', validation_model=validation_model))
```

```text
1.0
```

## Bias, Risks, and Limitations

This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.

### Training Procedure

The model was trained on 100M SMILES-IUPAC pairs for 2 epochs with lr=0.0003 and batch_size=1024.

## Evaluation

| Model                               | Accuracy | BLEU-4 score | Size (MB) |
|-------------------------------------|----------|--------------|-----------|
| SMILES2IUPAC-canonical-small        | 75%      | 0.93         | 23        |
| SMILES2IUPAC-canonical-base         | 86.9%    | 0.964        | 180       |
| STOUT V2.0\*                        | 66.65%   | 0.92         | 128       |
| STOUT V2.0 (according to our tests) |          | 0.89         | 128       |

\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4

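
This card does not say how the accuracy column was computed. Assuming it means the share of predicted names that exactly match the reference name, a minimal sketch looks like this. `exact_match_accuracy` is a hypothetical helper, and the names below are made-up illustrative data, not evaluation results.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference (surrounding whitespace ignored)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


# Illustrative data only: three of the four predictions match.
predictions = ["ethanol", "buta-1,3-diene", "propan-2-ol", "methanol"]
references = ["ethanol", "buta-1,3-diene", "propan-1-ol", "methanol"]
print(exact_match_accuracy(predictions, references))  # 0.75
```
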
## Citation

Coming soon.

## Model Card Authors

[Mykhailo Shtopko](https://huggingface.co/BioMike)

## Model Card Contact

info@knowledgator.com