metadata
license: apache-2.0
metrics:
  - accuracy
  - bleu
pipeline_tag: text2text-generation
tags:
  - chemistry
  - biology
  - medical
  - smiles
  - iupac
  - text-generation-inference
widget:
  - text: CCO
    example_title: ethanol

SMILES2IUPAC-canonical-base

SMILES2IUPAC-canonical-base was designed to accurately translate SMILES representations of molecules into IUPAC names.

Model Details

Model Description

SMILES2IUPAC-canonical-base is based on the mT5 model, modified to use different tokenizers for the encoder and the decoder.

  • Developed by: Knowledgator Engineering
  • Model type: Encoder-Decoder with attention mechanism
  • Language(s) (NLP): SMILES, IUPAC (English)
  • License: Apache License 2.0

Model Sources

Quickstart

First, install the library:

pip install chemical-converters

SMILES to IUPAC

! Preferred IUPAC style

To choose the preferred IUPAC style, place style tokens before your SMILES sequence.

| Style Token | Description |
|---|---|
| <BASE> | The most commonly used name of the substance; sometimes a mixture of the traditional and systematic styles |
| <SYST> | The fully systematic style, without trivial names |
| <TRAD> | The style based on trivial names for parts of the substance |
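As a minimal sketch of how a style token is attached to an input, the snippet below prefixes a SMILES string with one of the three tokens from the table. The helper name `with_style` is hypothetical and not part of the `chemical-converters` library; it only builds the input strings.

```python
# Style tokens listed in the table above.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str = "<BASE>") -> str:
    """Prefix a SMILES string with a style token.

    Hypothetical helper for illustration; not part of the library.
    """
    if style not in STYLE_TOKENS:
        raise ValueError(f"Unknown style token: {style!r}")
    return f"{style}{smiles}"


# One model input per style for the same molecule (ethanol).
inputs = [with_style("CCO", style) for style in STYLE_TOKENS]
print(inputs)  # ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

The resulting strings can be passed to the converter in place of hand-written inputs.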

To perform simple translation, follow the example:

from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac('CCO'))
# ['ethanol']
print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
# ['ethanol', 'ethanol', 'ethanol']

Processing in batches:

from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
                                process_in_batch=True, batch_size=1000))
# ['buta-1,3-diene', 'buta-1,3-diene'...]

Validating SMILES to IUPAC translations

You can validate a translation by translating the predicted IUPAC name back into SMILES and calculating the Tanimoto similarity between the fingerprints of the original and the reconstructed molecule.

from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac('CCO', validate=True))
# ['ethanol'] 1.0

The higher the Tanimoto similarity, the more likely it is that the prediction is correct.
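To make the score concrete, here is a minimal sketch of the Tanimoto similarity itself, computed over sets of "on" fingerprint bits as the size of the intersection divided by the size of the union. The bit positions are made up for illustration, and this is not the library's implementation; the fingerprint type it uses is not stated here.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    shared = len(bits_a & bits_b)
    return shared / len(bits_a | bits_b)


# Toy fingerprints (made-up bit positions, not real molecular fingerprints).
original = {1, 4, 7, 9}
round_trip_same = {1, 4, 7, 9}   # back-translation reproduced the molecule
round_trip_diff = {1, 4, 8}      # back-translation gave a different molecule

print(tanimoto(original, round_trip_same))  # 1.0
print(tanimoto(original, round_trip_diff))  # 0.4
```

A score of 1.0 means the two fingerprints are identical; lower scores indicate that the round trip produced a different structure.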

You can also run the validation manually:

from chemicalconverters import NamesConverter

validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='CCO', validation_model=validation_model))
# 1.0

Bias, Risks, and Limitations

This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.

Training Procedure

The model was trained on 100M SMILES-IUPAC pairs for 2 epochs, with a learning rate of 0.00001 and a batch size of 512.
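For a sense of scale, the figures above imply roughly 390 thousand optimizer steps. The sketch below works this out under assumptions not stated in the source: that "100M" is an exact count, that each batch is one optimizer step (no gradient accumulation), and that the final partial batch is kept.

```python
import math

num_examples = 100_000_000  # "100M" taken as an exact count (assumption)
batch_size = 512
epochs = 2

# One optimizer step per batch; the last, partially filled batch is kept (assumption).
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 195313
print(total_steps)      # 390626
```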

Evaluation

| Model | Accuracy | BLEU-4 score | Size (MB) |
|---|---|---|---|
| SMILES2IUPAC-canonical-small | 75% | 0.93 | 23 |
| SMILES2IUPAC-canonical-base | 86.9% | 0.964 | 180 |
| STOUT V2.0* | 66.65% | 0.92 | 128 |
| STOUT V2.0 (according to our tests) | not reported | 0.89 | 128 |

*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
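The source does not define how accuracy was measured. One common reading is exact-match accuracy, the fraction of predicted names identical to the reference name; the sketch below illustrates that reading only, as an assumption. The function name and the example names are illustrative and are not taken from the evaluation set.

```python
def exact_match_accuracy(predictions: list, references: list) -> float:
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


# Illustrative names only; not results from the model.
predicted = ["ethanol", "buta-1,3-diene", "propan-2-ol", "methane"]
expected = ["ethanol", "buta-1,3-diene", "propan-1-ol", "methane"]

print(exact_match_accuracy(predicted, expected))  # 0.75
```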

Citation

Coming soon.

Model Card Authors

Mykhailo Shtopko

Model Card Contact

info@knowledgator.com