---
license: apache-2.0
metrics:
- accuracy
- bleu
pipeline_tag: text2text-generation
tags:
- chemistry
- biology
- medical
- smiles
- iupac
- text-generation-inference
widget:
- text: CCO
  example_title: ethanol
---

# SMILES2IUPAC-canonical-base

SMILES2IUPAC-canonical-base is designed to translate SMILES representations of chemical structures into IUPAC names.

## Model Details

### Model Description

SMILES2IUPAC-canonical-base is based on the MT5 model, modified to use separate tokenizers for the encoder and decoder.

- **Developed by:** Knowledgator Engineering
- **Model type:** Encoder-Decoder with attention mechanism
- **Language(s) (NLP):** SMILES, IUPAC (English)
- **License:** Apache License 2.0

### Model Sources

- **Paper:** coming soon
- **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)

## Quickstart

First, install the library:

```commandline
pip install chemical-converters
```

### SMILES to IUPAC

#### ! Preferred IUPAC style

To choose the preferred IUPAC style, place style tokens before your SMILES sequence.

| Style Token | Description |
|-------------|-------------|
| `` | The most commonly used name of the substance; sometimes a mixture of traditional and systematic styles |
| `` | Fully systematic style without trivial names |
| `` | Style based on trivial names for parts of the substance |

#### To perform simple translation, follow the example:

```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")

# A single SMILES string
print(converter.smiles_to_iupac('CCO'))
# A list of SMILES strings
print(converter.smiles_to_iupac(['CCO', 'CCO', 'CCO']))
```

```text
['ethanol']
['ethanol', 'ethanol', 'ethanol']
```

#### Processing in batches:

```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")

# Translate ten copies of buta-1,3-diene in batch mode
print(converter.smiles_to_iupac(["C=CC=C" for _ in range(10)], num_beams=1,
                                process_in_batch=True, batch_size=1000))
```

```text
['buta-1,3-diene', 'buta-1,3-diene'...]
```

#### Validating SMILES to IUPAC translations

Translations can be validated by translating the predicted IUPAC name back into SMILES and calculating the Tanimoto similarity between the fingerprints of the two molecules.

````python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")

# validate=True also reports the Tanimoto similarity of the round trip
print(converter.smiles_to_iupac('CCO', validate=True))
````

````text
['ethanol'] 1.0
````

The higher the Tanimoto similarity, the more likely the prediction is correct.
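
To make the similarity score concrete, here is a minimal sketch of the Tanimoto coefficient itself: the number of shared fingerprint bits divided by the number of bits set in either fingerprint. It is for illustration only and is not the library's implementation. The `tanimoto` helper and the bit sets are hypothetical, and the source does not state which fingerprint type the library uses.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    shared = len(fp_a & fp_b)
    # |A ∩ B| / |A ∪ B|, with the union size computed by inclusion-exclusion
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit indices; real fingerprints would come from a
# cheminformatics toolkit applied to the input and round-trip molecules.
original = {1, 4, 7, 9}
round_trip_same = {1, 4, 7, 9}
round_trip_diff = {1, 4, 8}

print(tanimoto(original, round_trip_same))  # 1.0: identical fingerprints
print(tanimoto(original, round_trip_diff))  # 0.4: 2 shared bits out of 5 in total
```

A score of 1.0 means the two fingerprints are identical, which is the case in the ethanol example above; lower scores indicate that the round-trip molecule differs from the input.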

You can also run the validation manually:

```python
from chemicalconverters import NamesConverter

# The reverse (IUPAC to SMILES) model is used for validation
validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='CCO',
                                    validation_model=validation_model))
```

```text
1.0
```

## Bias, Risks, and Limitations

This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.

### Training Procedure

The model was trained on 100M SMILES-IUPAC pairs with lr=0.00001 and batch_size=512 for 2 epochs.

## Evaluation

| Model                               | Accuracy | BLEU-4 score | Size (MB) |
|-------------------------------------|----------|--------------|-----------|
| SMILES2IUPAC-canonical-small        | 75%      | 0.93         | 23        |
| SMILES2IUPAC-canonical-base         | 86.9%    | 0.964        | 180       |
| STOUT V2.0\*                        | 66.65%   | 0.92         | 128       |
| STOUT V2.0 (according to our tests) |          | 0.89         | 128       |

\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4

## Citation

Coming soon.

## Model Card Authors

[Mykhailo Shtopko](https://huggingface.co/BioMike)

## Model Card Contact

info@knowledgator.com