knowledgator
/

SMILES2IUPAC-canonical-base

 ---
 license: apache-2.0
+metrics:
+- accuracy
+- bleu
+pipeline_tag: text2text-generation
+tags:
+- chemistry
+- biology
+- medical
+- smiles
+- iupac
+- text-generation-inference
+widget:
+- text: CCO
+  example_title: ethanol
 ---
+# SMILES2IUPAC-canonical-small
+SMILES2IUPAC-canonical-small translates SMILES representations of chemical structures into IUPAC names.
+## Model Details
+### Model Description
+SMILES2IUPAC-canonical-small is based on the MT5 model, modified to use different tokenizers for the encoder and the decoder.
+- **Developed by:** Knowledgator Engineering
+- **Model type:** Encoder-Decoder with attention mechanism
+- **Language(s) (NLP):** SMILES, IUPAC (English)
+- **License:** Apache License 2.0
+### Model Sources
+- **Paper:** coming soon
+- **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)
+## Quickstart
+First, install the library:
+```commandline
+pip install chemical-converters
+```
+### SMILES to IUPAC
+#### ! Preferred IUPAC style
+To choose the preferred IUPAC style, prepend one of the following style tokens
+to your SMILES sequence.
+| Style Token | Description                                                                                 |
+|-------------|---------------------------------------------------------------------------------------------|
+| `<BASE>`    | The most common name of the substance; sometimes a mix of traditional and systematic styles |
+| `<SYST>`    | Fully systematic style, without trivial names                                               |
+| `<TRAD>`    | Style based on trivial names for parts of the substance                                     |
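As a minimal sketch of how the style tokens from the table above combine with a SMILES string, the snippet below builds the prefixed inputs with plain string handling. The helper names `with_style` and `all_styles` are hypothetical and are not part of the `chemical-converters` library.

```python
# Style tokens listed in the table above.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles, style="<BASE>"):
    """Prepend a style token to a SMILES string (hypothetical helper)."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {style!r}")
    return f"{style}{smiles}"


def all_styles(smiles):
    """Return the SMILES string once per known style token."""
    return [with_style(smiles, token) for token in STYLE_TOKENS]


print(all_styles("CCO"))
# ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

The resulting list can be passed to the converter in the same way as the hand-written list in the translation example.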
+#### To perform simple translation, follow the example:
+```python
+from chemicalconverters import NamesConverter
+converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
+print(converter.smiles_to_iupac('CCO'))
+print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
+```
+```text
+['ethanol']
+['ethanol', 'ethanol', 'ethanol']
+```
+#### Processing in batches:
+```python
+from chemicalconverters import NamesConverter
+converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
+print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
+                                process_in_batch=True, batch_size=1000))
+```
+```text
+['buta-1,3-diene', 'buta-1,3-diene'...]
+```
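How the library batches inputs internally is not described here. The sketch below only illustrates the general idea behind a `batch_size` argument: splitting the input list into consecutive chunks of at most that many items. The function name `chunked` is a hypothetical helper, not a library call.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


smiles = ["<BASE>C=CC=C"] * 10
batches = list(chunked(smiles, batch_size=4))
print([len(batch) for batch in batches])
# [4, 4, 2]
```

With `batch_size=1000` and only ten inputs, as in the example above, everything would fit into a single chunk.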
+#### Validating SMILES to IUPAC translations
+You can validate a translation by translating the predicted IUPAC name back into SMILES
+and calculating the Tanimoto similarity between the fingerprints of the two molecules.
+```python
+from chemicalconverters import NamesConverter
+converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
+print(converter.smiles_to_iupac('CCO', validate=True))
+```
+```text
+['ethanol'] 1.0
+```
+The higher the Tanimoto similarity, the more likely it is that the prediction is correct.
+You can also run the validation manually:
+```python
+from chemicalconverters import NamesConverter
+validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
+print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='CCO', validation_model=validation_model))
+```
+```text
+1.0
+```
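For reference, the Tanimoto similarity of two fingerprints is the number of shared on-bits divided by the number of on-bits present in either fingerprint. The sketch below computes it on fingerprints represented as sets of bit indices. The example bit indices are made up for illustration, and how the library itself generates fingerprints is not stated here; in practice they would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Toy fingerprints (hypothetical bit indices, not real molecule data).
original = {1, 4, 7, 9}
round_trip_same = {1, 4, 7, 9}
round_trip_diff = {1, 4, 8}

print(tanimoto(original, round_trip_same))  # 1.0
print(tanimoto(original, round_trip_diff))  # 0.4
```

A score of 1.0 means the two fingerprints are identical, which matches the output shown for the ethanol example above.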
+## Bias, Risks, and Limitations
+This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
+### Training Procedure
+The model was trained on 100M examples of SMILES-IUPAC pairs with lr=0.0003, batch_size=1024 for 2 epochs.
+## Evaluation
+| Model                               | Accuracy | BLEU-4 score | Size (MB) |
+|-------------------------------------|----------|--------------|-----------|
+| SMILES2IUPAC-canonical-small        | 75%      | 0.93         | 23        |
+| SMILES2IUPAC-canonical-base         | 86.9%    | 0.964        | 180       |
+| STOUT V2.0\*                        | 66.65%   | 0.92         | 128       |
+| STOUT V2.0 (according to our tests) |          | 0.89         | 128       |
+
+\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
+## Citation
+Coming soon.
+## Model Card Authors
+[Mykhailo Shtopko](https://huggingface.co/BioMike)
+## Model Card Contact
+info@knowledgator.com