IUPAC2SMILES-canonical-base was designed to accurately translate IUPAC chemical names to SMILES.
IUPAC2SMILES-canonical-base is based on the MT5 model, optimized by using different tokenizers for the encoder and the decoder.
First, install the library:
pip install chemical-converters
Then convert a single name or a list of names:
from chemicalconverters import NamesConverter

# Load the pretrained converter
converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
# A single name
print(converter.iupac_to_smiles('ethanol'))
# A list of names
print(converter.iupac_to_smiles(['ethanol', 'ethanol', 'ethanol']))
Output:
['CCO']
['CCO', 'CCO', 'CCO']
Processing in batches:
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
# Convert ten names in one call, batched, with greedy decoding (num_beams=1)
print(converter.iupac_to_smiles(["buta-1,3-diene" for _ in range(10)], num_beams=1,
                                process_in_batch=True, batch_size=1000))
Output:
['<SYST>C=CC=C', '<SYST>C=CC=C'...]
Our models also predict IUPAC styles from the table:
Style Token | Description |
---|---|
`<BASE>` | The best-known name of the substance; sometimes a mixture of the traditional and systematic styles |
`<SYST>` | The fully systematic style, without trivial names |
`<TRAD>` | The style based on trivial names of the parts of the substance |
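
The batch output shown earlier carries a style token in front of each SMILES string, as in `<SYST>C=CC=C`. A minimal sketch of separating the two, assuming the token only ever appears as a prefix and is one of the three listed in the table; `split_style_token` is a hypothetical helper, not part of the chemical-converters library:

```python
import re

# Style tokens listed in the table: <BASE>, <SYST>, <TRAD>.
STYLE_TOKEN = re.compile(r"^<(BASE|SYST|TRAD)>")


def split_style_token(prediction):
    """Return (style, smiles); style is None when no token prefix is present."""
    match = STYLE_TOKEN.match(prediction)
    if match is None:
        return None, prediction
    return match.group(1), prediction[match.end():]


print(split_style_token("<SYST>C=CC=C"))
print(split_style_token("CCO"))
```

Returning `None` for the style keeps the helper usable on outputs that have no prefix, such as the `['CCO']` example.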
This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
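
When comparing model output against reference data, it can help to flag references that contain stereochemistry or isotope labels, since the model is not expected to reproduce them. The sketch below is a plain string heuristic, not a SMILES parser, and assumes standard SMILES notation; `is_isomeric_or_isotopic` is a hypothetical helper name:

```python
import re

# '@' marks tetrahedral chirality; '/' and '\' mark double-bond geometry.
STEREO = re.compile(r"[@/\\]")
# An isotope is written as a mass number right after '[', e.g. [13C] or [2H].
ISOTOPE = re.compile(r"\[\d+")


def is_isomeric_or_isotopic(smiles):
    """Heuristically detect stereo or isotope markers in a SMILES string."""
    return bool(STEREO.search(smiles) or ISOTOPE.search(smiles))


for s in ["CCO", "C[C@H](N)C(=O)O", "F/C=C/F", "[13CH4]"]:
    print(s, is_isomeric_or_isotopic(s))
```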
The model was trained on 100M SMILES-IUPAC pairs for 2 epochs with lr=0.00001 and batch_size=512.
Model | Accuracy | BLEU-4 score | Size (MB) |
---|---|---|---|
IUPAC2SMILES-canonical-small | 88.9% | 0.966 | 23 |
IUPAC2SMILES-canonical-base | 93.7% | 0.974 | 180 |
STOUT V2.0* | 68.47% | 0.92 | 128 |
*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
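
The table does not state how accuracy was computed. One common reading is the share of predictions that match the reference SMILES string exactly; the sketch below assumes that definition and is not necessarily the evaluation used for the numbers above. `exact_match_accuracy` is a hypothetical helper:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


preds = ["CCO", "C=CC=C", "CC"]
refs = ["CCO", "C=CC=C", "CCC"]
print(exact_match_accuracy(preds, refs))  # two of three match
```

Exact string matching only makes sense when both sides use the same canonical form; otherwise equivalent SMILES for the same molecule would count as errors.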
Coming soon.