
	---
	license: apache-2.0
	metrics:
	- accuracy
	- bleu
	pipeline_tag: text2text-generation
	tags:
	- chemistry
	- biology
	- medical
	- smiles
	- iupac
	- text-generation-inference
	widget:
	- text: ethanol
	  example_title: CCO
	---
	# IUPAC2SMILES-canonical-base

	IUPAC2SMILES-canonical-base was designed to accurately translate IUPAC chemical names to SMILES.

	## Model Details

	### Model Description

	IUPAC2SMILES-canonical-base is based on the MT5 model, modified to use different tokenizers for the encoder and the decoder.
	- Developed by: Knowledgator Engineering
	- Model type: Encoder-Decoder with attention mechanism
	- Language(s) (NLP): SMILES, IUPAC (English)
	- License: Apache License 2.0

	### Model Sources
	- Paper: coming soon
	- Demo: [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)

	## Quickstart
	First, install the library:
	```commandline
	pip install chemical-converters
	```
	### IUPAC to SMILES
	#### Simple translation:
	```python
	from chemicalconverters import NamesConverter

	converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
	print(converter.iupac_to_smiles('ethanol'))
	print(converter.iupac_to_smiles(['ethanol', 'ethanol', 'ethanol']))
	```
	```text
	['CCO']
	['CCO', 'CCO', 'CCO']
	```
	#### Processing in batches:
	```python
	from chemicalconverters import NamesConverter

	converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
	predictions = converter.iupac_to_smiles(
	    ["buta-1,3-diene"] * 10,
	    num_beams=1,
	    process_in_batch=True,
	    batch_size=1000,
	)
	print(predictions)
	```
	```text
	['<SYST>C=CC=C', '<SYST>C=CC=C'...]
	```
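	Our models also predict the IUPAC naming style, which can appear as a prefix token in the output (such as `<SYST>` in the output above). The style tokens are: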

	| Style Token | Description |
	|-------------|-------------|
	| `<BASE>` | The most common name of the substance; sometimes a mixture of traditional and systematic styles |
	| `<SYST>` | Fully systematic style, without trivial names |
	| `<TRAD>` | Style based on trivial names for parts of the substance |
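Downstream tools usually expect a bare SMILES string, so the style token may need to be separated from the prediction. The following is a minimal sketch, not part of the `chemical-converters` library: `split_style` is a hypothetical helper, and it assumes the token, when present, is always one of the three listed above and sits at the very start of the string. Because the simple-translation example returns `'CCO'` with no prefix, the helper also accepts untagged output.

```python
import re
from typing import Optional, Tuple

# One of the three documented style tokens at the very start of the string.
_STYLE_PREFIX = re.compile(r"^(<BASE>|<SYST>|<TRAD>)")


def split_style(prediction: str) -> Tuple[Optional[str], str]:
    """Hypothetical helper: return (style_token, smiles).

    style_token is None when the prediction carries no known prefix.
    """
    match = _STYLE_PREFIX.match(prediction)
    if match is None:
        return None, prediction
    return match.group(1), prediction[match.end():]


if __name__ == "__main__":
    for raw in ["<SYST>C=CC=C", "CCO"]:
        print(split_style(raw))
    # ('<SYST>', 'C=CC=C')
    # (None, 'CCO')
```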

	## Bias, Risks, and Limitations

	This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
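When comparing predictions against reference data, it can help to flag reference SMILES that carry stereochemistry or isotope labels, since the model cannot reproduce them. The sketch below is a character-level heuristic, not a SMILES parser, and `needs_isomeric_smiles` is a hypothetical name. It relies on standard SMILES notation: `@` marks tetrahedral chirality, `/` and `\` mark double-bond geometry, and a bracket atom that starts with digits (for example `[13C]`) carries an isotope mass.

```python
import re

# Bracket atom that starts with an isotope mass number, e.g. [13C] or [2H].
_ISOTOPE = re.compile(r"\[\d+[A-Za-z]")
# Tetrahedral (@, @@) and double-bond (/ and \) stereo markers.
_STEREO = re.compile(r"[@/\\]")


def needs_isomeric_smiles(smiles: str) -> bool:
    """Heuristic: True if the string uses stereo or isotope notation."""
    return bool(_ISOTOPE.search(smiles) or _STEREO.search(smiles))


if __name__ == "__main__":
    for smiles in ["CCO", "C[C@H](N)C(=O)O", "F/C=C/F", "[13CH4]"]:
        print(smiles, needs_isomeric_smiles(smiles))
```

For strict validation, a cheminformatics toolkit would be a more reliable choice than this string check.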

	### Training Procedure


	The model was trained on 100M SMILES-IUPAC pairs for 2 epochs, with a learning rate of 0.00001 (1e-5) and a batch size of 512.
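To give a rough sense of scale, the number of optimizer steps can be estimated from these figures. This is back-of-the-envelope arithmetic under assumptions the card does not state: that 512 is the effective global batch size, that there is no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math

num_examples = 100_000_000  # training pairs reported in the card
batch_size = 512            # assumed to be the effective global batch size
epochs = 2

# Assumes the last, partially filled batch of each epoch is not dropped.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 195313 390626
```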

	## Evaluation

	| Model | Accuracy | BLEU-4 score | Size (MB) |
	|-------|----------|--------------|-----------|
	| IUPAC2SMILES-canonical-small | 88.9% | 0.966 | 23 |
	| IUPAC2SMILES-canonical-base | 93.7% | 0.974 | 180 |
	| STOUT V2.0\* | 68.47% | 0.92 | 128 |

	\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
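The card does not say how accuracy is computed. One common reading for translation into canonical SMILES is exact string match between prediction and reference, which can be sketched as follows. `exact_match_accuracy` is a hypothetical helper, and exact match is an assumption here, not the authors' confirmed metric.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference (whitespace-trimmed)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ["CCO", "C=CC=C", "CC"]
    refs = ["CCO", "C=CC=C", "CCC"]
    print(f"{exact_match_accuracy(preds, refs):.3f}")  # 0.667
```

Exact match is strict: two different but chemically equivalent SMILES strings count as a miss unless both are canonicalized the same way first.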

	## Citation
	Coming soon.

	## Model Card Authors

	[Mykhailo Shtopko](https://huggingface.co/BioMike)

	## Model Card Contact

	info@knowledgator.com