---
license: mit
tags:
- chemistry
- smiles
widget:
- text: "^"
  example_title: Sample molecule | SMILES
---
# Model Card for hogru/MolReactGen-GuacaMol-Molecules
MolReactGen is a model that generates molecules in SMILES format (this model) and reaction templates in SMARTS format.
## Model Details

### Model Description

MolReactGen is based on the GPT-2 transformer decoder architecture and has been trained on the GuacaMol dataset. More information can be found in the introductory slides linked under Model Sources below.
- Developed by: Stephan Holzgruber
- Model type: Transformer decoder
- License: MIT
### Model Sources
- Repository: https://github.com/hogru/MolReactGen
- Presentation: https://github.com/hogru/MolReactGen/blob/main/presentations/Slides%20(A4%20size).pdf
- Poster: https://github.com/hogru/MolReactGen/blob/main/presentations/Poster%20(A0%20size).pdf
## Uses

The main use of this model is to pass the master's examination of the author ;-)

### Direct Use
The model can be used in a Hugging Face text generation pipeline. For the intended use case a wrapper around the raw text generation pipeline is needed; this is the `generate.py` from the repository.

The model has a default `GenerationConfig()` (`generation_config.json`) which can be overwritten. Depending on the number of molecules to be generated (`num_return_sequences` in the JSON file) this might take a while. The generation code mentioned above shows a progress bar during generation.
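For illustration, a minimal sketch of raw pipeline usage (without the `generate.py` wrapper) could look as follows; the `^` prompt mirrors the widget example above, and overriding the shipped `generation_config.json` via `GenerationConfig.from_pretrained()` is an assumption about usage, not the repository's exact code:

```python
# Minimal sketch of raw pipeline usage, not the repository's generate.py wrapper.
# The "^" prompt follows the widget example above; any SMILES post-processing
# (e.g. stripping special tokens) is left to the wrapper / the user.
from transformers import GenerationConfig, pipeline

model_id = "hogru/MolReactGen-GuacaMol-Molecules"
generator = pipeline("text-generation", model=model_id)

# Start from the shipped generation_config.json and override selected fields.
generation_config = GenerationConfig.from_pretrained(model_id)
generation_config.num_return_sequences = 10  # more molecules -> longer runtime
generation_config.do_sample = True

outputs = generator("^", generation_config=generation_config)
for output in outputs:
    print(output["generated_text"])
```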
## Bias, Risks, and Limitations

The model generates molecules that are similar to the GuacaMol training data, which itself is based on ChEMBL. Any checks of the molecules, e.g. chemical feasibility, must be addressed by the user of the model.
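As an example of such a check, a basic validity filter could be implemented with RDKit (an external library, not part of this repository); `MolFromSmiles` returns `None` for strings it cannot parse:

```python
# Hedged sketch of a post-generation check using RDKit (not part of this model):
# keep only SMILES strings that RDKit can parse into a molecule.
from rdkit import Chem


def is_valid_smiles(smiles: str) -> bool:
    """Return True if RDKit can parse the SMILES string into a molecule."""
    return Chem.MolFromSmiles(smiles) is not None


generated = ["CCO", "c1ccccc1", "not-a-molecule"]
valid = [s for s in generated if is_valid_smiles(s)]
print(valid)  # ['CCO', 'c1ccccc1']
```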
## Training Details

### Training Data

The model has been trained on the GuacaMol dataset, which itself is based on ChEMBL.
### Training Procedure

The default Hugging Face `Trainer()` has been used, with an `EarlyStoppingCallback()`.
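A sketch of this setup is shown below; the model, the datasets, and the early-stopping patience are placeholders, not values taken from the repository's configuration:

```python
# Sketch of the training loop described above: the stock Trainer plus an
# EarlyStoppingCallback. model, train_dataset and eval_dataset are placeholders;
# the patience value is an assumption, not the repository's setting.
from transformers import EarlyStoppingCallback, Trainer

trainer = Trainer(
    model=model,                  # GPT-2 style decoder (placeholder)
    args=training_args,           # see the hyperparameter sketch further below
    train_dataset=train_dataset,  # tokenized GuacaMol training split (placeholder)
    eval_dataset=eval_dataset,    # tokenized validation split (placeholder)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```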
#### Preprocessing

The training data was pre-processed with a `PreTrainedTokenizerFast()` trained on the same data, using a character-level pre-tokenizer and Unigram as the sub-word tokenization algorithm with a vocabulary size of 88. Other tokenizers can be configured.
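A sketch of how such a tokenizer could be built with the `tokenizers` library is shown below; the training file path and the special tokens are illustrative assumptions, not the repository's actual setup:

```python
# Hedged sketch: a Unigram tokenizer with character-level pre-tokenization and a
# vocabulary size of 88, wrapped as a PreTrainedTokenizerFast. The file path and
# special tokens are illustrative assumptions.
from tokenizers import Regex, Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(models.Unigram())
# Split the SMILES strings into single characters before sub-word training.
tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex("."), behavior="isolated")

trainer = trainers.UnigramTrainer(vocab_size=88, special_tokens=["<unk>"], unk_token="<unk>")
tokenizer.train(files=["guacamol_train.smiles"], trainer=trainer)  # placeholder file

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="<unk>")
print(fast_tokenizer.vocab_size)
```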
#### Training Hyperparameters
- Batch size: 64
- Gradient accumulation steps: 4
- Mixed precision: fp16 (native AMP)
- Learning rate: 0.0025
- Learning rate scheduler: Cosine
- Learning rate scheduler warmup: 0.1
- Optimizer: AdamW with betas=(0.9,0.95) and epsilon=1e-08
- Number of epochs: 50
More configuration options can be found in the `conf` directory of the repository.
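For illustration, the hyperparameters listed above could be expressed as Hugging Face `TrainingArguments` roughly as follows; since the repository drives training through its own configuration files, this is an approximation, not the exact arguments used:

```python
# Hedged sketch: the hyperparameters listed above mapped onto TrainingArguments
# (Transformers 4.27). The output directory and the evaluation/save strategies are
# assumptions; the default optimizer is AdamW, configured via the adam_* arguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="molreactgen-guacamol",  # illustrative output path
    per_device_train_batch_size=64,
    gradient_accumulation_steps=4,
    fp16=True,                          # mixed precision, native AMP
    learning_rate=2.5e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    num_train_epochs=50,
    evaluation_strategy="epoch",        # needed for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,
)
```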
## Evaluation
Please see the slides / the poster mentioned above.
### Metrics
Please see the slides / the poster mentioned above.
### Results
Please see the slides / the poster mentioned above.
## Technical Specifications

### Framework versions
- Transformers 4.27.1
- PyTorch 1.13.1
- Datasets 2.10.1
- Tokenizers 0.13.2
### Hardware
- Local PC running Ubuntu 22.04
- NVIDIA GeForce RTX 3080 Ti (12 GB)