This repo contains the model and the notebook for Drug Molecule Generation with VAE.
Full credits go to Victor Basu
Reproduced by Vu Minh Chien
Motivation: Using a Variational Autoencoder to generate molecules for drug discovery. Automatic chemical design using a data-driven continuous representation of molecules generates new molecules via efficient exploration of open-ended spaces of chemical compounds. The model consists of three components: Encoder, Decoder, and Predictor. The Encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the Decoder converts these continuous vectors back to discrete molecule representations. The Predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations allow the use of gradient-based optimization to efficiently guide the search for optimized functional compounds.
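The gradient-based search described above can be illustrated with a minimal sketch. It assumes a toy, hand-written predictor (a negative squared distance to a hypothetical `TARGET` point) in place of the trained Predictor network, so the names `TARGET`, `predict_property`, `property_gradient`, and `optimize_latent` are illustrative only and not part of the original notebook.

```python
# Toy stand-in for a trained Predictor: property(z) = -||z - TARGET||^2.
# In the real model this would be a neural network acting on the latent vector.
TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum in latent space


def predict_property(z):
    """Return the toy property score for latent vector z (higher is better)."""
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, TARGET))


def property_gradient(z):
    """Analytic gradient of the toy predictor with respect to z."""
    return [-2.0 * (zi - ti) for zi, ti in zip(z, TARGET)]


def optimize_latent(z, steps=100, lr=0.1):
    """Plain gradient ascent on the predicted property in latent space."""
    for _ in range(steps):
        grad = property_gradient(z)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


z_start = [0.0, 0.0, 0.0]
z_opt = optimize_latent(z_start)
print(predict_property(z_start), predict_property(z_opt))
```

In the full model, the optimized latent vector would then be passed to the Decoder to obtain a candidate molecule; that step is omitted here because it depends on the trained weights.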
In this example, RDKit is used to conveniently and efficiently transform SMILES into molecule objects, and then from those obtain sets of atoms and bonds. SMILES expresses the structure of a given molecule in the form of an ASCII string. The SMILES string is a compact encoding that, for smaller molecules, is relatively human-readable. Encoding molecules as strings makes it easier to search for a given molecule in databases and on the web. RDKit uses algorithms to accurately transform a given SMILES string into a molecule object, which can then be used to compute a great number of molecular properties and features.
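As a small illustration of the SMILES-to-molecule step, the sketch below parses a SMILES string with RDKit and lists its atoms and bonds. It assumes RDKit is installed; the helper name `atoms_and_bonds` and the choice of ethanol (`CCO`) as the input are illustrative and not taken from the notebook.

```python
from rdkit import Chem


def atoms_and_bonds(smiles):
    """Parse a SMILES string and return (atom symbols, bond tuples)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None when the string cannot be parsed.
        raise ValueError(f"Invalid SMILES: {smiles!r}")
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    bonds = [
        (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType().name)
        for bond in mol.GetBonds()
    ]
    return atoms, bonds


# Ethanol: two carbons and one oxygen joined by single bonds.
atoms, bonds = atoms_and_bonds("CCO")
print(atoms)
print(bonds)
```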
The ZINC – A Free Database of Commercially Available Compounds for Virtual Screening dataset was used in this tutorial. The dataset provides molecules in SMILES representation along with their respective molecular properties, such as logP (octanol-water partition coefficient), SAS (synthetic accessibility score), and QED (Quantitative Estimate of Drug-likeness).
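To give a feel for two of these properties, the sketch below computes a logP estimate and a QED score for one molecule with RDKit. This is an assumption-laden illustration: the values are RDKit's own estimates (Crippen logP via `Descriptors.MolLogP`, and `QED.qed`) and may not match the values stored in the dataset. SAS is left out because it is not part of RDKit's core API. The helper name `describe` and the aspirin SMILES are illustrative choices.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED


def describe(smiles):
    """Return RDKit estimates of logP and QED for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles!r}")
    return {
        "logP": Descriptors.MolLogP(mol),  # Crippen estimate of logP
        "qed": QED.qed(mol),               # drug-likeness score between 0 and 1
    }


# Aspirin, used here only as a familiar example molecule.
props = describe("CC(=O)Oc1ccccc1C(=O)O")
print(props)
```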
Latent space samples