---
library_name: tf-keras
---

## Model description

This repo contains the model and the notebook for the Keras example [Drug Molecule Generation with VAE](https://keras.io/examples/generative/molecule_generation/).

Full credits go to [Victor Basu](https://www.linkedin.com/in/victor-basu-520958147/)

Reproduced by [Vu Minh Chien](https://www.linkedin.com/in/vumichien/)

Motivation: Using a Variational Autoencoder to generate molecules for drug discovery. Automatic chemical design using a data-driven continuous representation of molecules generates new molecules via efficient exploration of open-ended spaces of chemical compounds. The model consists of three components: Encoder, Decoder, and Predictor. The Encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the Decoder converts these continuous vectors back to discrete molecule representations. The Predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations allow the use of gradient-based optimization to efficiently guide the search for optimized functional compounds.

![intro](https://bit.ly/3CtPMzM)
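The sampling step that connects the Encoder to the Decoder can be sketched in plain Python. This is a minimal illustration of the standard VAE reparameterization trick and KL regularizer, not the notebook's actual layer code; the function names `sample_latent` and `kl_divergence` are hypothetical, and it assumes the Encoder outputs a mean and a log-variance per latent dimension.

```python
import math
import random


def sample_latent(z_mean, z_log_var, rng=random):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1) (reparameterization trick).

    Writing the sample this way keeps z a deterministic function of the
    encoder outputs plus external noise, which is what lets gradients flow
    through the sampling step.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(z_mean, z_log_var)
    ]


def kl_divergence(z_mean, z_log_var):
    """KL(N(mean, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv)
        for m, lv in zip(z_mean, z_log_var)
    )


if __name__ == "__main__":
    mean, log_var = [0.5, -1.0], [0.0, -2.0]  # assumed example encoder outputs
    print(sample_latent(mean, log_var, random.Random(0)))
    print(kl_divergence(mean, log_var))
```

The KL term is zero only when the encoder outputs a standard normal, so it pulls the latent codes toward a region the Decoder can later be sampled from.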

## Intended uses & limitations

In this example, RDKit is used to transform SMILES strings into molecule objects, from which sets of atoms and bonds are then obtained. SMILES expresses the structure of a given molecule as an ASCII string. It is a compact encoding that, for smaller molecules, is relatively human-readable. Encoding molecules as strings also facilitates database and web searches for a given molecule. RDKit parses a given SMILES string into a molecule object, which can then be used to compute a large number of molecular properties and features.
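As a rough illustration of the step described above, the following sketch parses a SMILES string with RDKit and lists its atoms and bonds. The helper name `atoms_and_bonds` is hypothetical and is not taken from the notebook; it assumes the `rdkit` package is installed.

```python
from rdkit import Chem


def atoms_and_bonds(smiles):
    """Return (atom symbols, bonds) for a SMILES string.

    Each bond is a tuple (begin_atom_index, end_atom_index, bond_type_name).
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None when the string cannot be parsed.
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    bonds = [
        (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType().name)
        for bond in mol.GetBonds()
    ]
    return atoms, bonds


if __name__ == "__main__":
    # Ethanol: two carbons and one oxygen joined by single bonds.
    print(atoms_and_bonds("CCO"))
```

Hydrogens are implicit in the parsed molecule by default, so only heavy atoms appear in the output.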

## Training and evaluation data

This tutorial uses the ZINC dataset ("A Free Database of Commercially Available Compounds for Virtual Screening"). The dataset provides each molecule in SMILES representation along with molecular properties such as logP (water-octanol partition coefficient), SAS (synthetic accessibility score), and QED (Quantitative Estimate of Drug-likeness).

## Model Plot

<details>
<summary>View Model Plot</summary>

![Model Image](./model.png)

</details>

## Output samples

Latent space samples

![Latent spaces](./latent_space_clusters.png)

<details>
<summary>View samples</summary>

![Samples](./samples.png)

</details>