"""
Title: Drug Molecule Generation with VAE
Author: [Victor Basu](https://www.linkedin.com/in/victor-basu-520958147)
Date created: 2022/03/10
Last modified: 2024/12/17
Description: Implementing a Convolutional Variational AutoEncoder (VAE) for Drug Discovery.
Accelerator: GPU
"""
"""
## Introduction
In this example, we use a Variational Autoencoder to generate molecules for drug discovery.
We use the research papers
[Automatic chemical design using a data-driven continuous representation of molecules](https://arxiv.org/abs/1610.02415)
and [MolGAN: An implicit generative model for small molecular graphs](https://arxiv.org/abs/1805.11973)
as a reference.
The model described in the paper **Automatic chemical design using a data-driven
continuous representation of molecules** generates new molecules via efficient exploration
of open-ended spaces of chemical compounds. The model consists of
three components: Encoder, Decoder and Predictor. The Encoder converts the discrete
representation of a molecule into a real-valued continuous vector, and the Decoder
converts these continuous vectors back to discrete molecule representations. The
Predictor estimates chemical properties from the latent continuous vector representation
of the molecule. Continuous representations allow the use of gradient-based
optimization to efficiently guide the search for optimized functional compounds.
![intro](https://bit.ly/3CtPMzM)
**Figure (a)** - A diagram of the autoencoder used for molecule design, including the
joint property prediction model. Starting from a discrete molecule representation, such
as a SMILES string, the encoder network converts each molecule into a vector in the
latent space, which is effectively a continuous molecule representation. Given a point
in the latent space, the decoder network produces a corresponding SMILES string. A
multilayer perceptron network estimates the value of target properties associated with
each molecule.
**Figure (b)** - Gradient-based optimization in continuous latent space. After training a
surrogate model `f(z)` to predict the properties of molecules based on their latent
representation `z`, we can optimize `f(z)` with respect to `z` to find new latent
representations expected to match specific desired properties. These new latent
representations can then be decoded into SMILES strings, at which point their properties
can be tested empirically.
For an explanation and implementation of MolGAN, please refer to the Keras Example
[**WGAN-GP with R-GCN for the generation of small molecular graphs**](https://bit.ly/3pU6zXK) by
Alexander Kensert. Many of the functions used in the present example are from the above Keras example.
"""
"""
## Setup
RDKit is an open source toolkit for cheminformatics and machine learning. This toolkit comes in handy
if one is working in the drug discovery domain. In this example, RDKit is used to conveniently
and efficiently transform SMILES to molecule objects, and then from those obtain sets of atoms
and bonds.
Quoting from
[WGAN-GP with R-GCN for the generation of small molecular graphs](https://keras.io/examples/generative/wgan-graphs/):
**"SMILES expresses the structure of a given molecule in the form of an ASCII string.
The SMILES string is a compact encoding which, for smaller molecules, is relatively human-readable.
Encoding molecules as a string both alleviates and facilitates database and/or web searching
of a given molecule. RDKit uses algorithms to accurately transform a given SMILES to
a molecule object, which can then be used to compute a great number of molecular properties/features."**
"""
"""shell
pip -q install rdkit-pypi==2021.9.4
"""
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import ast
import pandas as pd
import numpy as np
import tensorflow as tf
import keras
from keras import layers
from keras import ops
import matplotlib.pyplot as plt
from rdkit import Chem, RDLogger
from rdkit.Chem import BondType
from rdkit.Chem.Draw import MolsToGridImage
RDLogger.DisableLog("rdApp.*")
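"""
As a small illustrative check (not part of the original example), RDKit can parse a
SMILES string into a molecule object from which atom and bond information is read off
directly; the example molecule below (aspirin) is an arbitrary choice.
"""
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin.GetNumAtoms(), aspirin.GetNumBonds())  # heavy-atom and bond counts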
"""
## Dataset
We use the [**ZINC – A Free Database of Commercially Available Compounds for
Virtual Screening**](https://bit.ly/3IVBI4x) dataset. The dataset comes with molecule
formulas in SMILES representation along with their respective molecular properties such as
**logP** (water–octanol partition coefficient), **SAS** (synthetic
accessibility score) and **QED** (Qualitative Estimate of Drug-likeness).
"""
csv_path = keras.utils.get_file(
"250k_rndm_zinc_drugs_clean_3.csv",
"https://raw.githubusercontent.com/aspuru-guzik-group/chemical_vae/master/models/zinc_properties/250k_rndm_zinc_drugs_clean_3.csv",
)
df = pd.read_csv(csv_path)
df["smiles"] = df["smiles"].apply(lambda s: s.replace("\n", ""))
df.head()
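"""
A quick summary of the numeric property columns (added for illustration; the column names
come from the downloaded CSV).
"""
print(df.describe())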
"""
## Hyperparameters
"""
SMILE_CHARSET = '["C", "B", "F", "I", "H", "O", "N", "S", "P", "Cl", "Br"]'
bond_mapping = {"SINGLE": 0, "DOUBLE": 1, "TRIPLE": 2, "AROMATIC": 3}
bond_mapping.update(
{0: BondType.SINGLE, 1: BondType.DOUBLE, 2: BondType.TRIPLE, 3: BondType.AROMATIC}
)
SMILE_CHARSET = ast.literal_eval(SMILE_CHARSET)
MAX_MOLSIZE = max(df["smiles"].str.len())
SMILE_to_index = dict((c, i) for i, c in enumerate(SMILE_CHARSET))
index_to_SMILE = dict((i, c) for i, c in enumerate(SMILE_CHARSET))
atom_mapping = dict(SMILE_to_index)
atom_mapping.update(index_to_SMILE)
BATCH_SIZE = 100
EPOCHS = 10
VAE_LR = 5e-4
NUM_ATOMS = 120 # Maximum number of atoms
ATOM_DIM = len(SMILE_CHARSET) # Number of atom types
BOND_DIM = 4 + 1 # Number of bond types
LATENT_DIM = 435 # Size of the latent space
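"""
Note (illustrative addition): `atom_mapping` and `bond_mapping` are deliberately
bidirectional, so the same dictionary maps a symbol to its index and an index back to the
symbol. Both `smiles_to_graph` and `graph_to_molecule` below rely on this.
"""
print(atom_mapping["C"], atom_mapping[0])  # index of "C" and symbol at index 0
print(bond_mapping["SINGLE"], bond_mapping[0])  # bond-type index and the RDKit bond type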
def smiles_to_graph(smiles):
# Converts SMILES to molecule object
molecule = Chem.MolFromSmiles(smiles)
# Initialize adjacency and feature tensor
adjacency = np.zeros((BOND_DIM, NUM_ATOMS, NUM_ATOMS), "float32")
features = np.zeros((NUM_ATOMS, ATOM_DIM), "float32")
# loop over each atom in molecule
for atom in molecule.GetAtoms():
i = atom.GetIdx()
atom_type = atom_mapping[atom.GetSymbol()]
features[i] = np.eye(ATOM_DIM)[atom_type]
# loop over one-hop neighbors
for neighbor in atom.GetNeighbors():
j = neighbor.GetIdx()
bond = molecule.GetBondBetweenAtoms(i, j)
bond_type_idx = bond_mapping[bond.GetBondType().name]
adjacency[bond_type_idx, [i, j], [j, i]] = 1
# Where no bond, add 1 to last channel (indicating "non-bond")
# Notice: channels-first
adjacency[-1, np.sum(adjacency, axis=0) == 0] = 1
# Where no atom, add 1 to last column (indicating "non-atom")
features[np.where(np.sum(features, axis=1) == 0)[0], -1] = 1
return adjacency, features
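"""
A quick sanity check (not part of the original tutorial): convert one SMILES string from
the dataset and confirm the tensor shapes match the hyperparameters defined above.
"""
example_adjacency, example_features = smiles_to_graph(df["smiles"].iloc[0])
print(example_adjacency.shape)  # expected: (BOND_DIM, NUM_ATOMS, NUM_ATOMS)
print(example_features.shape)  # expected: (NUM_ATOMS, ATOM_DIM)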
def graph_to_molecule(graph):
# Unpack graph
adjacency, features = graph
# RWMol is a molecule object intended to be edited
molecule = Chem.RWMol()
# Remove "no atoms" & atoms with no bonds
keep_idx = np.where(
(np.argmax(features, axis=1) != ATOM_DIM - 1)
& (np.sum(adjacency[:-1], axis=(0, 1)) != 0)
)[0]
features = features[keep_idx]
adjacency = adjacency[:, keep_idx, :][:, :, keep_idx]
# Add atoms to molecule
for atom_type_idx in np.argmax(features, axis=1):
atom = Chem.Atom(atom_mapping[atom_type_idx])
_ = molecule.AddAtom(atom)
# Add bonds between atoms in molecule; based on the upper triangles
# of the [symmetric] adjacency tensor
(bonds_ij, atoms_i, atoms_j) = np.where(np.triu(adjacency) == 1)
for bond_ij, atom_i, atom_j in zip(bonds_ij, atoms_i, atoms_j):
if atom_i == atom_j or bond_ij == BOND_DIM - 1:
continue
bond_type = bond_mapping[bond_ij]
molecule.AddBond(int(atom_i), int(atom_j), bond_type)
# Sanitize the molecule; for more information on sanitization, see
# https://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization
flag = Chem.SanitizeMol(molecule, catchErrors=True)
# Let's be strict. If sanitization fails, return None
if flag != Chem.SanitizeFlags.SANITIZE_NONE:
return None
return molecule
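"""
A hedged round-trip check (added for illustration): encoding a SMILES string into a graph
and decoding it back should recover the original structure for most well-formed inputs,
although `graph_to_molecule` may return `None` if sanitization fails.
"""
test_smiles = df["smiles"].iloc[0]
recovered = graph_to_molecule(smiles_to_graph(test_smiles))
print("original :", test_smiles)
print("recovered:", Chem.MolToSmiles(recovered) if recovered is not None else None)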
"""
## Generate training set
"""
train_df = df.sample(frac=0.75, random_state=42) # random state is a seed value
train_df.reset_index(drop=True, inplace=True)
adjacency_tensor, feature_tensor, qed_tensor = [], [], []
for idx in range(8000):
adjacency, features = smiles_to_graph(train_df.loc[idx]["smiles"])
qed = train_df.loc[idx]["qed"]
adjacency_tensor.append(adjacency)
feature_tensor.append(features)
qed_tensor.append(qed)
adjacency_tensor = np.array(adjacency_tensor)
feature_tensor = np.array(feature_tensor)
qed_tensor = np.array(qed_tensor)
class RelationalGraphConvLayer(keras.layers.Layer):
def __init__(
self,
units=128,
activation="relu",
use_bias=False,
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
kernel_regularizer=None,
bias_regularizer=None,
**kwargs
):
super().__init__(**kwargs)
self.units = units
self.activation = keras.activations.get(activation)
self.use_bias = use_bias
self.kernel_initializer = keras.initializers.get(kernel_initializer)
self.bias_initializer = keras.initializers.get(bias_initializer)
self.kernel_regularizer = keras.regularizers.get(kernel_regularizer)
self.bias_regularizer = keras.regularizers.get(bias_regularizer)
def build(self, input_shape):
bond_dim = input_shape[0][1]
atom_dim = input_shape[1][2]
self.kernel = self.add_weight(
shape=(bond_dim, atom_dim, self.units),
initializer=self.kernel_initializer,
regularizer=self.kernel_regularizer,
trainable=True,
name="W",
dtype="float32",
)
if self.use_bias:
self.bias = self.add_weight(
shape=(bond_dim, 1, self.units),
initializer=self.bias_initializer,
regularizer=self.bias_regularizer,
trainable=True,
name="b",
dtype="float32",
)
self.built = True
def call(self, inputs, training=False):
adjacency, features = inputs
# Aggregate information from neighbors
x = ops.matmul(adjacency, features[:, None])
# Apply linear transformation
x = ops.matmul(x, self.kernel)
if self.use_bias:
x += self.bias
# Reduce bond types dim
x_reduced = ops.sum(x, axis=1)
# Apply non-linear transformation
return self.activation(x_reduced)
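"""
A brief, hedged illustration of the layer's input/output contract: passing a couple of the
precomputed graphs through an untrained layer yields per-atom embeddings of shape
`(batch, NUM_ATOMS, units)`.
"""
demo_layer = RelationalGraphConvLayer(units=64)
demo_output = demo_layer([adjacency_tensor[:2], feature_tensor[:2]])
print(demo_output.shape)  # expected: (2, NUM_ATOMS, 64)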
"""
## Build the Encoder and Decoder
The Encoder takes as input a molecule's graph adjacency matrix and feature matrix.
These features are processed via a Graph Convolution layer, then are flattened and
processed by several Dense layers to derive `z_mean` and `log_var`, the
latent-space representation of the molecule.
**Graph Convolution layer**: The relational graph convolution layer implements
non-linearly transformed neighbourhood aggregations. We can define these layers as
follows:
`H_hat**(l+1) = σ(D_hat**(-1) * A_hat * H_hat**(l) * W**(l))`
Where `σ` denotes the non-linear transformation (commonly a ReLU activation), `A_hat` the
adjacency tensor, `H_hat**(l)` the feature tensor at the `l`-th layer, `D_hat**(-1)` the
inverse diagonal degree tensor of `A_hat`, and `W**(l)` the trainable weight tensor
at the `l`-th layer. Specifically, for each bond type (relation), the degree tensor
expresses, in the diagonal, the number of bonds attached to each atom.
Source:
[WGAN-GP with R-GCN for the generation of small molecular graphs](https://keras.io/examples/generative/wgan-graphs/)
The Decoder takes as input the latent-space representation and predicts
the graph adjacency matrix and feature matrix of the corresponding molecules.
"""
def get_encoder(
gconv_units, latent_dim, adjacency_shape, feature_shape, dense_units, dropout_rate
):
adjacency = layers.Input(shape=adjacency_shape)
features = layers.Input(shape=feature_shape)
# Propagate through one or more graph convolutional layers
features_transformed = features
for units in gconv_units:
features_transformed = RelationalGraphConvLayer(units)(
[adjacency, features_transformed]
)
# Reduce 2-D representation of molecule to 1-D
x = layers.GlobalAveragePooling1D()(features_transformed)
# Propagate through one or more densely connected layers
for units in dense_units:
x = layers.Dense(units, activation="relu")(x)
x = layers.Dropout(dropout_rate)(x)
z_mean = layers.Dense(latent_dim, dtype="float32", name="z_mean")(x)
log_var = layers.Dense(latent_dim, dtype="float32", name="log_var")(x)
encoder = keras.Model([adjacency, features], [z_mean, log_var], name="encoder")
return encoder
def get_decoder(dense_units, dropout_rate, latent_dim, adjacency_shape, feature_shape):
latent_inputs = keras.Input(shape=(latent_dim,))
x = latent_inputs
for units in dense_units:
x = layers.Dense(units, activation="tanh")(x)
x = layers.Dropout(dropout_rate)(x)
# Map outputs of previous layer (x) to [continuous] adjacency tensors (x_adjacency)
x_adjacency = layers.Dense(np.prod(adjacency_shape))(x)
x_adjacency = layers.Reshape(adjacency_shape)(x_adjacency)
# Symmetrify tensors in the last two dimensions
x_adjacency = (x_adjacency + ops.transpose(x_adjacency, (0, 1, 3, 2))) / 2
x_adjacency = layers.Softmax(axis=1)(x_adjacency)
# Map outputs of previous layer (x) to [continuous] feature tensors (x_features)
x_features = layers.Dense(np.prod(feature_shape))(x)
x_features = layers.Reshape(feature_shape)(x_features)
x_features = layers.Softmax(axis=2)(x_features)
decoder = keras.Model(
latent_inputs, outputs=[x_adjacency, x_features], name="decoder"
)
return decoder
"""
## Build the Sampling layer
"""
class Sampling(layers.Layer):
def __init__(self, seed=None, **kwargs):
super().__init__(**kwargs)
self.seed_generator = keras.random.SeedGenerator(seed)
def call(self, inputs):
z_mean, z_log_var = inputs
batch, dim = ops.shape(z_log_var)
epsilon = keras.random.normal(shape=(batch, dim), seed=self.seed_generator)
return z_mean + ops.exp(0.5 * z_log_var) * epsilon
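"""
A tiny sketch (added for illustration) of the reparameterization trick: with zero means
and zero log-variances the layer simply emits standard-normal noise of the requested
shape.
"""
demo_z = Sampling(seed=42)([ops.zeros((4, LATENT_DIM)), ops.zeros((4, LATENT_DIM))])
print(demo_z.shape)  # expected: (4, LATENT_DIM)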
"""
## Build the VAE
This model is trained to optimize four losses:
* Categorical crossentropy
* KL divergence loss
* Property prediction loss
* Graph loss (gradient penalty)
The categorical crossentropy loss function measures the model's
reconstruction accuracy. The property prediction loss measures the discrepancy between the
true QED value and the prediction obtained by running the latent representation through a
property prediction model; in this implementation it is computed with binary crossentropy.
The gradient penalty is further guided by the model's property (QED) prediction.
A gradient penalty is an alternative soft constraint on
1-Lipschitz continuity, introduced as an improvement over the gradient clipping scheme of
the original WGAN
("1-Lipschitz continuity" means that the norm of the gradient is at most 1 at every single
point of the function).
It adds a regularization term to the loss function.
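For reference, the KL divergence term below uses the standard closed form for a diagonal
Gaussian posterior against a standard-normal prior,
`kl_loss = -0.5 * sum(1 + z_log_var - z_mean**2 - exp(z_log_var))`,
with `exp(z_log_var)` additionally clamped in the implementation for numerical stability.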
"""
class MoleculeGenerator(keras.Model):
def __init__(self, encoder, decoder, max_len, seed=None, **kwargs):
super().__init__(**kwargs)
self.encoder = encoder
self.decoder = decoder
self.property_prediction_layer = layers.Dense(1)
self.max_len = max_len
self.seed_generator = keras.random.SeedGenerator(seed)
self.sampling_layer = Sampling(seed=seed)
self.train_total_loss_tracker = keras.metrics.Mean(name="train_total_loss")
self.val_total_loss_tracker = keras.metrics.Mean(name="val_total_loss")
def train_step(self, data):
adjacency_tensor, feature_tensor, qed_tensor = data[0]
graph_real = [adjacency_tensor, feature_tensor]
self.batch_size = ops.shape(qed_tensor)[0]
with tf.GradientTape() as tape:
z_mean, z_log_var, qed_pred, gen_adjacency, gen_features = self(
graph_real, training=True
)
graph_generated = [gen_adjacency, gen_features]
total_loss = self._compute_loss(
z_log_var, z_mean, qed_tensor, qed_pred, graph_real, graph_generated
)
grads = tape.gradient(total_loss, self.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
self.train_total_loss_tracker.update_state(total_loss)
return {"loss": self.train_total_loss_tracker.result()}
def _compute_loss(
self, z_log_var, z_mean, qed_true, qed_pred, graph_real, graph_generated
):
adjacency_real, features_real = graph_real
adjacency_gen, features_gen = graph_generated
adjacency_loss = ops.mean(
ops.sum(
keras.losses.categorical_crossentropy(
adjacency_real, adjacency_gen, axis=1
),
axis=(1, 2),
)
)
features_loss = ops.mean(
ops.sum(
keras.losses.categorical_crossentropy(features_real, features_gen),
axis=(1),
)
)
kl_loss = -0.5 * ops.sum(
1 + z_log_var - z_mean**2 - ops.minimum(ops.exp(z_log_var), 1e6), 1
)
kl_loss = ops.mean(kl_loss)
property_loss = ops.mean(
keras.losses.binary_crossentropy(qed_true, ops.squeeze(qed_pred, axis=1))
)
graph_loss = self._gradient_penalty(graph_real, graph_generated)
return kl_loss + property_loss + graph_loss + adjacency_loss + features_loss
def _gradient_penalty(self, graph_real, graph_generated):
# Unpack graphs
adjacency_real, features_real = graph_real
adjacency_generated, features_generated = graph_generated
# Generate interpolated graphs (adjacency_interp and features_interp)
alpha = keras.random.uniform(shape=(self.batch_size,), seed=self.seed_generator)
alpha = ops.reshape(alpha, (self.batch_size, 1, 1, 1))
adjacency_interp = (adjacency_real * alpha) + (
1.0 - alpha
) * adjacency_generated
alpha = ops.reshape(alpha, (self.batch_size, 1, 1))
features_interp = (features_real * alpha) + (1.0 - alpha) * features_generated
# Compute the logits of interpolated graphs
with tf.GradientTape() as tape:
tape.watch(adjacency_interp)
tape.watch(features_interp)
_, _, logits, _, _ = self(
[adjacency_interp, features_interp], training=True
)
# Compute the gradients with respect to the interpolated graphs
grads = tape.gradient(logits, [adjacency_interp, features_interp])
# Compute the gradient penalty
grads_adjacency_penalty = (1 - ops.norm(grads[0], axis=1)) ** 2
grads_features_penalty = (1 - ops.norm(grads[1], axis=2)) ** 2
return ops.mean(
ops.mean(grads_adjacency_penalty, axis=(-2, -1))
+ ops.mean(grads_features_penalty, axis=(-1))
)
def inference(self, batch_size):
z = keras.random.normal(
shape=(batch_size, LATENT_DIM), seed=self.seed_generator
)
        reconstruction_adjacency, reconstruction_features = self.decoder.predict(z)
# obtain one-hot encoded adjacency tensor
adjacency = ops.argmax(reconstruction_adjacency, axis=1)
adjacency = ops.one_hot(adjacency, num_classes=BOND_DIM, axis=1)
# Remove potential self-loops from adjacency
adjacency = adjacency * (1.0 - ops.eye(NUM_ATOMS, dtype="float32")[None, None])
# obtain one-hot encoded feature tensor
features = ops.argmax(reconstruction_features, axis=2)
features = ops.one_hot(features, num_classes=ATOM_DIM, axis=2)
return [
graph_to_molecule([adjacency[i].numpy(), features[i].numpy()])
for i in range(batch_size)
]
def call(self, inputs):
z_mean, log_var = self.encoder(inputs)
z = self.sampling_layer([z_mean, log_var])
gen_adjacency, gen_features = self.decoder(z)
property_pred = self.property_prediction_layer(z_mean)
return z_mean, log_var, property_pred, gen_adjacency, gen_features
"""
## Train the model
"""
vae_optimizer = keras.optimizers.Adam(learning_rate=VAE_LR)
encoder = get_encoder(
gconv_units=[9],
adjacency_shape=(BOND_DIM, NUM_ATOMS, NUM_ATOMS),
feature_shape=(NUM_ATOMS, ATOM_DIM),
latent_dim=LATENT_DIM,
dense_units=[512],
dropout_rate=0.0,
)
decoder = get_decoder(
dense_units=[128, 256, 512],
dropout_rate=0.2,
latent_dim=LATENT_DIM,
adjacency_shape=(BOND_DIM, NUM_ATOMS, NUM_ATOMS),
feature_shape=(NUM_ATOMS, ATOM_DIM),
)
model = MoleculeGenerator(encoder, decoder, MAX_MOLSIZE)
model.compile(vae_optimizer)
history = model.fit([adjacency_tensor, feature_tensor, qed_tensor], epochs=EPOCHS)
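"""
Optionally (not in the original example), we can visualize how the total training loss
evolves over the epochs recorded in `history`.
"""
plt.plot(history.history["loss"])
plt.xlabel("Epoch")
plt.ylabel("Total loss")
plt.show()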
"""
## Inference
We use our model to generate new valid molecules from different points of the latent space.
"""
"""
### Generate unique Molecules with the model
"""
molecules = model.inference(1000)
MolsToGridImage(
[m for m in molecules if m is not None][:1000], molsPerRow=5, subImgSize=(260, 160)
)
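"""
The optimization described in Figure (b) can also be sketched directly. This snippet is an
illustrative addition, not part of the original example, and the step size and iteration
count are arbitrary choices: we ascend the trained property predictor `f(z)` with respect
to `z`, then decode the optimized latent vector back into a molecule, reusing the same
post-processing as `inference`.
"""
z = tf.Variable(tf.random.normal((1, LATENT_DIM)))
for _ in range(50):
    with tf.GradientTape() as tape:
        predicted_qed = model.property_prediction_layer(z)  # surrogate f(z)
    grads = tape.gradient(predicted_qed, z)
    z.assign_add(0.1 * grads)  # gradient ascent towards a higher predicted property

# Decode the optimized latent vector the same way `inference` does
opt_adjacency, opt_features = model.decoder.predict(z.numpy())
opt_adjacency = ops.argmax(opt_adjacency, axis=1)
opt_adjacency = ops.one_hot(opt_adjacency, num_classes=BOND_DIM, axis=1)
opt_adjacency = opt_adjacency * (1.0 - ops.eye(NUM_ATOMS, dtype="float32")[None, None])
opt_features = ops.argmax(opt_features, axis=2)
opt_features = ops.one_hot(opt_features, num_classes=ATOM_DIM, axis=2)
optimized_molecule = graph_to_molecule(
    [opt_adjacency[0].numpy(), opt_features[0].numpy()]
)
optimized_molecule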
"""
### Display latent space clusters with respect to molecular properties (QED)
"""
def plot_latent(vae, data, labels):
# display a 2D plot of the property in the latent space
z_mean, _ = vae.encoder.predict(data)
plt.figure(figsize=(12, 10))
plt.scatter(z_mean[:, 0], z_mean[:, 1], c=labels)
plt.colorbar()
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.show()
plot_latent(model, [adjacency_tensor[:8000], feature_tensor[:8000]], qed_tensor[:8000])
"""
## Conclusion
In this example, we combined model architectures from two papers,
"Automatic chemical design using a data-driven continuous representation of
molecules" from 2016 and the "MolGAN" paper from 2018. The former paper
treats SMILES inputs as strings and seeks to generate molecule strings in SMILES format,
while the latter paper considers SMILES inputs as graphs (a combination of adjacency
matrices and feature matrices) and seeks to generate molecules as graphs.
This hybrid approach enables a new type of directed gradient-based search through chemical space.
Example available on HuggingFace
| Trained Model | Demo |
| :--: | :--: |
| [![Generic badge](https://img.shields.io/badge/%F0%9F%A4%97%20Model-molecule%20generation%20with%20VAE-black.svg)](https://huggingface.co/keras-io/drug-molecule-generation-with-VAE) | [![Generic badge](https://img.shields.io/badge/%F0%9F%A4%97%20Spaces-molecule%20generation%20with%20VAE-black.svg)](https://huggingface.co/spaces/keras-io/generating-drug-molecule-with-VAE) |
"""