{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Scaling inputs and outputs" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from chemprop.models import MPNN\n", "from chemprop.nn import BondMessagePassing, NormAggregation, RegressionFFN\n", "from chemprop.nn.transforms import ScaleTransform, UnscaleTransform, GraphTransform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an example [dataset](./data/datasets.ipynb) with extra atom and bond features, extra atom descriptors, and extra [datapoint](./data/datapoints.ipynb) descriptors." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from chemprop.data import MoleculeDatapoint, MoleculeDataset\n", "\n", "smis = [\"CC\", \"CN\", \"CO\", \"CF\", \"CP\", \"CS\", \"CI\"]\n", "ys = np.random.rand(len(smis), 1) * 100\n", "\n", "n_datapoints = len(smis)\n", "n_atoms = 2\n", "n_bonds = 1\n", "n_extra_atom_features = 3\n", "n_extra_bond_features = 4\n", "n_extra_atom_descriptors = 5\n", "n_extra_datapoint_descriptors = 6\n", "\n", "extra_atom_features = np.random.rand(n_datapoints, n_atoms, n_extra_atom_features)\n", "extra_bond_features = np.random.rand(n_datapoints, n_bonds, n_extra_bond_features)\n", "extra_atom_descriptors = np.random.rand(n_datapoints, n_atoms, n_extra_atom_descriptors)\n", "extra_datapoint_descriptors = np.random.rand(n_datapoints, n_extra_datapoint_descriptors)\n", "\n", "datapoints = [\n", " MoleculeDatapoint.from_smi(smi, y, x_d=x_d, V_f=V_f, E_f=E_f, V_d=V_d)\n", " for smi, y, x_d, V_f, E_f, V_d in zip(\n", " smis,\n", " ys,\n", " extra_datapoint_descriptors,\n", " extra_atom_features,\n", " extra_bond_features,\n", " extra_atom_descriptors,\n", " )\n", "]\n", "train_dset = MoleculeDataset(datapoints[:3])\n", "val_dset = MoleculeDataset(datapoints[3:5])\n", "test_dset = MoleculeDataset(datapoints[5:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scaling targets - FFN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scaling the target values before training can improve model performance and make training faster. The scaler for the targets should be fit to the training dataset and then applied to the validation dataset. This scaler is *not* applied to the test dataset. Instead the scaler is used to make an `UnscaleTransform` which is given to the predictor (FFN) layer and used automatically during inference. \n", "\n", "Note that currently the output_transform is saved both in the model's state_dict and and in the model's hyperparameters. This may be changed in the future to align with `lightning`'s recommendations. You can ignore any messages about this." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "output_scaler = train_dset.normalize_targets()\n", "val_dset.normalize_targets(output_scaler)\n", "# test_dset targets not scaled\n", "\n", "output_transform = UnscaleTransform.from_standard_scaler(output_scaler)\n", "\n", "ffn = RegressionFFN(output_transform=output_transform)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scaling extra atom and bond features - Message Passing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The atom and bond features generated by Chemprop [featurizers](./featurizers/molgraph_molecule_featurizer.ipynb) are either multi-hot or on the order of 1. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Scaling extra atom and bond features - Message Passing" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The atom and bond features generated by Chemprop [featurizers](./featurizers/molgraph_molecule_featurizer.ipynb) are either multi-hot or on the order of 1, so we recommend scaling extra atom and bond features to be on the order of 1 as well. Like the target scaler, these scalers are fit to the training data, applied to the validation data, and then saved to the model (in this case, the message passing layer) so that they are applied automatically to the test dataset during inference." ] },
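{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the mechanics concrete before the real feature scalers are fit below, here is a minimal, self-contained sketch of what a `ScaleTransform` does. It uses a throwaway scaler (`demo_scaler`, introduced only for this illustration) so the datasets are left untouched, and it assumes the transform standardizes its input in eval mode and is a no-op in training mode." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: a ScaleTransform built from a fitted StandardScaler\n", "# standardizes its input. A throwaway scaler is used here so the datasets\n", "# stay untouched.\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "demo_scaler = StandardScaler().fit(extra_atom_features.reshape(-1, n_extra_atom_features))\n", "demo_transform = ScaleTransform.from_standard_scaler(demo_scaler)\n", "demo_transform.eval()  # assumed: the transform only rescales in eval mode\n", "\n", "V = torch.tensor(extra_atom_features.reshape(-1, n_extra_atom_features), dtype=torch.float)\n", "V_scaled = demo_transform(V)  # approximately (V - mean) / scale\n", "print(V_scaled.mean(0), V_scaled.std(0))  # per-feature mean ~0, std ~1" ] },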
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "StandardScaler()