File size: 5,836 Bytes
449f87e 5909075 54dd159 56f2cd5 449f87e 5909075 449f87e 2ce84eb 449f87e d9c3e6f 449f87e 1ff7948 10ee990 75e99a4 449f87e 525955d 449f87e d9c3e6f 449f87e 2ce84eb 449f87e d9c3e6f 449f87e a1d11fa 449f87e 5909075 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 |
---
license: apache-2.0
metrics:
- accuracy
pipeline_tag: feature-extraction
tags:
- chemistry
- foundation models
- AI4Science
- materials
- molecules
library_name: transformers
base_model: ibm/materials.smi-ted
---
# Introduction to IBM's Foundation Models for Materials
Welcome to IBM's series of large foundation models for sustainable materials. Our models span a variety of representations and modalities, including SMILES, SELFIES, 3D atom positions, 3D density grids, molecular graphs, and other formats. These models are designed to support and advance research in materials science and chemistry.
GitHub: [GitHub Link](https://github.com/IBM/materials/tree/main)
# SMILES-based Transformer Encoder-Decoder (SMI-TED)
This repository provides PyTorch source code associated with our publication, "A Large Encoder-Decoder Family of Foundation Models for Chemical Language".
Paper: [Arxiv Link](https://github.com/IBM/materials/blob/main/smi-ted/paper/smi_ted_preprint.pdf)
For more information contact: eduardo.soares@ibm.com or evital@br.ibm.com.
![ted-smi](smi-ted.png)
## Introduction
We present a large encoder-decoder chemical foundation model, SMILES-based Transformer Encoder-Decoder (SMI-TED), pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants (289M and 8X289M). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance for various tasks. For more information contact: eduardo.soares@ibm.com or evital@br.ibm.com.
## Table of Contents
1. [Getting Started](#getting-started)
1. [Pretrained Models and Training Logs](#pretrained-models-and-training-logs)
2. [Replicating Conda Environment](#replicating-conda-environment)
2. [Pretraining](#pretraining)
3. [Finetuning](#finetuning)
4. [Feature Extraction](#feature-extraction)
5. [Citations](#citations)
## Getting Started
**This code and environment have been tested on Nvidia V100s and Nvidia A100s**
### Pretrained Models and Training Logs
We provide checkpoints of the SMI-TED model pre-trained on a dataset of ~91M molecules curated from PubChem. The pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet.
Add the SMI-TED `pre-trained weights.pt` to the `inference/` or `finetune/` directory according to your needs. The directory structure should look like the following:
```
inference/
βββ smi_ted_light
β βββ smi_ted_light.pt
β βββ bert_vocab_curated.txt
β βββ load.py
```
and/or:
```
finetune/
βββ smi_ted_light
β βββ smi_ted_light.pt
β βββ bert_vocab_curated.txt
β βββ load.py
```
### Replicating Conda Environment
Follow these steps to replicate our Conda environment and install the necessary libraries:
#### Create and Activate Conda Environment
```
conda create --name smi-ted-env python=3.8.18
conda activate smi-ted-env
```
#### Install Packages with Conda
```
conda install pytorch=1.13.1 cudatoolkit=11.4 -c pytorch
conda install numpy=1.23.5 pandas=2.0.3
conda install rdkit=2021.03.5 -c conda-forge
```
#### Install Packages with Pip
```
pip install transformers==4.6.0 pytorch-fast-transformers==0.4.0 torch-optimizer==0.3.0 datasets==1.6.2 scikit-learn==1.3.2 scipy==1.12.0 tqdm==4.66.1
```
## Pretraining
For pretraining, we use two strategies: the masked language model method to train the encoder part and an encoder-decoder strategy to refine SMILES reconstruction and improve the generated latent space.
SMI-TED is pre-trained on canonicalized and curated 91M SMILES from PubChem with the following constraints:
- Compounds are filtered to a maximum length of 202 tokens during preprocessing.
- A 95/5/0 split is used for encoder training, with 5% of the data for decoder pretraining.
- A 100/0/0 split is also used to train the encoder and decoder directly, enhancing model performance.
The pretraining code provides examples of data processing and model training on a smaller dataset, requiring 8 A100 GPUs.
To pre-train the two variants of the SMI-TED model, run:
```
bash training/run_model_light_training.sh
```
or
```
bash training/run_model_large_training.sh
```
Use `train_model_D.py` to train only the decoder or `train_model_ED.py` to train both the encoder and decoder.
## Finetuning
The finetuning datasets and environment can be found in the [finetune](https://github.com/IBM/materials/tree/main/smi-ted/finetune) directory. After setting up the environment, you can run a finetuning task with:
```
bash finetune/smi_ted_light/esol/run_finetune_esol.sh
```
Finetuning training/checkpointing resources will be available in directories named `checkpoint_<measure_name>`.
## Feature Extraction
The example notebook [smi_ted_encoder_decoder_example.ipynb](https://github.com/IBM/materials/blob/main/smi-ted/notebooks/smi_ted_encoder_decoder_example.ipynb) contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks. It also includes examples of classification and regression tasks.
To load smi-ted, you can simply use:
```python
model = load_smi_ted(
folder='../inference/smi_ted_light',
ckpt_filename='smi_ted_light.pt'
)
```
or
```python
with open('model_weights.bin', 'rb') as f:
state_dict = torch.load(f)
model.load_state_dict(state_dict)
)
```
To encode SMILES into embeddings, you can use:
```python
with torch.no_grad():
encoded_embeddings = model.encode(df['SMILES'], return_torch=True)
```
For decoder, you can use the function, so you can return from embeddings to SMILES strings:
```python
with torch.no_grad():
decoded_smiles = model.decode(encoded_embeddings)
``` |