GPT-2 for Unconditional SMILES Generation

This repository hosts a GPT-2-based model for unconditional generation of SMILES strings, pretrained on the ZINC 15 dataset. The model follows the architecture and hyperparameter setup of MolGPT (Bagal et al., 2021) and produces chemically valid SMILES at a high rate (99.68% of 10,000 sampled sequences; see Model Performance below).


Model Overview

Architecture and Configuration

The model is built using the GPT-2 base architecture, with the following configuration:

  from transformers import GPT2Config

  config = GPT2Config(
      vocab_size=tokenizer.vocab_size,  # 10,000 tokens
      n_positions=128,
      n_ctx=128,
      n_embd=256,
      n_layer=8,
      n_head=8,
      resid_pdrop=0.1,
      embd_pdrop=0.1,
      attn_pdrop=0.1,
  )
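
The model itself can be instantiated directly from this configuration. A minimal sketch, assuming the configuration above is stored in config and the tokenizer described below has already been loaded:

  from transformers import GPT2LMHeadModel

  # Randomly initialized GPT-2 with the configuration above (about 6.9M parameters).
  model = GPT2LMHeadModel(config)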

The tokenizer was custom-trained using Byte Pair Encoding (BPE) with a vocabulary size of 10,000 tokens. Below is the tokenizer configuration:

  from transformers import PreTrainedTokenizerFast

  def configure_tokenizer(tokenizer_path):
      tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)
      tokenizer.model_max_length = 128  # matches the model's n_positions
      tokenizer.pad_token = "<pad>"
      tokenizer.bos_token = "<bos>"
      tokenizer.eos_token = "<eos>"
      return tokenizer
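
The tokenizer-training script itself is not included in this card; the snippet below is a minimal sketch of how a 10,000-token BPE tokenizer can be trained with the tokenizers library. The corpus file name (zinc15_smiles.txt) and the special-token set are assumptions for illustration:

  from tokenizers import Tokenizer, models, trainers

  # Train a BPE vocabulary of 10,000 tokens on a file with one SMILES string per line.
  bpe_tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
  trainer = trainers.BpeTrainer(
      vocab_size=10_000,
      special_tokens=["<pad>", "<bos>", "<eos>", "<unk>"],
  )
  bpe_tokenizer.train(files=["zinc15_smiles.txt"], trainer=trainer)
  bpe_tokenizer.save("tokenizer.json")

The saved tokenizer.json can then be loaded with configure_tokenizer above.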

Training Details

  • Pretraining Dataset: ZINC 15 (35,965,323 SMILES strings)
  • Hardware: 8 NVIDIA RTX 2080 Ti GPUs
  • Training Time: 16 hours
  • Hyperparameters (wired into the Trainer sketch after this list):

  from transformers import TrainingArguments

  training_args = TrainingArguments(
      output_dir=output_dir,
      evaluation_strategy="steps",
      learning_rate=5e-4,
      max_steps=100_000,
      per_device_train_batch_size=128,
      save_steps=10_000,
      save_total_limit=3,
      logging_dir=f"{output_dir}/logs",
      logging_steps=10_000,
      warmup_steps=10_000,
      dataloader_num_workers=4,
      gradient_accumulation_steps=1,
      fp16=True,
  )
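
These arguments plug into the standard Hugging Face Trainer for causal language modeling. A minimal sketch, assuming the TrainingArguments above are stored in training_args and that tokenized train_dataset and eval_dataset objects have already been prepared:

  from transformers import Trainer, DataCollatorForLanguageModeling

  # With mlm=False the collator copies the input ids into labels for causal LM training;
  # the GPT-2 model shifts them internally when computing the loss.
  data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      eval_dataset=eval_dataset,
      data_collator=data_collator,
  )
  trainer.train()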

Model Performance

Validity was evaluated by sampling 10,000 sequences at a fixed temperature of 1.0. The results are compared with the validity figures reported for the original MolGPT model (which was evaluated on MOSES and GuacaMol rather than ZINC 15):

  Dataset/Metric       GPT-2 (This Model)   MolGPT
  ZINC 15 Validity     99.68%               N/A
  MOSES Validity       N/A                  99.4%
  GuacaMol Validity    N/A                  98.1%
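
Validity here is the fraction of generated strings that parse into a molecule. Below is a minimal sketch of such a check using RDKit (RDKit is not among the dependencies listed under Usage, so treating it as the evaluation tool is an assumption):

  from rdkit import Chem

  def smiles_validity(smiles_list):
      # A SMILES string counts as valid if RDKit can parse it into a molecule object.
      n_valid = sum(Chem.MolFromSmiles(s) is not None for s in smiles_list)
      return n_valid / len(smiles_list)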

Usage

Install Dependencies

Install the required libraries via pip:

  pip install transformers
  pip install tokenizers

Load the Model and Tokenizer

To use the model for SMILES generation, follow these steps:

Load tokenizer

  from transformers import PreTrainedTokenizerFast

  tokenizer_path = "path_to_tokenizer/tokenizer.json"
  tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)
  tokenizer.pad_token = "<pad>"
  tokenizer.bos_token = "<bos>"
  tokenizer.eos_token = "<eos>"

Load model

  from transformers import GPT2LMHeadModel

  model = GPT2LMHeadModel.from_pretrained("jonghyunlee/MolGPT_pretrained-by-ZINC15")
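
Optionally, move the model to a GPU and switch it to evaluation mode before sampling; a short sketch assuming PyTorch is available:

  import torch

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model = model.to(device).eval()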

Generate SMILES

  def generate_smiles(model, tokenizer, num_sequences=1000, temperature=1.0):
      # With no input_ids provided, generation starts from the BOS token.
      return model.generate(
          max_length=128,
          num_return_sequences=num_sequences,
          pad_token_id=tokenizer.pad_token_id,
          bos_token_id=tokenizer.bos_token_id,
          eos_token_id=tokenizer.eos_token_id,
          do_sample=True,
          temperature=temperature,
          return_dict_in_generate=True,
      )

Decode generated SMILES

  outputs = generate_smiles(model, tokenizer, num_sequences=1000, temperature=1.0)
  generated_smiles = [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs.sequences]
  print(generated_smiles)

Citation

If you use this model in your research, please cite:

Bagal, V., Aggarwal, R., Vinod, P. K., & Priyakumar, U. D. (2021). MolGPT: molecular generation using a transformer-decoder model. Journal of Chemical Information and Modeling, 62(9), 2064-2076.
