BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning
BindGPT is a new framework for building drug discovery models that leverages compute-efficient pretraining, supervised finetuning, prompting, reinforcement learning, and tool use of language models (LMs). With it, a single pretrained model achieves state-of-the-art performance in 3D molecule generation, 3D conformer generation, and pocket-conditioned 3D molecule generation by posing them as downstream tasks, whereas previous methods build task-specialized models without task transfer abilities. Thanks to fast transformer inference, BindGPT is also two orders of magnitude (100 times) faster at generation than previous methods.
- Website: https://bindgpt.github.io
- Repository: https://github.com/insilicomedicine/bindgpt
- Paper: https://arxiv.org/abs/2406.03686
This page provides the pretrained version of BindGPT. The pretrained model is capable of zero-shot molecule generation and conformer generation within the distribution of the Uni-Mol dataset. We also expose finetuned models:
- For the model finetuned on GEOM-DRUGS, visit huggingface.co/insilicomedicine/bindgpt_finetuned
- The model finetuned with Reinforcement Learning on CrossDocked is coming soon
Unconditional generation
The code below provides a minimal standalone example of sampling molecules from the model. It depends only on transformers, tokenizers, rdkit, and pytorch, and it is not meant to reproduce the sampling speed reported in the paper (e.g. it does not use flash-attention, mixed precision, or large-batch sampling). To reproduce the sampling speed, please use the code from our repository: https://github.com/insilicomedicine/bindgpt
# Download the model from Hugging Face:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("insilicomedicine/bindgpt_pretrained")
model = AutoModelForCausalLM.from_pretrained("insilicomedicine/bindgpt_pretrained").cuda()
# Generate 10 tokenized molecules without condition
NUM_SAMPLES = 10
start_tokens = tokenizer("<LIGAND>", return_tensors="pt")
outputs = model.generate(
    # remove EOS token to continue generation
    input_ids=start_tokens['input_ids'][:, :-1].cuda(),
    attention_mask=start_tokens['attention_mask'][:, :-1].cuda(),
    do_sample=True, max_length=400, num_return_sequences=NUM_SAMPLES
)
# parse results
import re
from rdkit import Chem
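def parse_molecule(s):
    """Parse a decoded sample into an RDKit molecule with one conformer, or None if malformed."""
    try:
        assert '<LIGAND>' in s and '<XYZ>' in s
        _, smiles, xyz = re.split(r'<LIGAND>|<XYZ>', s)
        # the tokenizer may insert whitespace between SMILES tokens
        smiles = re.sub(r'\s', '', smiles)
        conf = Chem.Conformer()
        mol = Chem.MolFromSmiles(smiles)
        assert mol is not None
        # skip the first two space-separated fields, the rest are coordinates
        coords = list(map(float, xyz.split(' ')[2:]))
        assert len(coords) == (3 * mol.GetNumAtoms())
        for j in range(mol.GetNumAtoms()):
            conf.SetAtomPosition(j, [coords[3*j], coords[3*j+1], coords[3*j+2]])
        mol.AddConformer(conf)
        return mol
    except (AssertionError, ValueError):
        # ValueError covers non-numeric coordinates and repeated marker tokens
        return None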
string_molecules = tokenizer.batch_decode(outputs, skip_special_tokens=True)
molecules = [parse_molecule(mol) for mol in string_molecules]
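Sampled strings are not guaranteed to parse, so it can help to check the text format before involving RDKit. The sketch below is a pure-Python illustration of the layout that parse_molecule expects: a SMILES string between <LIGAND> and <XYZ>, followed by space-separated fields of which the first two are skipped and the rest are read as x, y, z triples. The helper name split_sample and the sample strings are hypothetical, made up for illustration rather than taken from the model; real decoded outputs may differ in detail.

```python
import re

def split_sample(text):
    """Split a decoded sample into (smiles, [(x, y, z), ...]); return None if malformed."""
    parts = re.split(r"<LIGAND>|<XYZ>", text)
    if len(parts) != 3:
        return None
    _, smiles, xyz = parts
    smiles = re.sub(r"\s", "", smiles)
    try:
        # Same convention as parse_molecule: skip the first two space-separated fields.
        values = [float(v) for v in xyz.split(" ")[2:]]
    except ValueError:
        return None
    if not smiles or len(values) % 3 != 0:
        return None
    coords = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return smiles, coords

# Hypothetical decoded strings (the "3" is only a placeholder for the skipped field).
samples = [
    "<LIGAND>C C O<XYZ> 3 0.0 0.0 0.0 1.5 0.0 0.0 2.0 1.2 0.0",
    "<LIGAND>CCO",  # truncated sample without coordinates
]
parsed = [split_sample(s) for s in samples]
valid = [p for p in parsed if p is not None]
print(f"{len(valid)}/{len(samples)} samples parsed")
```

Unlike parse_molecule, this check cannot compare the number of coordinates with the number of atoms, since that requires RDKit to interpret the SMILES. The same filtering pattern applies to the molecules list above, where failed samples are None.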
Conformer generation
The code below provides a minimal standalone example of sampling conformers for given molecules from the model. It reuses the tokenizer, model, and parse_molecule defined above and depends only on transformers, tokenizers, rdkit, and pytorch. It is not meant to reproduce the sampling speed reported in the paper (e.g. it does not use flash-attention, mixed precision, or large-batch sampling). To reproduce the sampling speed, please use the code from our repository: https://github.com/insilicomedicine/bindgpt
smiles = [
'O=c1n(CCO)c2ccccc2n1CCO',
'Cc1ccc(C#N)cc1S(=O)(=O)NCc1ccnc(OC(C)(C)C)c1',
'COC(=O)Cc1csc(NC(=O)Cc2coc3cc(C)ccc23)n1',
]
# pad on the left so that prompts are right-aligned and
# generation continues directly after the end of each prompt
tokenizer.padding_side = 'left'
# Do not forget to add the <XYZ> token
# after the smiles, otherwise the model might
# want to continue generating the molecule :)
prompts = tokenizer(
    ["<LIGAND>" + s + '<XYZ>' for s in smiles], return_tensors="pt",
    truncation=True, padding=True,
)
# Generate 1 conformer per molecule
outputs = model.generate(
    # remove EOS token to continue generation
    input_ids=prompts['input_ids'][:, :-1].cuda(),
    attention_mask=prompts['attention_mask'][:, :-1].cuda(),
    do_sample=True, max_length=400,
    # you can combine this type of conditional generation
    # with multi-sample generation.
    # to sample many conformers per molecule, uncomment this
    # num_return_sequences=10
)
# parse results
string_molecules = tokenizer.batch_decode(outputs, skip_special_tokens=True)
molecules = [parse_molecule(mol) for mol in string_molecules]
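When num_return_sequences is set above 1, generate returns a flat batch in which the samples for each prompt are contiguous (all samples for the first prompt, then all for the second, and so on). A minimal sketch of regrouping such a flat list per input molecule follows; group_by_prompt is a hypothetical helper, not part of the BindGPT repository, and the placeholder strings stand in for parsed molecules.

```python
def group_by_prompt(items, num_return_sequences):
    """Split a flat, prompt-major list of samples into one sub-list per prompt."""
    n = num_return_sequences
    if n < 1 or len(items) % n != 0:
        raise ValueError("number of items must be a multiple of num_return_sequences")
    return [items[i:i + n] for i in range(0, len(items), n)]

# Placeholder data: 2 prompts with 3 samples each, in the order generate returns them.
example_smiles = ["CCO", "c1ccccc1"]
flat_samples = ["CCO_a", "CCO_b", "CCO_c", "benz_a", "benz_b", "benz_c"]

conformers = dict(zip(example_smiles, group_by_prompt(flat_samples, 3)))
print(conformers["c1ccccc1"])
```

With the code above, the equivalent call would be group_by_prompt(molecules, 10) after uncommenting num_return_sequences=10; entries that failed to parse remain as None inside each group.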
Usage and License
Please note that all model weights are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage. We emphatically urge all users to adhere to the highest ethical standards when using our models, including maintaining fairness, transparency, and responsibility in their research. Any usage that may lead to harm or pose a detriment to society is strictly forbidden.
References
If you use our repository, please cite the following related paper:
@article{zholus2021bindgpt,
  author = {Artem Zholus and Maksim Kuznetsov and Roman Schutski and Rim Shayakhmetov and Daniil Polykovskiy and Sarath Chandar and Alex Zhavoronkov},
  title = {BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning},
  journal = {arXiv},
  year = {2024},
}