WarmMolGenOne

A target-specific molecule generation model that is warm-started (i.e. initialized) from pretrained biochemical language models and trained on interacting protein-compound pairs, treating targeted molecular generation as a translation task between protein and molecular languages. It was introduced in the paper "Exploiting pretrained biochemical language models for targeted drug design", published in Bioinformatics (Oxford University Press), and first released in this repository.

WarmMolGenOne is a Transformer-based encoder-decoder model initialized with Protein RoBERTa and ChemBERTa checkpoints and trained on interacting protein-compound pairs filtered from BindingDB. The model takes a protein sequence as input and outputs a SMILES sequence.
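
This kind of warm-starting can be reproduced with the Hugging Face EncoderDecoderModel API. Below is a minimal sketch, not the authors' training code: "path/to/protein-roberta" is a placeholder for a Protein RoBERTa checkpoint, while the ChemBERTa checkpoint is the same one used for the molecule tokenizer in the usage example.

from transformers import EncoderDecoderModel

# Encoder weights come from a pretrained protein language model and decoder
# weights from a pretrained chemical language model; the cross-attention
# layers connecting them are newly initialized and learned during training.
warm_started = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "path/to/protein-roberta",              # placeholder protein encoder checkpoint
    "seyonec/PubChem10M_SMILES_BPE_450k",   # ChemBERTa SMILES decoder
)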

How to use

from transformers import EncoderDecoderModel, RobertaTokenizer

# Protein tokenizer for the encoder input; SMILES tokenizer for the decoder output.
protein_tokenizer = RobertaTokenizer.from_pretrained("gokceuludogan/WarmMolGenOne")
mol_tokenizer = RobertaTokenizer.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")
model = EncoderDecoderModel.from_pretrained("gokceuludogan/WarmMolGenOne")

# Encode a target protein sequence and sample five candidate molecules.
inputs = protein_tokenizer("MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTG", return_tensors="pt")
outputs = model.generate(**inputs, decoder_start_token_id=mol_tokenizer.bos_token_id,
                         eos_token_id=mol_tokenizer.eos_token_id, pad_token_id=mol_tokenizer.eos_token_id,
                         max_length=128, num_return_sequences=5, do_sample=True, top_p=0.95)

# Decode the generated token IDs back into SMILES strings.
mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Sample output
# ['Cn1cc(nn1)-c1ccccc1NS(=O)(=O)c1ccc2[nH]ccc2c1',
# 'CC(C)(C)c1[se]nc2sc(cc12)C(O)=O',
# '[O-][N+](=O)c1ccc(CN2CCC(CC2)NC(=O)c2cccc3ccccc23)cc1',
# 'OC(=O)CNC(=O)CCC\\C=C\\CN1[C@@H](Cc2cn(nn2)-c2ccccc2)C(=O)N[C@@H](CCCN2C(S)=NC(C)(C2=O)c2ccc(F)cc2)C1=O',
# 'OCC1(CCC1)C(=O)NCC1CCN(CC1)c1nc(c(s1)-c1ccc2OCOc2c1)C(O)=O']
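
Sampled SMILES strings are not guaranteed to be chemically valid, so it is common to filter the candidates before downstream use. A minimal sketch, assuming RDKit is installed (this filtering step is our addition, not part of the original model card):

from rdkit import Chem

smiles_list = mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Keep only the strings RDKit can parse into a molecule object.
valid = [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]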

Citation

@article{10.1093/bioinformatics/btac482,
    author = {Uludoğan, Gökçe and Ozkirimli, Elif and Ulgen, Kutlu O. and Karalı, Nilgün Lütfiye and Özgür, Arzucan},
    title = "{Exploiting Pretrained Biochemical Language Models for Targeted Drug Design}",
    journal = {Bioinformatics},
    year = {2022},
    doi = {10.1093/bioinformatics/btac482},
    url = {https://doi.org/10.1093/bioinformatics/btac482}
}