---
license: mit
inference: false
tags:
- molecule-generation
- cheminformatics
- targeted-drug-design
- biochemical-language-models
---

## WarmMolGenTwo

A target-specific molecule generation model that is warm started (i.e., initialized) from pretrained biochemical language models and trained on interacting protein-compound pairs, viewing targeted molecule generation as a translation task between the protein and molecular languages. It was introduced in the paper "Exploiting pretrained biochemical language models for targeted drug design", which has been accepted for publication in *Bioinformatics* (published by Oxford University Press), and first released in [this repository](https://github.com/boun-tabi/biochemical-lms-for-drug-design).

WarmMolGenTwo is a Transformer-based encoder-decoder model initialized with [Protein RoBERTa](https://github.com/PaccMann/paccmann_proteomics) and [ChemBERTaLM](https://huggingface.co/gokceuludogan/ChemBERTaLM) checkpoints, and then trained on interacting protein-compound pairs filtered from [BindingDB](https://www.bindingdb.org/rwd/bind/index.jsp). The model takes a protein sequence as input and outputs a SMILES sequence.

## How to use

```python
from transformers import EncoderDecoderModel, RobertaTokenizer

# Separate tokenizers for the protein (encoder) and SMILES (decoder) languages
protein_tokenizer = RobertaTokenizer.from_pretrained("gokceuludogan/WarmMolGenTwo")
mol_tokenizer = RobertaTokenizer.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")
model = EncoderDecoderModel.from_pretrained("gokceuludogan/WarmMolGenTwo")

inputs = protein_tokenizer(
    "MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTG",
    return_tensors="pt",
)
outputs = model.generate(
    **inputs,
    decoder_start_token_id=mol_tokenizer.bos_token_id,
    eos_token_id=mol_tokenizer.eos_token_id,
    pad_token_id=mol_tokenizer.eos_token_id,
    max_length=128,
    num_return_sequences=5,
    do_sample=True,
    top_p=0.95,
)
mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Sample output
['CCOC(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)NCCC[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)Cc1ccc(O)cc1)C(C)C',
 'CCC(C)[C@H](NC(=O)Cn1nc(-c2cccc3ccccc23)c2cnccc2c1=O)C(O)=O',
 'CC(C)[C@H](NC(=O)[C@H](CC(O)=O)NC(=O)[C@@H]1C[C@H]1c1ccccc1)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)OC(C)(C)C',
 'CC[C@@H](C)[C@H](NC(=O)\\C=C\\C(C)\\C=C/C=C(/C)\\C=C(/C)\\C)C(=O)N[C@@H](CC(O)=O)C(=O)N[C@@H](CC(O)=O)C(=O)N[C@@H](Cc1cc(O)c(O)c(O)c1)C(O)=O',
 'CN1C[C@H](Cn2cnc3cc(O)ccc23)Oc2ccc(cc12)C(F)(F)F']
```

## Citation

```bibtex
@article{10.1093/bioinformatics/btac482,
  author  = {Uludoğan, Gökçe and Ozkirimli, Elif and Ulgen, Kutlu O. and Karalı, Nilgün Lütfiye and Özgür, Arzucan},
  title   = "{Exploiting Pretrained Biochemical Language Models for Targeted Drug Design}",
  journal = {Bioinformatics},
  year    = {2022},
  doi     = {10.1093/bioinformatics/btac482},
  url     = {https://doi.org/10.1093/bioinformatics/btac482}
}
```
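
## Validating generated SMILES

Sampled sequences are not guaranteed to be chemically valid, so it is common to filter the decoded strings before downstream use. Below is a minimal sketch, not part of the original release, assuming RDKit is installed and reusing `outputs` and `mol_tokenizer` from the snippet above:

```python
# Sketch: filter generated SMILES with RDKit and keep canonical forms.
# Assumes `outputs` and `mol_tokenizer` exist as in the usage example above.
from rdkit import Chem

smiles_list = mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)

valid_smiles = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)  # returns None if the SMILES cannot be parsed
    if mol is not None:
        valid_smiles.append(Chem.MolToSmiles(mol))  # canonical SMILES string

print(f"{len(valid_smiles)}/{len(smiles_list)} generated sequences are valid")
```

Canonicalizing with `Chem.MolToSmiles` also makes it easy to deduplicate generations, since distinct SMILES strings for the same molecule collapse to one canonical form.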