Back to all models
fill-mask mask_token: <mask>
Query this model
🔥 This model is currently loaded and running on the Inference API. ⚠️ This model could not be loaded by the inference API. ⚠️ This model can be loaded on the Inference API on-demand.
JSON Output
API endpoint  

⚡️ Upgrade your account to access the Inference API

							curl -X POST \
-H "Authorization: Bearer YOUR_ORG_OR_USER_API_TOKEN" \
-H "Content-Type: application/json" \
-d '"json encoded string"' \
Share Copied link to clipboard

Monthly model downloads

mrm8488/chEMBL_smiles_v1 mrm8488/chEMBL_smiles_v1
last 30 days



Contributed by

mrm8488 Manuel Romero
146 models

How to use this model directly from the 🤗/transformers library:

Copy to clipboard
from transformers import AutoTokenizer, AutoModelWithLMHead tokenizer = AutoTokenizer.from_pretrained("mrm8488/chEMBL_smiles_v1") model = AutoModelWithLMHead.from_pretrained("mrm8488/chEMBL_smiles_v1")

De Novo Drug Design with MLM

What is it?

An approximation to Generative Recurrent Networks for De Novo Drug Design but training a MLM (RoBERTa like) from scratch.


As mentioned in the paper: Generative artificial intelligence models present a fresh approach to chemogenomics and de novo drug design, as they provide researchers with the ability to narrow down their search of the chemical space and focus on regions of interest. They used a generative recurrent neural network (RNN) containing long short‐term memory (LSTM) cell to capture the syntax of molecular representations in terms of SMILES strings. The learned pattern probabilities can be used for de novo SMILES generation. This molecular design concept eliminates the need for virtual compound library enumeration and enables virtual compound design without requiring secondary or external activity prediction.

My Goal 🎯

By training a MLM from scratch on 438552 (cleaned*) SMILES I wanted to build a model that learns this kind of molecular combinations so that given a partial SMILE it can generate plausible combinations so that it can be proposed as new drugs. By cleaned SMILES I mean that I used their SMILES cleaning script to remove duplicates, salts, and stereochemical information. You can see the detailed process of gathering the data, preprocess it and train the LSTM in their repo.

Fast usage with pipelines 🧪

from transformers import pipeline

fill_mask = pipeline(

# CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc1 Atazanavir
smile1 = "CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)<mask>"


# Output:
[{'score': 0.6040295958518982,
  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)nc</s>',
  'token': 265},
 {'score': 0.2185731679201126,
  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)N</s>',
  'token': 50},
 {'score': 0.0642734169960022,
  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc</s>',
  'token': 261},
 {'score': 0.01932266168296337,
  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)CCCl</s>',
  'token': 452},
 {'score': 0.005068355705589056,
  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)C</s>',
  'token': 39}]


I also created a second version without applying the cleaning SMILES script mentioned above. You can use it in the same way as this one.

fill_mask = pipeline(

Original paper Authors:

Swiss Federal Institute of Technology (ETH), Department of Chemistry and Applied Biosciences, Vladimir–Prelog–Weg 4, 8093, Zurich, Switzerland, Stanford University, Department of Computer Science, 450 Sierra Mall, Stanford, CA, 94305, USA, GmbH, 8049, Zurich, Switzerland, Gisbert Schneider, Email: hc.zhte@trebsig.

Created by Manuel Romero/@mrm8488 | LinkedIn

Made with in Spain