Back to all models
fill-mask mask_token: <mask>
Query this model
πŸ”₯ This model is currently loaded and running on the Inference API. ⚠️ This model could not be loaded by the inference API. ⚠️ This model can be loaded on the Inference API on-demand.
JSON Output
API endpoint  

⚑️ Upgrade your account to access the Inference API

Share Copied link to clipboard

Monthly model downloads

seyonec/ChemBERTa-zinc-base-v1 seyonec/ChemBERTa-zinc-base-v1
last 30 days



Contributed by

seyonec Seyone Chithrananda
34 models

How to use this model directly from the πŸ€—/transformers library:

Copy to clipboard
from transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1") model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

ChemBERTa: Training a BERT-like transformer model for masked language modelling of chemical SMILES strings.

Deep learning for chemistry and materials science remains a novel field with lots of potiential. However, the popularity of transfer learning based methods in areas such as NLP and computer vision have not yet been effectively developed in computational chemistry + machine learning. Using HuggingFace's suite of models and the ByteLevel tokenizer, we are able to train on a large corpus of 100k SMILES strings from a commonly known benchmark dataset, ZINC.

Training RoBERTa over 5 epochs, the model achieves a decent loss of 0.398, but may likely continue to decline if trained for a larger number of epochs. The model can predict tokens within a SMILES sequence/molecule, allowing for variants of a molecule within discoverable chemical space to be predicted.

By applying the representations of functional groups and atoms learned by the model, we can try to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.

Additionally, visualization of the attention mechanism have been seen through previous research as incredibly valuable towards chemical reaction classification. The applications of open-sourcing large-scale transformer models such as RoBERTa with HuggingFace may allow for the acceleration of these individual research directions.

A link to a repository which includes the training, uploading and evaluation notebook (with sample predictions on compounds such as Remdesivir) can be found here. All of the notebooks can be copied into a new Colab runtime for easy execution.

Thanks for checking this out!

  • Seyone