Model Card for SMILYBPE

A RoBERTa-based model pretrained with masked language modeling (MLM), ready to be fine-tuned on specific tasks.

Model Details

Model Description

A RoBERTa-based model pretrained with masked language modeling (MLM). It was pretrained on 2 million molecules written in the Simplified Molecular-Input Line-Entry System (SMILES), tokenized with byte-pair encoding (BPE).
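
As a rough illustration of the masked language modeling objective named above, the sketch below selects a fraction of tokens and corrupts them so that a model would have to predict the originals. The 15% selection rate and the 80/10/10 split are the usual BERT/RoBERTa defaults and are an assumption here, since this card does not state the values used. The `mask_tokens` helper, the `<mask>` string, and the example tokens are hypothetical and not taken from the model's tokenizer.

```python
import random

def mask_tokens(tokens, vocab, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) using BERT/RoBERTa-style masking.

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token         # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

# Hypothetical tokenization of a SMILES string (assumed, for illustration only)
tokens = ["CC", "(", "=", "O", ")", "O", "c", "1", "ccccc", "1"]
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

During pretraining the loss would only be computed at positions where `labels` is not `None`.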

Developed by: Miguelangel Leon Mayuare
Funded by: This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020), Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442.
Shared by: Miguelangel Leon Mayuare
Model type: RoBERTa-based
Language(s) (NLP): SMILES
License: MIT

Model Sources

Paper: Leon, M., Perezhohin, Y., Peres, F. et al. Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. Sci Rep 14, 25016 (2024). https://doi.org/10.1038/s41598-024-76440-8

Uses

The model's intended use is fine-tuning on downstream tasks where SMILES is the main input.

Direct Use

The model can be directly used for the classification of chemical compounds and prediction of molecular properties using SMILES representations.

Downstream Use

The model can be fine-tuned for specific tasks such as drug discovery, toxicity prediction, and other cheminformatics applications using specific datasets.

Out-of-Scope Use

The model should not be used for tasks outside of cheminformatics or without proper validation for the specific task. Misuse includes using the model to generate invalid chemical compounds or to make predictions outside the domain of the training data. The model only works with SMILES; for SELFIES models, see the mikemayuare repository.

Bias, Risks, and Limitations

The model may inherit biases from the training data. Limitations include potential overfitting to the pre-training tasks and resource intensity for training and fine-tuning.

Recommendations

The model was pretrained on 2 million SMILES in order to mitigate misrepresentation (over- and under-representation) of any type of molecule. Validating on known datasets for downstream tasks is the best way to assess its limitations.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mikemayuare/SMILYBPE")
model = AutoModel.from_pretrained("mikemayuare/SMILYBPE")

Training Details

Training Data

The training data comprised 2 million molecules from the PubChem dataset.

Training Procedure

The models were pre-trained for 20 epochs using the AdamW optimizer on an NVIDIA 3060 GPU with 12GiB of VRAM.

Preprocessing

SMILES were canonicalized, and the tokenizer was trained on a subset of 1 million molecules from the PubChem dataset.
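
To give a feel for what training a BPE tokenizer on SMILES involves, here is a minimal pure-Python sketch of the merge-learning loop: start from single characters and repeatedly merge the most frequent adjacent pair. It is a toy illustration, not the tokenizer actually used for this model; the function names and the three-molecule corpus are hypothetical, and canonicalization is assumed to have happened beforehand.

```python
from collections import Counter

def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from a list of strings."""
    sequences = [list(s) for s in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges, sequences

# Toy corpus: ethanol, ethylamine, propane
corpus = ["CCO", "CCN", "CCC"]
merges, sequences = learn_bpe_merges(corpus, num_merges=1)
print(merges)     # [('C', 'C')]
print(sequences)  # [['CC', 'O'], ['CC', 'N'], ['CC', 'C']]
```

A production tokenizer would additionally handle special tokens and a target vocabulary size, which this sketch leaves out.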

Training Hyperparameters

Training regime: fp32
Batch size: 32
Number of epochs: 20
Optimizer: AdamW
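
As a back-of-the-envelope check, the hyperparameters above imply the following number of optimizer steps. This assumes exactly 2,000,000 training molecules, that all of them are used in every epoch, and no gradient accumulation; none of these details are confirmed by this card.

```python
import math

num_molecules = 2_000_000  # assumption: "2 million" taken as an exact count
batch_size = 32
num_epochs = 20

steps_per_epoch = math.ceil(num_molecules / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 62500 1250000
```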

Speeds, Sizes, Times

Training time was approximately 72 hours on the specified hardware. Checkpoint sizes are approximately 500MB each.

Evaluation

Testing Data

Testing was conducted on MoleculeNet datasets, specifically BBBP, HIV, and Tox21.

Factors

Evaluation metrics were disaggregated by dataset and task type (e.g., binary classification for BBBP).

Metrics

The primary evaluation metric was the ROC-AUC score, computed on the fine-tuned models. It is commonly used for binary classification tasks in cheminformatics.
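
For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it directly from that definition; the `roc_auc` function and the toy labels and scores are illustrative and are not results from this model.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties contribute 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (made up for illustration)
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.833
```

This pairwise version is quadratic in the number of examples; on real benchmark sets a library routine such as `sklearn.metrics.roc_auc_score` would normally be used instead.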

Results

The models tokenized with APE generally outperformed those tokenized with BPE. SMILES models showed better performance than SELFIES models in most cases.

Summary

The model achieved competitive performance on standard benchmarks, outperforming several baseline models in specific tasks.

Model Examination

Interpretability analyses showed that models tokenized with APE preserved the chemical context better than those with BPE, leading to higher classification accuracy.

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator.

Hardware Type: NVIDIA 3060 GPU
Hours used: 72 hours
Cloud Provider: Not applicable
Compute Region: Local
Carbon Emitted: Approximately 50 kg CO2eq

Technical Specifications

Model Architecture and Objective

The model architecture is based on RoBERTa with 6 hidden layers, 768 hidden size, 1536 intermediate size, and 12 attention heads.
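
The dimensions above can be turned into a rough size estimate for the encoder stack. The sketch assumes the standard BERT/RoBERTa encoder layer (four attention projections with biases, a two-layer feed-forward block, and two LayerNorms). Embeddings and the language modeling head are left out because the vocabulary size is not stated in this card.

```python
hidden, intermediate, heads, layers = 768, 1536, 12, 6

head_dim = hidden // heads  # size of each attention head

# Query, key, value, and output projections, each with a bias vector
attention = 4 * (hidden * hidden + hidden)
# Feed-forward block: hidden -> intermediate -> hidden, with biases
feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
# Two LayerNorms per layer, each with a weight and a bias vector
layer_norms = 2 * (2 * hidden)

per_layer = attention + feed_forward + layer_norms
encoder_total = layers * per_layer
print(head_dim, per_layer, encoder_total)  # 64 4727040 28362240
```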

Compute Infrastructure

Hardware

Type: NVIDIA 3060 GPU
VRAM: 12GiB

Software

Framework: PyTorch
Libraries: transformers, selfies, DeepChem, Optuna

Citation

BibTeX:

@mastersthesis{leon2024chemical,
  title={Chemical Language Modeling},
  author={Miguelangel Augusto Leon Mayuare},
  year={2024},
  school={NOVA Information Management School}
}

APA:

Mayuare, M. A. L. (2024). Chemical Language Modeling (Master's thesis). NOVA Information Management School.

Glossary

SELFIES: A string-based representation of molecules.
SMILES: Simplified Molecular-Input Line-Entry System, a notation for describing the structure of chemical species.

More Information

For more details, refer to the paper listed under Model Sources.

Model Card Authors

Miguelangel Augusto Leon Mayuare

Model Card Contact

For inquiries, please contact migueleonm@gmail.com
