|
--- |
|
license: apache-2.0 |
|
base_model: Derify/ModChemBERT-MLM-DAPT |
|
datasets: |
|
- Derify/augmented_canonical_druglike_QED_Pfizer_15M |
|
metrics: |
|
- roc_auc |
|
- rmse |
|
library_name: transformers |
|
tags: |
|
- modernbert |
|
- ModChemBERT |
|
- cheminformatics |
|
- chemical-language-model |
|
- molecular-property-prediction |
|
- mergekit |
|
- merge |
|
pipeline_tag: fill-mask |
|
model-index: |
|
- name: Derify/ModChemBERT
|
results: |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: BACE |
|
type: BACE |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8346 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: BBBP |
|
type: BBBP |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7573 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: CLINTOX |
|
type: CLINTOX |
|
metrics: |
|
- type: roc_auc |
|
value: 0.9938 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: HIV |
|
type: HIV |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7737 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: SIDER |
|
type: SIDER |
|
metrics: |
|
- type: roc_auc |
|
value: 0.6600 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: TOX21 |
|
type: TOX21 |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7518 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: BACE |
|
type: BACE |
|
metrics: |
|
- type: rmse |
|
value: 0.9665 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: CLEARANCE |
|
type: CLEARANCE |
|
metrics: |
|
- type: rmse |
|
value: 44.0137 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ESOL |
|
type: ESOL |
|
metrics: |
|
- type: rmse |
|
value: 0.8158 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: FREESOLV |
|
type: FREESOLV |
|
metrics: |
|
- type: rmse |
|
value: 0.4979 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: LIPO |
|
type: LIPO |
|
metrics: |
|
- type: rmse |
|
value: 0.6505 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: Antimalarial |
|
type: Antimalarial |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8966 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: Cocrystal |
|
type: Cocrystal |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8654 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: COVID19 |
|
type: COVID19 |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8132 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME microsom stab human |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.4248 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME microsom stab rat |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.4403 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME permeability |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.5025 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME ppb human |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.8901 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME ppb rat |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.7268 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME solubility |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.4627 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: AstraZeneca CL |
|
type: AstraZeneca |
|
metrics: |
|
- type: rmse |
|
value: 0.4932 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: AstraZeneca LogD74 |
|
type: AstraZeneca |
|
metrics: |
|
- type: rmse |
|
value: 0.7596 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: AstraZeneca PPB |
|
type: AstraZeneca |
|
metrics: |
|
- type: rmse |
|
value: 0.1150 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: AstraZeneca Solubility |
|
type: AstraZeneca |
|
metrics: |
|
- type: rmse |
|
value: 0.8735 |
|
--- |
|
|
|
# ModChemBERT: ModernBERT as a Chemical Language Model |
|
ModChemBERT is a ModernBERT-based chemical language model (CLM) trained on SMILES strings with masked language modeling (MLM) and fine-tuned for downstream molecular property prediction (classification & regression).
|
|
|
## Usage |
|
Install `transformers` v4.56.1 or later:
|
|
|
```bash |
|
pip install -U "transformers>=4.56.1"
|
``` |
|
|
|
### Load Model |
|
```python |
|
from transformers import AutoModelForMaskedLM, AutoTokenizer |
|
|
|
model_id = "Derify/ModChemBERT" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
model = AutoModelForMaskedLM.from_pretrained( |
|
model_id, |
|
trust_remote_code=True, |
|
dtype="float16", |
|
device_map="auto", |
|
) |
|
``` |
|
|
|
### Fill-Mask Pipeline |
|
```python |
|
from transformers import pipeline |
|
|
|
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer) |
|
print(fill("c1ccccc1[MASK]")) |
|
``` |
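
### Embedding Extraction (Example)

For similarity search or as input features to downstream property models, fixed-size molecule embeddings can be derived from the encoder outputs. The sketch below uses simple masked mean pooling over the last hidden layer for illustration only; the fine-tuned checkpoints use the pooling strategies described in the Pooling section, and `output_hidden_states` support is assumed to carry over from ModernBERT.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]
enc = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Masked mean pooling over the last hidden layer -> one vector per molecule
hidden = out.hidden_states[-1]                      # (batch, seq, hidden)
mask = enc["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                             # (3, 768)
```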
|
|
|
## Architecture |
|
- Backbone: ModernBERT |
|
- Hidden size: 768 |
|
- Intermediate size: 1152 |
|
- Encoder Layers: 22 |
|
- Attention heads: 12 |
|
- Max sequence length: 256 tokens (MLM primarily trained with 128-token sequences) |
|
- Tokenizer: BPE tokenizer using [MolFormer's vocab](https://github.com/emapco/ModChemBERT/blob/main/modchembert/tokenizers/molformer/vocab.json) (2362 tokens) |
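
These dimensions can be checked against the checkpoint configuration. The snippet below is a quick sanity check; the attribute names are the standard ModernBERT config fields and are assumed to carry over to ModChemBERT.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Derify/ModChemBERT", trust_remote_code=True)
print(config.hidden_size)          # 768
print(config.intermediate_size)    # 1152
print(config.num_hidden_layers)    # 22
print(config.num_attention_heads)  # 12
```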
|
|
|
## Pooling (Classifier / Regressor Head) |
|
Kallergis et al. [1] demonstrated that the CLM embedding method prior to the prediction head was the strongest contributor to downstream performance among evaluated hyperparameters. |
|
|
|
Behrendt et al. [2] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the `max_seq_mha` pooling method was particularly effective in low-data regimes, which is often the case for molecular property prediction tasks. |
|
|
|
Multiple pooling strategies are supported by ModChemBERT to explore their impact on downstream performance: |
|
- `cls`: Last layer [CLS] |
|
- `mean`: Mean over last hidden layer |
|
- `max_cls`: Max over last k layers of [CLS] |
|
- `cls_mha`: MHA with [CLS] as query |
|
- `max_seq_mha`: MHA with max pooled sequence as KV and max pooled [CLS] as query |
|
- `sum_mean`: Sum over all layers then mean tokens |
|
- `sum_sum`: Sum over all layers then sum tokens |
|
- `mean_mean`: Mean over all layers then mean tokens |
|
- `mean_sum`: Mean over all layers then sum tokens |
|
- `max_seq_mean`: Max over last k layers then mean tokens |
|
|
|
Note: ModChemBERT’s `max_seq_mha` differs from MaxPoolBERT [2]. MaxPoolBERT uses PyTorch `nn.MultiheadAttention`, whereas ModChemBERT's `ModChemBertPoolingAttention` adapts ModernBERT’s `ModernBertAttention`. |
|
On ChemBERTa-3 benchmarks this variant produced stronger validation metrics and avoided the training instabilities (sporadic zero / NaN losses and gradient norms) seen with `nn.MultiheadAttention`. Training instability with ModernBERT has been reported in the past ([discussion 1](https://huggingface.co/answerdotai/ModernBERT-base/discussions/59) and [discussion 2](https://huggingface.co/answerdotai/ModernBERT-base/discussions/63)). |
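
For illustration, the sketch below re-implements two of the strategies (`max_cls` and `max_seq_mean`) directly from the encoder's hidden states. It mirrors the descriptions above but is not the model's internal `ModChemBertPoolingAttention` head; `k` corresponds to the "last k layers" hyperparameter listed later in this card.

```python
import torch

def max_cls(hidden_states, k=3):
    """Max over the last k layers of the [CLS] token."""
    # hidden_states: tuple of (num_layers + 1) tensors, each (batch, seq, hidden)
    stacked = torch.stack(hidden_states[-k:], dim=0)   # (k, batch, seq, hidden)
    return stacked.max(dim=0).values[:, 0]             # (batch, hidden)

def max_seq_mean(hidden_states, attention_mask, k=3):
    """Max over the last k layers, then masked mean over tokens."""
    pooled = torch.stack(hidden_states[-k:], dim=0).max(dim=0).values  # (batch, seq, hidden)
    mask = attention_mask.unsqueeze(-1).float()
    return (pooled * mask).sum(dim=1) / mask.sum(dim=1)                # (batch, hidden)

# Usage with a model loaded as in the Usage section:
# out = model(**enc, output_hidden_states=True)
# emb = max_seq_mean(out.hidden_states, enc["attention_mask"], k=3)
```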
|
|
|
## Training Pipeline |
|
<div align="center"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/656892962693fa22e18b5331/bxNbpgMkU8m60ypyEJoWQ.png" alt="ModChemBERT Training Pipeline" width="650"/> |
|
</div> |
|
|
|
### Rationale for MTR Stage |
|
Following Sultan et al. [3], multi-task regression (MTR) on physicochemical properties biases the latent space toward ADME-relevant representations before the narrower TAFT specialization. Sultan et al. observed that MLM pretraining followed by DAPT with MTR outperforms MLM-only, MTR-only, and MTR followed by DAPT with MTR.
|
|
|
### Checkpoint Averaging Motivation |
|
Checkpoint averaging is inspired by ModernBERT [4], JaColBERTv2.5 [5], and Llama 3.1 [6], whose results show that model merging can enhance generalization or performance while mitigating overfitting to any single fine-tuning or annealing checkpoint.
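
As a conceptual illustration, checkpoint averaging can be as simple as uniform parameter averaging across fine-tuned checkpoints. The sketch below is not the exact merging procedure used here (the released models were merged with mergekit, per the model tags), and the file names are hypothetical.

```python
import torch

# Hypothetical checkpoint files; uniform (equal-weight) parameter averaging
paths = ["taft_seed0.bin", "taft_seed1.bin", "taft_seed2.bin"]
states = [torch.load(p, map_location="cpu") for p in paths]

merged = {}
for key in states[0]:
    tensors = [s[key] for s in states]
    if tensors[0].is_floating_point():
        merged[key] = torch.stack(tensors).mean(dim=0)  # average weights
    else:
        merged[key] = tensors[0]                        # copy non-float buffers as-is

torch.save(merged, "merged_checkpoint.bin")
```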
|
|
|
## Datasets |
|
- Pretraining: [Derify/augmented_canonical_druglike_QED_Pfizer_15M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_Pfizer_15M) (canonical_smiles column) |
|
- Domain Adaptive Pretraining (DAPT) & Task Adaptive Fine-tuning (TAFT): ADME (6 tasks) + AstraZeneca (4 tasks) datasets, split with DA4MT's [3] Bemis-Murcko scaffold splitter (see [domain-adaptation-molecular-transformers](https://github.com/emapco/ModChemBERT/blob/main/domain-adaptation-molecular-transformers/da4mt/splitting.py); an illustrative scaffold-grouping sketch follows this list)
|
- Benchmarking: |
|
- ChemBERTa-3 [7] |
|
- classification: BACE, BBBP, TOX21, HIV, SIDER, CLINTOX |
|
- regression: ESOL, FREESOLV, LIPO, BACE, CLEARANCE |
|
- Mswahili et al. [8] proposed additional datasets for benchmarking chemical language models:
|
- classification: Antimalarial [9], Cocrystal [10], COVID19 [11] |
|
- DAPT/TAFT stage regression datasets: |
|
- ADME [12]: adme_microsom_stab_h, adme_microsom_stab_r, adme_permeability, adme_ppb_h, adme_ppb_r, adme_solubility |
|
- AstraZeneca: astrazeneca_CL, astrazeneca_LogD74, astrazeneca_PPB, astrazeneca_Solubility |
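
As referenced in the DAPT/TAFT bullet above, those splits group molecules by Bemis-Murcko scaffold. The RDKit sketch below illustrates the idea; it is a simplified stand-in for the DA4MT splitter linked above, and the example SMILES and split ratio are placeholders.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles = ["CCOc1ccccc1", "CCN(CC)c1ccccc1", "CCO", "OCCO", "Cc1ccncc1"]

# Group molecules by their Bemis-Murcko scaffold SMILES ("" for acyclic molecules)
groups = defaultdict(list)
for s in smiles:
    groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=s)].append(s)

# Assign whole scaffold groups to train/test so a scaffold never appears in both splits
train, test = [], []
for group in sorted(groups.values(), key=len, reverse=True):
    (train if len(train) <= 4 * len(test) else test).extend(group)

print(train, test)
```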
|
|
|
## Benchmarking |
|
Benchmarks were conducted with the ChemBERTa-3 framework. DeepChem scaffold splits were used for all datasets except the Antimalarial dataset, which used a random split. Each task was trained for 100 epochs, with results averaged across 3 random seeds.
|
|
|
The complete hyperparameter configurations for these benchmarks are available here: [ChemBERTa3 configs](https://github.com/emapco/ModChemBERT/tree/main/conf/chemberta3) |
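
For orientation only, the sketch below shows a minimal fine-tuning loop with the Hugging Face `Trainer`. It is not the benchmark harness (the results below come from the ChemBERTa-3 framework with the linked configs), the toy dataset and hyperparameters are placeholders, and it assumes the checkpoint's remote code exposes a sequence-classification head.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "Derify/ModChemBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

# Toy dataset for illustration; real runs use the benchmark scaffold splits
train = Dataset.from_dict({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
    "label": [0, 1, 0, 1],
})
train = train.map(
    lambda batch: tokenizer(batch["smiles"], truncation=True, max_length=256),
    batched=True,
)

args = TrainingArguments(
    output_dir="modchembert-finetune",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=3e-5,
    report_to="none",
)
Trainer(model=model, args=args, train_dataset=train, processing_class=tokenizer).train()
```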
|
|
|
### Evaluation Methodology |
|
- Classification Metric: ROC AUC |
|
- Regression Metric: RMSE |
|
- Aggregation: Mean ± standard deviation of the triplicate results. |
|
- Input Constraints: SMILES truncated / filtered to ≤200 tokens, following ChemBERTa-3's recommendation. |
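
One way to apply the ≤200-token constraint with the model's own tokenizer is sketched below; the ChemBERTa-3 framework may implement this filtering step differently.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Derify/ModChemBERT")

def within_token_limit(smiles: str, max_tokens: int = 200) -> bool:
    # Count tokens (including special tokens), mirroring the benchmark cutoff
    return len(tokenizer(smiles)["input_ids"]) <= max_tokens

smiles_list = ["CCO", "c1ccccc1O"]
filtered = [s for s in smiles_list if within_token_limit(s)]
```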
|
|
|
### Results |
|
<details><summary>Click to expand</summary> |
|
|
|
#### ChemBERTa-3 Classification Datasets (ROC AUC - Higher is better) |
|
|
|
| Model | BACE↑ | BBBP↑ | CLINTOX↑ | HIV↑ | SIDER↑ | TOX21↑ | AVG† | |
|
| ---------------------------------------------------------------------------- | ----------------- | ----------------- | --------------------- | --------------------- | --------------------- | ----------------- | ------ | |
|
| **Tasks** | 1 | 1 | 2 | 1 | 27 | 12 | | |
|
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)* | 0.781 ± 0.019 | 0.700 ± 0.027 | 0.979 ± 0.022 | 0.740 ± 0.013 | 0.611 ± 0.002 | 0.718 ± 0.011 | 0.7548 | |
|
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)* | 0.819 ± 0.019 | 0.735 ± 0.019 | 0.839 ± 0.013 | 0.762 ± 0.005 | 0.618 ± 0.005 | 0.723 ± 0.012 | 0.7493 | |
|
| MoLFormer-LHPC* | **0.887 ± 0.004** | **0.908 ± 0.013** | 0.993 ± 0.004 | 0.750 ± 0.003 | 0.622 ± 0.007 | **0.791 ± 0.014** | 0.8252 | |
|
| | | | | | | | | |
|
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 0.8065 ± 0.0103 | 0.7222 ± 0.0150 | 0.9709 ± 0.0227 | ***0.7800 ± 0.0133*** | 0.6419 ± 0.0113 | 0.7400 ± 0.0044 | 0.7769 | |
|
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | 0.8224 ± 0.0156 | 0.7402 ± 0.0095 | 0.9820 ± 0.0138 | 0.7702 ± 0.0020 | 0.6303 ± 0.0039 | 0.7360 ± 0.0036 | 0.7802 | |
|
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 0.7924 ± 0.0155 | 0.7282 ± 0.0058 | 0.9725 ± 0.0213 | 0.7770 ± 0.0047 | 0.6542 ± 0.0128 | *0.7646 ± 0.0039* | 0.7815 | |
|
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.8213 ± 0.0051 | 0.7356 ± 0.0094 | 0.9664 ± 0.0202 | 0.7750 ± 0.0048 | 0.6415 ± 0.0094 | 0.7263 ± 0.0036 | 0.7777 | |
|
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | *0.8346 ± 0.0045* | *0.7573 ± 0.0120* | ***0.9938 ± 0.0017*** | 0.7737 ± 0.0034 | ***0.6600 ± 0.0061*** | 0.7518 ± 0.0047 | 0.7952 | |
|
|
|
#### ChemBERTa-3 Regression Datasets (RMSE - Lower is better) |
|
|
|
| Model | BACE↓ | CLEARANCE↓ | ESOL↓ | FREESOLV↓ | LIPO↓ | AVG‡ | |
|
| ---------------------------------------------------------------------------- | --------------------- | ---------------------- | --------------------- | --------------------- | --------------------- | ---------------- | |
|
| **Tasks** | 1 | 1 | 1 | 1 | 1 | | |
|
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)* | 1.011 ± 0.038 | 51.582 ± 3.079 | 0.920 ± 0.011 | 0.536 ± 0.016 | 0.758 ± 0.013 | 0.8063 / 10.9614 | |
|
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)* | 1.094 ± 0.126 | 52.058 ± 2.767 | 0.829 ± 0.019 | 0.572 ± 0.023 | 0.728 ± 0.016 | 0.8058 / 11.0562 | |
|
| MoLFormer-LHPC* | 1.201 ± 0.100 | 45.74 ± 2.637 | 0.848 ± 0.031 | 0.683 ± 0.040 | 0.895 ± 0.080 | 0.9068 / 9.8734 | |
|
| | | | | | | |
|
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 1.0893 ± 0.1319 | 49.0005 ± 1.2787 | 0.8456 ± 0.0406 | 0.5491 ± 0.0134 | 0.7147 ± 0.0062 | 0.7997 / 10.4398 | |
|
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | 0.9931 ± 0.0258 | 45.4951 ± 0.7112 | 0.9319 ± 0.0153 | 0.6049 ± 0.0666 | 0.6874 ± 0.0040 | 0.8043 / 9.7425 | |
|
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 1.0304 ± 0.1146 | 47.8418 ± 0.4070 | ***0.7669 ± 0.0024*** | 0.5293 ± 0.0267 | 0.6708 ± 0.0074 | 0.7493 / 10.1678 | |
|
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.9713 ± 0.0224 | ***42.8010 ± 3.3475*** | 0.8169 ± 0.0268 | 0.5445 ± 0.0257 | 0.6820 ± 0.0028 | 0.7537 / 9.1631 | |
|
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | ***0.9665 ± 0.0250*** | 44.0137 ± 1.1110 | 0.8158 ± 0.0115 | ***0.4979 ± 0.0158*** | ***0.6505 ± 0.0126*** | 0.7327 / 9.3889 | |
|
|
|
#### Mswahili, et al. [8] Proposed Classification Datasets (ROC AUC - Higher is better) |
|
|
|
| Model | Antimalarial↑ | Cocrystal↑ | COVID19↑ | AVG† | |
|
| ---------------------------------------------------------------------------- | --------------------- | --------------------- | --------------------- | ------ | |
|
| **Tasks** | 1 | 1 | 1 | | |
|
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 0.8707 ± 0.0032 | 0.7967 ± 0.0124 | 0.8106 ± 0.0170 | 0.8260 | |
|
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | 0.8756 ± 0.0056 | 0.8288 ± 0.0143 | 0.8029 ± 0.0159 | 0.8358 | |
|
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 0.8832 ± 0.0051 | 0.7866 ± 0.0204 | ***0.8308 ± 0.0026*** | 0.8335 | |
|
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.8819 ± 0.0052 | 0.8550 ± 0.0106 | 0.8013 ± 0.0118 | 0.8461 | |
|
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | ***0.8966 ± 0.0045*** | ***0.8654 ± 0.0080*** | 0.8132 ± 0.0195 | 0.8584 | |
|
|
|
#### ADME/AstraZeneca Regression Datasets (RMSE - Lower is better) |
|
|
|
Hyperparameter optimization for the TAFT stage appears to induce overfitting, as the `MLM + DAPT + TAFT OPT` model shows slightly degraded performance on the ADME/AstraZeneca datasets compared to the `MLM + DAPT + TAFT` model. |
|
The `MLM + DAPT + TAFT` model, a merge of unoptimized TAFT checkpoints trained with `max_seq_mean` pooling, achieved the best overall performance across the ADME/AstraZeneca datasets. |
|
|
|
| | ADME | | | | | | AstraZeneca | | | | | |
|
| ---------------------------------------------------------------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------ | |
|
| Model | microsom_stab_h↓ | microsom_stab_r↓ | permeability↓ | ppb_h↓ | ppb_r↓ | solubility↓ | CL↓ | LogD74↓ | PPB↓ | Solubility↓ | AVG† | |
|
| | | | | | | | | | | | |
|
| **Tasks** | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | |
|
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 0.4489 ± 0.0114 | 0.4685 ± 0.0225 | 0.5423 ± 0.0076 | 0.8041 ± 0.0378 | 0.7849 ± 0.0394 | 0.5191 ± 0.0147 | **0.4812 ± 0.0073** | 0.8204 ± 0.0070 | 0.1365 ± 0.0066 | 0.9614 ± 0.0189 | 0.5967 | |
|
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | **0.4199 ± 0.0064** | 0.4568 ± 0.0091 | 0.5042 ± 0.0135 | 0.8376 ± 0.0629 | 0.8446 ± 0.0756 | 0.4800 ± 0.0118 | 0.5351 ± 0.0036 | 0.8191 ± 0.0066 | 0.1237 ± 0.0022 | 0.9280 ± 0.0088 | 0.5949 | |
|
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 0.4375 ± 0.0027 | 0.4542 ± 0.0024 | 0.5202 ± 0.0141 | **0.7618 ± 0.0138** | 0.7027 ± 0.0023 | 0.5023 ± 0.0107 | 0.5104 ± 0.0110 | 0.7599 ± 0.0050 | 0.1233 ± 0.0088 | 0.8730 ± 0.0112 | 0.5645 | |
|
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.4206 ± 0.0071 | **0.4400 ± 0.0039** | **0.4899 ± 0.0068** | 0.8927 ± 0.0163 | **0.6942 ± 0.0397** | 0.4641 ± 0.0082 | 0.5022 ± 0.0136 | **0.7467 ± 0.0041** | 0.1195 ± 0.0026 | **0.8564 ± 0.0265** | 0.5626 | |
|
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | 0.4248 ± 0.0041 | 0.4403 ± 0.0046 | 0.5025 ± 0.0029 | 0.8901 ± 0.0123 | 0.7268 ± 0.0090 | **0.4627 ± 0.0083** | 0.4932 ± 0.0079 | 0.7596 ± 0.0044 | **0.1150 ± 0.0002** | 0.8735 ± 0.0053 | 0.5689 | |
|
|
|
|
|
**Bold** indicates the best result in the column; *italic* indicates the best result among ModChemBERT checkpoints.<br/> |
|
\* Published results from the ChemBERTa-3 [7] paper for optimized chemical language models using DeepChem scaffold splits.<br/> |
|
† AVG column shows the mean score across classification tasks.<br/> |
|
‡ AVG column shows the mean scores across regression tasks without and with the clearance score. |
|
|
|
</details> |
|
|
|
## Optimized ModChemBERT Hyperparameters |
|
|
|
<details><summary>Click to expand</summary> |
|
|
|
### TAFT Datasets |
|
Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model: |
|
|
|
| Dataset | Learning Rate | Batch Size | Warmup Ratio | Classifier Pooling | Last k Layers | |
|
| ---------------------- | ------------- | ---------- | ------------ | ------------------ | ------------- | |
|
| adme_microsom_stab_h | 3e-5 | 8 | 0.0 | max_seq_mean | 5 | |
|
| adme_microsom_stab_r | 3e-5 | 16 | 0.2 | max_cls | 3 | |
|
| adme_permeability | 3e-5 | 8 | 0.0 | max_cls | 3 | |
|
| adme_ppb_h | 1e-5 | 32 | 0.1 | max_seq_mean | 5 | |
|
| adme_ppb_r | 1e-5 | 32 | 0.0 | sum_mean | N/A | |
|
| adme_solubility | 3e-5 | 32 | 0.0 | sum_mean | N/A | |
|
| astrazeneca_CL | 3e-5 | 8 | 0.1 | max_seq_mha | 3 | |
|
| astrazeneca_LogD74 | 1e-5 | 8 | 0.0 | max_seq_mean | 5 | |
|
| astrazeneca_PPB | 1e-5 | 32 | 0.0 | max_cls | 3 | |
|
| astrazeneca_Solubility | 1e-5 | 32 | 0.0 | max_seq_mean | 5 | |
|
|
|
### Benchmarking Datasets |
|
Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model: |
|
|
|
| Dataset | Batch Size | Classifier Pooling | Last k Layers | Pooling Attention Dropout | Classifier Dropout | Embedding Dropout | |
|
| ------------------- | ---------- | ------------------ | ------------- | ------------------------- | ------------------ | ----------------- | |
|
| bace_classification | 32 | max_seq_mha | 3 | 0.0 | 0.0 | 0.0 | |
|
| bbbp | 64 | max_cls | 3 | 0.1 | 0.0 | 0.0 | |
|
| clintox | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 | |
|
| hiv | 32 | max_seq_mha | 3 | 0.0 | 0.0 | 0.0 | |
|
| sider | 32 | mean | N/A | 0.1 | 0.0 | 0.1 | |
|
| tox21 | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 | |
|
| bace_regression     | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
|
| clearance | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 | |
|
| esol | 64 | sum_mean | N/A | 0.1 | 0.0 | 0.1 | |
|
| freesolv | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 | |
|
| lipo | 32 | max_seq_mha | 3 | 0.1 | 0.1 | 0.1 | |
|
| antimalarial | 16 | max_seq_mha | 3 | 0.1 | 0.1 | 0.1 | |
|
| cocrystal | 16 | max_cls | 3 | 0.1 | 0.0 | 0.1 | |
|
| covid19 | 16 | sum_mean | N/A | 0.1 | 0.0 | 0.1 | |
|
|
|
</details> |
|
|
|
## Intended Use |
|
* Primary: Research and development for molecular property prediction, experimentation with pooling strategies, and as a foundational model for downstream applications. |
|
* Appropriate for: Binary / multi-class classification (e.g., toxicity, activity) and single-task or multi-task regression (e.g., solubility, clearance) after fine-tuning. |
|
* Not intended for generating novel molecules. |
|
|
|
## Limitations |
|
- Out-of-domain performance may degrade for very long (>128 token) SMILES, inorganic / organometallic compounds, polymers, and charged or enumerated tautomers, which are not well represented in the training data.
|
- No guarantee of synthesizability, safety, or biological efficacy. |
|
|
|
## Ethical Considerations & Responsible Use |
|
- Potential biases arise from training corpora skewed to drug-like space. |
|
- Do not deploy in clinical or regulatory settings without rigorous, domain-specific validation. |
|
|
|
## Hardware |
|
Training and experiments were performed on 2 NVIDIA RTX 3090 GPUs. |
|
|
|
## Citation |
|
If you use ModChemBERT in your research, please cite the checkpoint and the following: |
|
``` |
|
@software{cortes-2025-modchembert, |
|
author = {Emmanuel Cortes}, |
|
title = {ModChemBERT: ModernBERT as a Chemical Language Model}, |
|
year = {2025}, |
|
publisher = {GitHub}, |
|
howpublished = {GitHub repository}, |
|
url = {https://github.com/emapco/ModChemBERT} |
|
} |
|
``` |
|
|
|
## References |
|
1. Kallergis, G., Asgari, E., Empting, M. et al. Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa. Commun Chem 8, 114 (2025). https://doi.org/10.1038/s42004-025-01484-4 |
|
2. Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer-and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025). |
|
3. Sultan, Afnan, et al. "Transformers for molecular property prediction: Domain adaptation efficiently improves performance." arXiv preprint arXiv:2503.03360 (2025). |
|
4. Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024). |
|
5. Clavié, Benjamin. "JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources." arXiv preprint arXiv:2407.20750 (2024). |
|
6. Grattafiori, Aaron, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024). |
|
7. Singh R, Barsainyan AA, Irfan R, Amorin CJ, He S, Davis T, et al. ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models. ChemRxiv (preprint). 2025. doi:10.26434/chemrxiv-2025-4glrl-v2
|
8. Mswahili, M.E., Hwang, J., Rajapakse, J.C. et al. Positional embeddings and zero-shot learning using BERT for molecular-property prediction. J Cheminform 17, 17 (2025). https://doi.org/10.1186/s13321-025-00959-9 |
|
9. Mswahili, M.E.; Ndomba, G.E.; Jo, K.; Jeong, Y.-S. Graph Neural Network and BERT Model for Antimalarial Drug Predictions Using Plasmodium Potential Targets. Applied Sciences, 2024, 14(4), 1472. https://doi.org/10.3390/app14041472 |
|
10. Mswahili, M.E.; Lee, M.-J.; Martin, G.L.; Kim, J.; Kim, P.; Choi, G.J.; Jeong, Y.-S. Cocrystal Prediction Using Machine Learning Models and Descriptors. Applied Sciences, 2021, 11, 1323. https://doi.org/10.3390/app11031323 |
|
11. Harigua-Souiai, E.; Heinhane, M.M.; Abdelkrim, Y.Z.; Souiai, O.; Abdeljaoued-Tej, I.; Guizani, I. Deep Learning Algorithms Achieved Satisfactory Predictions When Trained on a Novel Collection of Anticoronavirus Molecules. Frontiers in Genetics, 2021, 12:744170. https://doi.org/10.3389/fgene.2021.744170 |
|
12. Cheng Fang, Ye Wang, Richard Grater, Sudarshan Kapadnis, Cheryl Black, Patrick Trapa, and Simone Sciabola. "Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective" Journal of Chemical Information and Modeling 2023 63 (11), 3263-3274 https://doi.org/10.1021/acs.jcim.3c00160 |
|
|