ModChemBERT: ModernBERT as a Chemical Language Model

ModChemBERT is a ModernBERT-based chemical language model (CLM), trained on SMILES strings for masked language modeling (MLM) and downstream molecular property prediction (classification & regression).

Usage

Install the transformers library starting from v4.56.1:

pip install -U transformers>=4.56.1

Load Model

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype="float16",
    device_map="auto",
)

Fill-Mask Pipeline

from transformers import pipeline

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("c1ccccc1[MASK]"))

Architecture

Backbone: ModernBERT
Hidden size: 768
Intermediate size: 1152
Encoder Layers: 22
Attention heads: 12
Max sequence length: 256 tokens (MLM primarily trained with 128-token sequences)
Tokenizer: BPE tokenizer using MolFormer's vocab (2362 tokens)

Pooling (Classifier / Regressor Head)

Kallergis et al. [1] demonstrated that the CLM embedding method prior to the prediction head was the strongest contributor to downstream performance among evaluated hyperparameters.

Behrendt et al. [2] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the max_seq_mha pooling method was particularly effective in low-data regimes, which is often the case for molecular property prediction tasks.

Multiple pooling strategies are supported by ModChemBERT to explore their impact on downstream performance:

cls: Last layer [CLS]
mean: Mean over last hidden layer
max_cls: Max over last k layers of [CLS]
cls_mha: MHA with [CLS] as query
max_seq_mha: MHA with max pooled sequence as KV and max pooled [CLS] as query
sum_mean: Sum over all layers then mean tokens
sum_sum: Sum over all layers then sum tokens
mean_mean: Mean over all layers then mean tokens
mean_sum: Mean over all layers then sum tokens
max_seq_mean: Max over last k layers then mean tokens

Note: ModChemBERT’s max_seq_mha differs from MaxPoolBERT [2]. MaxPoolBERT uses PyTorch nn.MultiheadAttention, whereas ModChemBERT's ModChemBertPoolingAttention adapts ModernBERT’s ModernBertAttention. On ChemBERTa-3 benchmarks this variant produced stronger validation metrics and avoided the training instabilities (sporadic zero / NaN losses and gradient norms) seen with nn.MultiheadAttention. Training instability with ModernBERT has been reported in the past (discussion 1 and discussion 2).

Training Pipeline

Rationale for MTR Stage

Following Sultan et al. [3], multi-task regression (physicochemical properties) biases the latent space toward ADME-related representations prior to narrow TAFT specialization. Sultan et al. observed that MLM + DAPT (MTR) outperforms MLM-only, MTR-only, and MTR + DAPT (MTR).

Checkpoint Averaging Motivation

Inspired by ModernBERT [4], JaColBERTv2.5 [5], and Llama 3.1 [6], where results show that model merging can enhance generalization or performance while mitigating overfitting to any single fine-tune or annealing checkpoint.

Datasets

Pretraining: Derify/augmented_canonical_druglike_QED_Pfizer_15M (canonical_smiles column)
Domain Adaptive Pretraining (DAPT) & Task Adaptive Fine-tuning (TAFT): ADME (6 tasks) + AstraZeneca (4 tasks) datasets that are split using DA4MT's [3] Bemis-Murcko scaffold splitter (see domain-adaptation-molecular-transformers)
Benchmarking:
- ChemBERTa-3 [7]
  - classification: BACE, BBBP, TOX21, HIV, SIDER, CLINTOX
  - regression: ESOL, FREESOLV, LIPO, BACE, CLEARANCE
- Mswahili, et al. [8] proposed additional datasets for benchmarking chemical language models:
  - classification: Antimalarial [9], Cocrystal [10], COVID19 [11]
- DAPT/TAFT stage regression datasets:
  - ADME [12]: adme_microsom_stab_h, adme_microsom_stab_r, adme_permeability, adme_ppb_h, adme_ppb_r, adme_solubility
  - AstraZeneca: astrazeneca_CL, astrazeneca_LogD74, astrazeneca_PPB, astrazeneca_Solubility

Benchmarking

Benchmarks were conducted using the ChemBERTa-3 framework. DeepChem scaffold splits were utilized for all datasets, with the exception of the Antimalarial dataset, which employed a random split. Each task was trained for 100 epochs, with results averaged across 3 random seeds.

The complete hyperparameter configurations for these benchmarks are available here: ChemBERTa3 configs

Evaluation Methodology

Classification Metric: ROC AUC
Regression Metric: RMSE
Aggregation: Mean ± standard deviation of the triplicate results.
Input Constraints: SMILES truncated / filtered to ≤200 tokens, following ChemBERTa-3's recommendation.

Results

Click to expand

ChemBERTa-3 Classification Datasets (ROC AUC - Higher is better)

Model	BACE↑	BBBP↑	CLINTOX↑	HIV↑	SIDER↑	TOX21↑	AVG†
Tasks	1	1	2	1	27	12
ChemBERTa-100M-MLM*	0.781 ± 0.019	0.700 ± 0.027	0.979 ± 0.022	0.740 ± 0.013	0.611 ± 0.002	0.718 ± 0.011	0.7548
c3-MoLFormer-1.1B*	0.819 ± 0.019	0.735 ± 0.019	0.839 ± 0.013	0.762 ± 0.005	0.618 ± 0.005	0.723 ± 0.012	0.7493
MoLFormer-LHPC*	0.887 ± 0.004	0.908 ± 0.013	0.993 ± 0.004	0.750 ± 0.003	0.622 ± 0.007	0.791 ± 0.014	0.8252

MLM	0.8065 ± 0.0103	0.7222 ± 0.0150	0.9709 ± 0.0227	0.7800 ± 0.0133	0.6419 ± 0.0113	0.7400 ± 0.0044	0.7769
MLM + DAPT	0.8224 ± 0.0156	0.7402 ± 0.0095	0.9820 ± 0.0138	0.7702 ± 0.0020	0.6303 ± 0.0039	0.7360 ± 0.0036	0.7802
MLM + TAFT	0.7924 ± 0.0155	0.7282 ± 0.0058	0.9725 ± 0.0213	0.7770 ± 0.0047	0.6542 ± 0.0128	0.7646 ± 0.0039	0.7815
MLM + DAPT + TAFT	0.8213 ± 0.0051	0.7356 ± 0.0094	0.9664 ± 0.0202	0.7750 ± 0.0048	0.6415 ± 0.0094	0.7263 ± 0.0036	0.7777
MLM + DAPT + TAFT OPT	0.8346 ± 0.0045	0.7573 ± 0.0120	0.9938 ± 0.0017	0.7737 ± 0.0034	0.6600 ± 0.0061	0.7518 ± 0.0047	0.7952

ChemBERTa-3 Regression Datasets (RMSE - Lower is better)

Model	BACE↓	CLEARANCE↓	ESOL↓	FREESOLV↓	LIPO↓	AVG‡
Tasks	1	1	1	1	1
ChemBERTa-100M-MLM*	1.011 ± 0.038	51.582 ± 3.079	0.920 ± 0.011	0.536 ± 0.016	0.758 ± 0.013	0.8063 / 10.9614
c3-MoLFormer-1.1B*	1.094 ± 0.126	52.058 ± 2.767	0.829 ± 0.019	0.572 ± 0.023	0.728 ± 0.016	0.8058 / 11.0562
MoLFormer-LHPC*	1.201 ± 0.100	45.74 ± 2.637	0.848 ± 0.031	0.683 ± 0.040	0.895 ± 0.080	0.9068 / 9.8734

MLM	1.0893 ± 0.1319	49.0005 ± 1.2787	0.8456 ± 0.0406	0.5491 ± 0.0134	0.7147 ± 0.0062	0.7997 / 10.4398
MLM + DAPT	0.9931 ± 0.0258	45.4951 ± 0.7112	0.9319 ± 0.0153	0.6049 ± 0.0666	0.6874 ± 0.0040	0.8043 / 9.7425
MLM + TAFT	1.0304 ± 0.1146	47.8418 ± 0.4070	0.7669 ± 0.0024	0.5293 ± 0.0267	0.6708 ± 0.0074	0.7493 / 10.1678
MLM + DAPT + TAFT	0.9713 ± 0.0224	42.8010 ± 3.3475	0.8169 ± 0.0268	0.5445 ± 0.0257	0.6820 ± 0.0028	0.7537 / 9.1631
MLM + DAPT + TAFT OPT	0.9665 ± 0.0250	44.0137 ± 1.1110	0.8158 ± 0.0115	0.4979 ± 0.0158	0.6505 ± 0.0126	0.7327 / 9.3889

Mswahili, et al. [8] Proposed Classification Datasets (ROC AUC - Higher is better)

Model	Antimalarial↑	Cocrystal↑	COVID19↑	AVG†
Tasks	1	1	1
MLM	0.8707 ± 0.0032	0.7967 ± 0.0124	0.8106 ± 0.0170	0.8260
MLM + DAPT	0.8756 ± 0.0056	0.8288 ± 0.0143	0.8029 ± 0.0159	0.8358
MLM + TAFT	0.8832 ± 0.0051	0.7866 ± 0.0204	0.8308 ± 0.0026	0.8335
MLM + DAPT + TAFT	0.8819 ± 0.0052	0.8550 ± 0.0106	0.8013 ± 0.0118	0.8461
MLM + DAPT + TAFT OPT	0.8966 ± 0.0045	0.8654 ± 0.0080	0.8132 ± 0.0195	0.8584

ADME/AstraZeneca Regression Datasets (RMSE - Lower is better)

Hyperparameter optimization for the TAFT stage appears to induce overfitting, as the MLM + DAPT + TAFT OPT model shows slightly degraded performance on the ADME/AstraZeneca datasets compared to the MLM + DAPT + TAFT model. The MLM + DAPT + TAFT model, a merge of unoptimized TAFT checkpoints trained with max_seq_mean pooling, achieved the best overall performance across the ADME/AstraZeneca datasets.

	ADME						AstraZeneca
Model	microsom_stab_h↓	microsom_stab_r↓	permeability↓	ppb_h↓	ppb_r↓	solubility↓	CL↓	LogD74↓	PPB↓	Solubility↓	AVG†

Tasks	1	1	1	1	1	1	1	1	1	1
MLM	0.4489 ± 0.0114	0.4685 ± 0.0225	0.5423 ± 0.0076	0.8041 ± 0.0378	0.7849 ± 0.0394	0.5191 ± 0.0147	0.4812 ± 0.0073	0.8204 ± 0.0070	0.1365 ± 0.0066	0.9614 ± 0.0189	0.5967
MLM + DAPT	0.4199 ± 0.0064	0.4568 ± 0.0091	0.5042 ± 0.0135	0.8376 ± 0.0629	0.8446 ± 0.0756	0.4800 ± 0.0118	0.5351 ± 0.0036	0.8191 ± 0.0066	0.1237 ± 0.0022	0.9280 ± 0.0088	0.5949
MLM + TAFT	0.4375 ± 0.0027	0.4542 ± 0.0024	0.5202 ± 0.0141	0.7618 ± 0.0138	0.7027 ± 0.0023	0.5023 ± 0.0107	0.5104 ± 0.0110	0.7599 ± 0.0050	0.1233 ± 0.0088	0.8730 ± 0.0112	0.5645
MLM + DAPT + TAFT	0.4206 ± 0.0071	0.4400 ± 0.0039	0.4899 ± 0.0068	0.8927 ± 0.0163	0.6942 ± 0.0397	0.4641 ± 0.0082	0.5022 ± 0.0136	0.7467 ± 0.0041	0.1195 ± 0.0026	0.8564 ± 0.0265	0.5626
MLM + DAPT + TAFT OPT	0.4248 ± 0.0041	0.4403 ± 0.0046	0.5025 ± 0.0029	0.8901 ± 0.0123	0.7268 ± 0.0090	0.4627 ± 0.0083	0.4932 ± 0.0079	0.7596 ± 0.0044	0.1150 ± 0.0002	0.8735 ± 0.0053	0.5689

Bold indicates the best result in the column; italic indicates the best result among ModChemBERT checkpoints.
* Published results from the ChemBERTa-3 [7] paper for optimized chemical language models using DeepChem scaffold splits.
† AVG column shows the mean score across classification tasks.
‡ AVG column shows the mean scores across regression tasks without and with the clearance score.

Optimized ModChemBERT Hyperparameters

Click to expand

TAFT Datasets

Optimal parameters (per dataset) for the MLM + DAPT + TAFT OPT merged model:

Dataset	Learning Rate	Batch Size	Warmup Ratio	Classifier Pooling	Last k Layers
adme_microsom_stab_h	3e-5	8	0.0	max_seq_mean	5
adme_microsom_stab_r	3e-5	16	0.2	max_cls	3
adme_permeability	3e-5	8	0.0	max_cls	3
adme_ppb_h	1e-5	32	0.1	max_seq_mean	5
adme_ppb_r	1e-5	32	0.0	sum_mean	N/A
adme_solubility	3e-5	32	0.0	sum_mean	N/A
astrazeneca_CL	3e-5	8	0.1	max_seq_mha	3
astrazeneca_LogD74	1e-5	8	0.0	max_seq_mean	5
astrazeneca_PPB	1e-5	32	0.0	max_cls	3
astrazeneca_Solubility	1e-5	32	0.0	max_seq_mean	5

Benchmarking Datasets

Optimal parameters (per dataset) for the MLM + DAPT + TAFT OPT merged model:

Dataset	Batch Size	Classifier Pooling	Last k Layers	Pooling Attention Dropout	Classifier Dropout	Embedding Dropout
bace_classification	32	max_seq_mha	3	0.0	0.0	0.0
bbbp	64	max_cls	3	0.1	0.0	0.0
clintox	32	max_seq_mha	5	0.1	0.0	0.0
hiv	32	max_seq_mha	3	0.0	0.0	0.0
sider	32	mean	N/A	0.1	0.0	0.1
tox21	32	max_seq_mha	5	0.1	0.0	0.0
base_regression	32	max_seq_mha	5	0.1	0.0	0.0
clearance	32	max_seq_mha	5	0.1	0.0	0.0
esol	64	sum_mean	N/A	0.1	0.0	0.1
freesolv	32	max_seq_mha	5	0.1	0.0	0.0
lipo	32	max_seq_mha	3	0.1	0.1	0.1
antimalarial	16	max_seq_mha	3	0.1	0.1	0.1
cocrystal	16	max_cls	3	0.1	0.0	0.1
covid19	16	sum_mean	N/A	0.1	0.0	0.1

Intended Use

Primary: Research and development for molecular property prediction, experimentation with pooling strategies, and as a foundational model for downstream applications.
Appropriate for: Binary / multi-class classification (e.g., toxicity, activity) and single-task or multi-task regression (e.g., solubility, clearance) after fine-tuning.
Not intended for generating novel molecules.

Limitations

Out-of-domain performance may degrade for: very long (>128 token) SMILES, inorganic / organometallic compounds, polymers, or charged / enumerated tautomers are not well represented in training.
No guarantee of synthesizability, safety, or biological efficacy.

Ethical Considerations & Responsible Use

Potential biases arise from training corpora skewed to drug-like space.
Do not deploy in clinical or regulatory settings without rigorous, domain-specific validation.

Hardware

Training and experiments were performed on 2 NVIDIA RTX 3090 GPUs.

Citation

If you use ModChemBERT in your research, please cite the checkpoint and the following:

@software{cortes-2025-modchembert,
  author = {Emmanuel Cortes},
  title = {ModChemBERT: ModernBERT as a Chemical Language Model},
  year = {2025},
  publisher = {GitHub},
  howpublished = {GitHub repository},
  url = {https://github.com/emapco/ModChemBERT}
}

References

Kallergis, G., Asgari, E., Empting, M. et al. Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa. Commun Chem 8, 114 (2025). https://doi.org/10.1038/s42004-025-01484-4
Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer-and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025).
Sultan, Afnan, et al. "Transformers for molecular property prediction: Domain adaptation efficiently improves performance." arXiv preprint arXiv:2503.03360 (2025).
Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024).
Clavié, Benjamin. "JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources." arXiv preprint arXiv:2407.20750 (2024).
Grattafiori, Aaron, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
Singh R, Barsainyan AA, Irfan R, Amorin CJ, He S, Davis T, et al. ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models. ChemRxiv. 2025; doi:10.26434/chemrxiv-2025-4glrl-v2 This content is a preprint and has not been peer-reviewed.
Mswahili, M.E., Hwang, J., Rajapakse, J.C. et al. Positional embeddings and zero-shot learning using BERT for molecular-property prediction. J Cheminform 17, 17 (2025). https://doi.org/10.1186/s13321-025-00959-9
Mswahili, M.E.; Ndomba, G.E.; Jo, K.; Jeong, Y.-S. Graph Neural Network and BERT Model for Antimalarial Drug Predictions Using Plasmodium Potential Targets. Applied Sciences, 2024, 14(4), 1472. https://doi.org/10.3390/app14041472
Mswahili, M.E.; Lee, M.-J.; Martin, G.L.; Kim, J.; Kim, P.; Choi, G.J.; Jeong, Y.-S. Cocrystal Prediction Using Machine Learning Models and Descriptors. Applied Sciences, 2021, 11, 1323. https://doi.org/10.3390/app11031323
Harigua-Souiai, E.; Heinhane, M.M.; Abdelkrim, Y.Z.; Souiai, O.; Abdeljaoued-Tej, I.; Guizani, I. Deep Learning Algorithms Achieved Satisfactory Predictions When Trained on a Novel Collection of Anticoronavirus Molecules. Frontiers in Genetics, 2021, 12:744170. https://doi.org/10.3389/fgene.2021.744170
Cheng Fang, Ye Wang, Richard Grater, Sudarshan Kapadnis, Cheryl Black, Patrick Trapa, and Simone Sciabola. "Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective" Journal of Chemical Information and Modeling 2023 63 (11), 3263-3274 https://doi.org/10.1021/acs.jcim.3c00160