|
--- |
|
license: apache-2.0 |
|
base_model: Derify/ModChemBERT-MLM-DAPT |
|
datasets: |
|
- Derify/augmented_canonical_druglike_QED_Pfizer_15M |
|
metrics: |
|
- roc_auc |
|
- rmse |
|
library_name: transformers |
|
tags: |
|
- modernbert |
|
- ModChemBERT |
|
- cheminformatics |
|
- chemical-language-model |
|
- molecular-property-prediction |
|
- mergekit |
|
- merge |
|
pipeline_tag: fill-mask |
|
model-index: |
|
- name: Derify/ModChemBERT
|
results: |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: BACE |
|
type: BACE |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8346 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: BBBP |
|
type: BBBP |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7573 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: CLINTOX |
|
type: CLINTOX |
|
metrics: |
|
- type: roc_auc |
|
value: 0.9938 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: HIV |
|
type: HIV |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7737 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: SIDER |
|
type: SIDER |
|
metrics: |
|
- type: roc_auc |
|
value: 0.6600 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: TOX21 |
|
type: TOX21 |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7518 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: BACE |
|
type: BACE |
|
metrics: |
|
- type: rmse |
|
value: 0.9665 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: CLEARANCE |
|
type: CLEARANCE |
|
metrics: |
|
- type: rmse |
|
value: 44.0137 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ESOL |
|
type: ESOL |
|
metrics: |
|
- type: rmse |
|
value: 0.8158 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: FREESOLV |
|
type: FREESOLV |
|
metrics: |
|
- type: rmse |
|
value: 0.4979 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: LIPO |
|
type: LIPO |
|
metrics: |
|
- type: rmse |
|
value: 0.6505 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: Antimalarial |
|
type: Antimalarial |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8966 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: Cocrystal |
|
type: Cocrystal |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8654 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: COVID19 |
|
type: COVID19 |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8132 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME microsom stab human |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.4248 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME microsom stab rat |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.4403 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME permeability |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.5025 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME ppb human |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.8901 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME ppb rat |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.7268 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ADME solubility |
|
type: ADME |
|
metrics: |
|
- type: rmse |
|
value: 0.4627 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: AstraZeneca CL |
|
type: AstraZeneca |
|
metrics: |
|
- type: rmse |
|
value: 0.4932 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: AstraZeneca LogD74 |
|
type: AstraZeneca |
|
metrics: |
|
- type: rmse |
|
value: 0.7596 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: AstraZeneca PPB |
|
type: AstraZeneca |
|
metrics: |
|
- type: rmse |
|
value: 0.1150 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: AstraZeneca Solubility |
|
type: AstraZeneca |
|
metrics: |
|
- type: rmse |
|
value: 0.8735 |
|
--- |
|
|
|
# ModChemBERT: ModernBERT as a Chemical Language Model |
|
ModChemBERT is a ModernBERT-based chemical language model (CLM) trained on SMILES strings with masked language modeling (MLM) and fine-tuned for downstream molecular property prediction (classification & regression).
|
|
|
## Usage |
|
Install `transformers` v4.56.1 or later:
|
|
|
```bash |
|
pip install -U "transformers>=4.56.1"
|
``` |
|
|
|
### Load Model |
|
```python |
|
from transformers import AutoModelForMaskedLM, AutoTokenizer |
|
|
|
model_id = "Derify/ModChemBERT" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
model = AutoModelForMaskedLM.from_pretrained( |
|
model_id, |
|
trust_remote_code=True, |
|
dtype="float16", |
|
device_map="auto", |
|
) |
|
``` |
|
|
|
### Fill-Mask Pipeline |
|
```python |
|
from transformers import pipeline |
|
|
|
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer) |
|
print(fill("c1ccccc1[MASK]")) |
|
``` |
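
### Embedding Extraction (Example)

For similarity search or as input features to downstream property models, fixed-size molecule embeddings can be derived from the encoder outputs. The sketch below uses simple masked mean pooling over the last hidden layer for illustration only; the fine-tuned checkpoints use the pooling strategies described in the Pooling section, and `output_hidden_states` support is assumed to carry over from ModernBERT.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]
enc = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Masked mean pooling over the last hidden layer -> one vector per molecule
hidden = out.hidden_states[-1]                      # (batch, seq, hidden)
mask = enc["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                             # (3, 768)
```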
|
|
|
## Architecture |
|
- Backbone: ModernBERT |
|
- Hidden size: 768 |
|
- Intermediate size: 1152 |
|
- Encoder Layers: 22 |
|
- Attention heads: 12 |
|
- Max sequence length: 256 tokens (MLM primarily trained with 128-token sequences) |
|
- Tokenizer: BPE tokenizer using [MolFormer's vocab](https://github.com/emapco/ModChemBERT/blob/main/modchembert/tokenizers/molformer/vocab.json) (2362 tokens) |
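
These dimensions can be checked against the checkpoint configuration. The snippet below is a quick sanity check; the attribute names are the standard ModernBERT config fields and are assumed to carry over to ModChemBERT.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Derify/ModChemBERT", trust_remote_code=True)
print(config.hidden_size)          # 768
print(config.intermediate_size)    # 1152
print(config.num_hidden_layers)    # 22
print(config.num_attention_heads)  # 12
```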
|
|
|
## Pooling (Classifier / Regressor Head) |
|
Kallergis et al. [1] demonstrated that the CLM embedding method prior to the prediction head was the strongest contributor to downstream performance among evaluated hyperparameters. |
|
|
|
Behrendt et al. [2] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the `max_seq_mha` pooling method was particularly effective in low-data regimes, which is often the case for molecular property prediction tasks. |
|
|
|
Multiple pooling strategies are supported by ModChemBERT to explore their impact on downstream performance: |
|
- `cls`: Last layer [CLS] |
|
- `mean`: Mean over last hidden layer |
|
- `max_cls`: Max over last k layers of [CLS] |
|
- `cls_mha`: MHA with [CLS] as query |
|
- `max_seq_mha`: MHA with max pooled sequence as KV and max pooled [CLS] as query |
|
- `sum_mean`: Sum over all layers then mean tokens |
|
- `sum_sum`: Sum over all layers then sum tokens |
|
- `mean_mean`: Mean over all layers then mean tokens |
|
- `mean_sum`: Mean over all layers then sum tokens |
|
- `max_seq_mean`: Max over last k layers then mean tokens |
|
|
|
Note: ModChemBERT’s `max_seq_mha` differs from MaxPoolBERT [2]. MaxPoolBERT uses PyTorch `nn.MultiheadAttention`, whereas ModChemBERT's `ModChemBertPoolingAttention` adapts ModernBERT’s `ModernBertAttention`. |
|
On ChemBERTa-3 benchmarks this variant produced stronger validation metrics and avoided the training instabilities (sporadic zero / NaN losses and gradient norms) seen with `nn.MultiheadAttention`. Training instability with ModernBERT has been reported in the past ([discussion 1](https://huggingface.co/answerdotai/ModernBERT-base/discussions/59) and [discussion 2](https://huggingface.co/answerdotai/ModernBERT-base/discussions/63)). |
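
For illustration, the sketch below re-implements two of the strategies (`max_cls` and `max_seq_mean`) directly from the encoder's hidden states. It mirrors the descriptions above but is not the model's internal `ModChemBertPoolingAttention` head; `k` corresponds to the "last k layers" hyperparameter listed later in this card.

```python
import torch

def max_cls(hidden_states, k=3):
    """Max over the last k layers of the [CLS] token."""
    # hidden_states: tuple of (num_layers + 1) tensors, each (batch, seq, hidden)
    stacked = torch.stack(hidden_states[-k:], dim=0)   # (k, batch, seq, hidden)
    return stacked.max(dim=0).values[:, 0]             # (batch, hidden)

def max_seq_mean(hidden_states, attention_mask, k=3):
    """Max over the last k layers, then masked mean over tokens."""
    pooled = torch.stack(hidden_states[-k:], dim=0).max(dim=0).values  # (batch, seq, hidden)
    mask = attention_mask.unsqueeze(-1).float()
    return (pooled * mask).sum(dim=1) / mask.sum(dim=1)                # (batch, hidden)

# Usage with a model loaded as in the Usage section:
# out = model(**enc, output_hidden_states=True)
# emb = max_seq_mean(out.hidden_states, enc["attention_mask"], k=3)
```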
|
|
|
## Training Pipeline |
|
<div align="center"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/656892962693fa22e18b5331/bxNbpgMkU8m60ypyEJoWQ.png" alt="ModChemBERT Training Pipeline" width="650"/> |
|
</div> |
|
|
|
### Rationale for MTR Stage |
|
Following Sultan et al. [3], multi-task regression (MTR) on physicochemical properties biases the latent space toward ADME-relevant representations before the narrower TAFT specialization. Sultan et al. observed that MLM pretraining followed by DAPT with MTR outperforms MLM-only, MTR-only, and MTR followed by DAPT with MTR.
|
|
|
### Checkpoint Averaging Motivation |
|
Checkpoint averaging is inspired by ModernBERT [4], JaColBERTv2.5 [5], and Llama 3.1 [6], whose results show that model merging can enhance generalization or performance while mitigating overfitting to any single fine-tuning or annealing checkpoint.
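
As a conceptual illustration, checkpoint averaging can be as simple as uniform parameter averaging across fine-tuned checkpoints. The sketch below is not the exact merging procedure used here (the released models were merged with mergekit, per the model tags), and the file names are hypothetical.

```python
import torch

# Hypothetical checkpoint files; uniform (equal-weight) parameter averaging
paths = ["taft_seed0.bin", "taft_seed1.bin", "taft_seed2.bin"]
states = [torch.load(p, map_location="cpu") for p in paths]

merged = {}
for key in states[0]:
    tensors = [s[key] for s in states]
    if tensors[0].is_floating_point():
        merged[key] = torch.stack(tensors).mean(dim=0)  # average weights
    else:
        merged[key] = tensors[0]                        # copy non-float buffers as-is

torch.save(merged, "merged_checkpoint.bin")
```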
|
|
|
## Datasets |
|
- Pretraining: [Derify/augmented_canonical_druglike_QED_Pfizer_15M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_Pfizer_15M) (canonical_smiles column) |
|
- Domain Adaptive Pretraining (DAPT) & Task Adaptive Fine-tuning (TAFT): ADME (6 tasks) + AstraZeneca (4 tasks) datasets, split with DA4MT's [3] Bemis-Murcko scaffold splitter (see [domain-adaptation-molecular-transformers](https://github.com/emapco/ModChemBERT/blob/main/domain-adaptation-molecular-transformers/da4mt/splitting.py); an illustrative scaffold-grouping sketch follows this list)
|
- Benchmarking: |
|
- ChemBERTa-3 [7] |
|
- classification: BACE, BBBP, TOX21, HIV, SIDER, CLINTOX |
|
- regression: ESOL, FREESOLV, LIPO, BACE, CLEARANCE |
|
- Mswahili et al. [8] proposed additional datasets for benchmarking chemical language models:
|
- classification: Antimalarial [9], Cocrystal [10], COVID19 [11] |
|
- DAPT/TAFT stage regression datasets: |
|
- ADME [12]: adme_microsom_stab_h, adme_microsom_stab_r, adme_permeability, adme_ppb_h, adme_ppb_r, adme_solubility |
|
- AstraZeneca: astrazeneca_CL, astrazeneca_LogD74, astrazeneca_PPB, astrazeneca_Solubility |
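
As referenced in the DAPT/TAFT bullet above, those splits group molecules by Bemis-Murcko scaffold. The RDKit sketch below illustrates the idea; it is a simplified stand-in for the DA4MT splitter linked above, and the example SMILES and split ratio are placeholders.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles = ["CCOc1ccccc1", "CCN(CC)c1ccccc1", "CCO", "OCCO", "Cc1ccncc1"]

# Group molecules by their Bemis-Murcko scaffold SMILES ("" for acyclic molecules)
groups = defaultdict(list)
for s in smiles:
    groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=s)].append(s)

# Assign whole scaffold groups to train/test so a scaffold never appears in both splits
train, test = [], []
for group in sorted(groups.values(), key=len, reverse=True):
    (train if len(train) <= 4 * len(test) else test).extend(group)

print(train, test)
```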
|
|
|
## Benchmarking |
|
Benchmarks were conducted with the ChemBERTa-3 framework. DeepChem scaffold splits were used for all datasets except the Antimalarial dataset, which used a random split. Each task was trained for 100 epochs, with results averaged across 3 random seeds.
|
|
|
The complete hyperparameter configurations for these benchmarks are available here: [ChemBERTa3 configs](https://github.com/emapco/ModChemBERT/tree/main/conf/chemberta3) |
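
For orientation only, the sketch below shows a minimal fine-tuning loop with the Hugging Face `Trainer`. It is not the benchmark harness (the results below come from the ChemBERTa-3 framework with the linked configs), the toy dataset and hyperparameters are placeholders, and it assumes the checkpoint's remote code exposes a sequence-classification head.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "Derify/ModChemBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

# Toy dataset for illustration; real runs use the benchmark scaffold splits
train = Dataset.from_dict({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
    "label": [0, 1, 0, 1],
})
train = train.map(
    lambda batch: tokenizer(batch["smiles"], truncation=True, max_length=256),
    batched=True,
)

args = TrainingArguments(
    output_dir="modchembert-finetune",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=3e-5,
    report_to="none",
)
Trainer(model=model, args=args, train_dataset=train, processing_class=tokenizer).train()
```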
|
|
|
### Evaluation Methodology |
|
- Classification Metric: ROC AUC |
|
- Regression Metric: RMSE |
|
- Aggregation: Mean ± standard deviation of the triplicate results. |
|
- Input Constraints: SMILES truncated / filtered to ≤200 tokens, following ChemBERTa-3's recommendation. |
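
One way to apply the ≤200-token constraint with the model's own tokenizer is sketched below; the ChemBERTa-3 framework may implement this filtering step differently.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Derify/ModChemBERT")

def within_token_limit(smiles: str, max_tokens: int = 200) -> bool:
    # Count tokens (including special tokens), mirroring the benchmark cutoff
    return len(tokenizer(smiles)["input_ids"]) <= max_tokens

smiles_list = ["CCO", "c1ccccc1O"]
filtered = [s for s in smiles_list if within_token_limit(s)]
```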
|
|
|
### Results |
|
<details><summary>Click to expand</summary> |
|
|
|
#### ChemBERTa-3 Classification Datasets (ROC AUC - Higher is better) |
|
|
|
| Model | BACE↑ | BBBP↑ | CLINTOX↑ | HIV↑ | SIDER↑ | TOX21↑ | AVG† | |
|
| ---------------------------------------------------------------------------- | ----------------- | ----------------- | --------------------- | --------------------- | --------------------- | ----------------- | ------ | |
|
| **Tasks** | 1 | 1 | 2 | 1 | 27 | 12 | | |
|
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)* | 0.781 ± 0.019 | 0.700 ± 0.027 | 0.979 ± 0.022 | 0.740 ± 0.013 | 0.611 ± 0.002 | 0.718 ± 0.011 | 0.7548 | |
|
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)* | 0.819 ± 0.019 | 0.735 ± 0.019 | 0.839 ± 0.013 | 0.762 ± 0.005 | 0.618 ± 0.005 | 0.723 ± 0.012 | 0.7493 | |
|
| MoLFormer-LHPC* | **0.887 ± 0.004** | **0.908 ± 0.013** | 0.993 ± 0.004 | 0.750 ± 0.003 | 0.622 ± 0.007 | **0.791 ± 0.014** | 0.8252 | |
|
| | | | | | | | | |
|
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 0.8065 ± 0.0103 | 0.7222 ± 0.0150 | 0.9709 ± 0.0227 | ***0.7800 ± 0.0133*** | 0.6419 ± 0.0113 | 0.7400 ± 0.0044 | 0.7769 | |
|
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | 0.8224 ± 0.0156 | 0.7402 ± 0.0095 | 0.9820 ± 0.0138 | 0.7702 ± 0.0020 | 0.6303 ± 0.0039 | 0.7360 ± 0.0036 | 0.7802 | |
|
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 0.7924 ± 0.0155 | 0.7282 ± 0.0058 | 0.9725 ± 0.0213 | 0.7770 ± 0.0047 | 0.6542 ± 0.0128 | *0.7646 ± 0.0039* | 0.7815 | |
|
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.8213 ± 0.0051 | 0.7356 ± 0.0094 | 0.9664 ± 0.0202 | 0.7750 ± 0.0048 | 0.6415 ± 0.0094 | 0.7263 ± 0.0036 | 0.7777 | |
|
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | *0.8346 ± 0.0045* | *0.7573 ± 0.0120* | ***0.9938 ± 0.0017*** | 0.7737 ± 0.0034 | ***0.6600 ± 0.0061*** | 0.7518 ± 0.0047 | 0.7952 | |
|
|
|
#### ChemBERTa-3 Regression Datasets (RMSE - Lower is better) |
|
|
|
| Model | BACE↓ | CLEARANCE↓ | ESOL↓ | FREESOLV↓ | LIPO↓ | AVG‡ | |
|
| ---------------------------------------------------------------------------- | --------------------- | ---------------------- | --------------------- | --------------------- | --------------------- | ---------------- | |
|
| **Tasks** | 1 | 1 | 1 | 1 | 1 | | |
|
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)* | 1.011 ± 0.038 | 51.582 ± 3.079 | 0.920 ± 0.011 | 0.536 ± 0.016 | 0.758 ± 0.013 | 0.8063 / 10.9614 | |
|
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)* | 1.094 ± 0.126 | 52.058 ± 2.767 | 0.829 ± 0.019 | 0.572 ± 0.023 | 0.728 ± 0.016 | 0.8058 / 11.0562 | |
|
| MoLFormer-LHPC* | 1.201 ± 0.100 | 45.74 ± 2.637 | 0.848 ± 0.031 | 0.683 ± 0.040 | 0.895 ± 0.080 | 0.9068 / 9.8734 | |
|
| | | | | | | |
|
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 1.0893 ± 0.1319 | 49.0005 ± 1.2787 | 0.8456 ± 0.0406 | 0.5491 ± 0.0134 | 0.7147 ± 0.0062 | 0.7997 / 10.4398 | |
|
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | 0.9931 ± 0.0258 | 45.4951 ± 0.7112 | 0.9319 ± 0.0153 | 0.6049 ± 0.0666 | 0.6874 ± 0.0040 | 0.8043 / 9.7425 | |
|
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 1.0304 ± 0.1146 | 47.8418 ± 0.4070 | ***0.7669 ± 0.0024*** | 0.5293 ± 0.0267 | 0.6708 ± 0.0074 | 0.7493 / 10.1678 | |
|
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.9713 ± 0.0224 | ***42.8010 ± 3.3475*** | 0.8169 ± 0.0268 | 0.5445 ± 0.0257 | 0.6820 ± 0.0028 | 0.7537 / 9.1631 | |
|
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | ***0.9665 ± 0.0250*** | 44.0137 ± 1.1110 | 0.8158 ± 0.0115 | ***0.4979 ± 0.0158*** | ***0.6505 ± 0.0126*** | 0.7327 / 9.3889 | |
|
|
|
#### Mswahili, et al. [8] Proposed Classification Datasets (ROC AUC - Higher is better) |
|
|
|
| Model | Antimalarial↑ | Cocrystal↑ | COVID19↑ | AVG† | |
|
| ---------------------------------------------------------------------------- | --------------------- | --------------------- | --------------------- | ------ | |
|
| **Tasks** | 1 | 1 | 1 | | |
|
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 0.8707 ± 0.0032 | 0.7967 ± 0.0124 | 0.8106 ± 0.0170 | 0.8260 | |
|
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | 0.8756 ± 0.0056 | 0.8288 ± 0.0143 | 0.8029 ± 0.0159 | 0.8358 | |
|
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 0.8832 ± 0.0051 | 0.7866 ± 0.0204 | ***0.8308 ± 0.0026*** | 0.8335 | |
|
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.8819 ± 0.0052 | 0.8550 ± 0.0106 | 0.8013 ± 0.0118 | 0.8461 | |
|
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | ***0.8966 ± 0.0045*** | ***0.8654 ± 0.0080*** | 0.8132 ± 0.0195 | 0.8584 | |
|
|
|
#### ADME/AstraZeneca Regression Datasets (RMSE - Lower is better) |
|
|
|
Hyperparameter optimization for the TAFT stage appears to induce overfitting, as the `MLM + DAPT + TAFT OPT` model shows slightly degraded performance on the ADME/AstraZeneca datasets compared to the `MLM + DAPT + TAFT` model. |
|
The `MLM + DAPT + TAFT` model, a merge of unoptimized TAFT checkpoints trained with `max_seq_mean` pooling, achieved the best overall performance across the ADME/AstraZeneca datasets. |
|
|
|
| | ADME | | | | | | AstraZeneca | | | | | |
|
| ---------------------------------------------------------------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------ | |
|
| Model | microsom_stab_h↓ | microsom_stab_r↓ | permeability↓ | ppb_h↓ | ppb_r↓ | solubility↓ | CL↓ | LogD74↓ | PPB↓ | Solubility↓ | AVG† | |
|
| | | | | | | | | | | | |
|
| **Tasks** | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | |
|
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 0.4489 ± 0.0114 | 0.4685 ± 0.0225 | 0.5423 ± 0.0076 | 0.8041 ± 0.0378 | 0.7849 ± 0.0394 | 0.5191 ± 0.0147 | **0.4812 ± 0.0073** | 0.8204 ± 0.0070 | 0.1365 ± 0.0066 | 0.9614 ± 0.0189 | 0.5967 | |
|
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | **0.4199 ± 0.0064** | 0.4568 ± 0.0091 | 0.5042 ± 0.0135 | 0.8376 ± 0.0629 | 0.8446 ± 0.0756 | 0.4800 ± 0.0118 | 0.5351 ± 0.0036 | 0.8191 ± 0.0066 | 0.1237 ± 0.0022 | 0.9280 ± 0.0088 | 0.5949 | |
|
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 0.4375 ± 0.0027 | 0.4542 ± 0.0024 | 0.5202 ± 0.0141 | **0.7618 ± 0.0138** | 0.7027 ± 0.0023 | 0.5023 ± 0.0107 | 0.5104 ± 0.0110 | 0.7599 ± 0.0050 | 0.1233 ± 0.0088 | 0.8730 ± 0.0112 | 0.5645 | |
|
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.4206 ± 0.0071 | **0.4400 ± 0.0039** | **0.4899 ± 0.0068** | 0.8927 ± 0.0163 | **0.6942 ± 0.0397** | 0.4641 ± 0.0082 | 0.5022 ± 0.0136 | **0.7467 ± 0.0041** | 0.1195 ± 0.0026 | **0.8564 ± 0.0265** | 0.5626 | |
|
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | 0.4248 ± 0.0041 | 0.4403 ± 0.0046 | 0.5025 ± 0.0029 | 0.8901 ± 0.0123 | 0.7268 ± 0.0090 | **0.4627 ± 0.0083** | 0.4932 ± 0.0079 | 0.7596 ± 0.0044 | **0.1150 ± 0.0002** | 0.8735 ± 0.0053 | 0.5689 | |
|
|
|
|
|
**Bold** indicates the best result in the column; *italic* indicates the best result among ModChemBERT checkpoints.<br/> |
|
\* Published results from the ChemBERTa-3 [7] paper for optimized chemical language models using DeepChem scaffold splits.<br/> |
|
† AVG column shows the mean score across classification tasks.<br/> |
|
‡ AVG column shows the mean scores across regression tasks without and with the clearance score. |
|
|
|
</details> |
|
|
|
## Optimized ModChemBERT Hyperparameters |
|
|
|
<details><summary>Click to expand</summary> |
|
|
|
### TAFT Datasets |
|
Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model: |
|
|
|
| Dataset | Learning Rate | Batch Size | Warmup Ratio | Classifier Pooling | Last k Layers | |
|
| ---------------------- | ------------- | ---------- | ------------ | ------------------ | ------------- | |
|
| adme_microsom_stab_h | 3e-5 | 8 | 0.0 | max_seq_mean | 5 | |
|
| adme_microsom_stab_r | 3e-5 | 16 | 0.2 | max_cls | 3 | |
|
| adme_permeability | 3e-5 | 8 | 0.0 | max_cls | 3 | |
|
| adme_ppb_h | 1e-5 | 32 | 0.1 | max_seq_mean | 5 | |
|
| adme_ppb_r | 1e-5 | 32 | 0.0 | sum_mean | N/A | |
|
| adme_solubility | 3e-5 | 32 | 0.0 | sum_mean | N/A | |
|
| astrazeneca_CL | 3e-5 | 8 | 0.1 | max_seq_mha | 3 | |
|
| astrazeneca_LogD74 | 1e-5 | 8 | 0.0 | max_seq_mean | 5 | |
|
| astrazeneca_PPB | 1e-5 | 32 | 0.0 | max_cls | 3 | |
|
| astrazeneca_Solubility | 1e-5 | 32 | 0.0 | max_seq_mean | 5 | |
|
|
|
### Benchmarking Datasets |
|
Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model: |
|
|
|
| Dataset | Batch Size | Classifier Pooling | Last k Layers | Pooling Attention Dropout | Classifier Dropout | Embedding Dropout | |
|
| ------------------- | ---------- | ------------------ | ------------- | ------------------------- | ------------------ | ----------------- | |
|
| bace_classification | 32 | max_seq_mha | 3 | 0.0 | 0.0 | 0.0 | |
|
| bbbp | 64 | max_cls | 3 | 0.1 | 0.0 | 0.0 | |
|
| clintox | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 | |
|
| hiv | 32 | max_seq_mha | 3 | 0.0 | 0.0 | 0.0 | |
|
| sider | 32 | mean | N/A | 0.1 | 0.0 | 0.1 | |
|
| tox21 | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 | |
|
| bace_regression     | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
|
| clearance | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 | |
|
| esol | 64 | sum_mean | N/A | 0.1 | 0.0 | 0.1 | |
|
| freesolv | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 | |
|
| lipo | 32 | max_seq_mha | 3 | 0.1 | 0.1 | 0.1 | |
|
| antimalarial | 16 | max_seq_mha | 3 | 0.1 | 0.1 | 0.1 | |
|
| cocrystal | 16 | max_cls | 3 | 0.1 | 0.0 | 0.1 | |
|
| covid19 | 16 | sum_mean | N/A | 0.1 | 0.0 | 0.1 | |
|
|
|
</details> |
|
|
|
## Intended Use |
|
* Primary: Research and development for molecular property prediction, experimentation with pooling strategies, and as a foundational model for downstream applications. |
|
* Appropriate for: Binary / multi-class classification (e.g., toxicity, activity) and single-task or multi-task regression (e.g., solubility, clearance) after fine-tuning. |
|
* Not intended for generating novel molecules. |
|
|
|
## Limitations |
|
- Out-of-domain performance may degrade for very long (>128 token) SMILES, inorganic / organometallic compounds, polymers, and charged or enumerated tautomers, which are not well represented in the training data.
|
- No guarantee of synthesizability, safety, or biological efficacy. |
|
|
|
## Ethical Considerations & Responsible Use |
|
- Potential biases arise from training corpora skewed to drug-like space. |
|
- Do not deploy in clinical or regulatory settings without rigorous, domain-specific validation. |
|
|
|
## Hardware |
|
Training and experiments were performed on 2 NVIDIA RTX 3090 GPUs. |
|
|
|
## Citation |
|
If you use ModChemBERT in your research, please cite the checkpoint and the following: |
|
``` |
|
@software{cortes-2025-modchembert, |
|
author = {Emmanuel Cortes}, |
|
title = {ModChemBERT: ModernBERT as a Chemical Language Model}, |
|
year = {2025}, |
|
publisher = {GitHub}, |
|
howpublished = {GitHub repository}, |
|
url = {https://github.com/emapco/ModChemBERT} |
|
} |
|
``` |
|
|
|
## References |
|
1. Kallergis, G., Asgari, E., Empting, M. et al. Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa. Commun Chem 8, 114 (2025). https://doi.org/10.1038/s42004-025-01484-4 |
|
2. Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer-and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025). |
|
3. Sultan, Afnan, et al. "Transformers for molecular property prediction: Domain adaptation efficiently improves performance." arXiv preprint arXiv:2503.03360 (2025). |
|
4. Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024). |
|
5. Clavié, Benjamin. "JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources." arXiv preprint arXiv:2407.20750 (2024). |
|
6. Grattafiori, Aaron, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024). |
|
7. Singh R, Barsainyan AA, Irfan R, Amorin CJ, He S, Davis T, et al. ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models. ChemRxiv (preprint). 2025. doi:10.26434/chemrxiv-2025-4glrl-v2
|
8. Mswahili, M.E., Hwang, J., Rajapakse, J.C. et al. Positional embeddings and zero-shot learning using BERT for molecular-property prediction. J Cheminform 17, 17 (2025). https://doi.org/10.1186/s13321-025-00959-9 |
|
9. Mswahili, M.E.; Ndomba, G.E.; Jo, K.; Jeong, Y.-S. Graph Neural Network and BERT Model for Antimalarial Drug Predictions Using Plasmodium Potential Targets. Applied Sciences, 2024, 14(4), 1472. https://doi.org/10.3390/app14041472 |
|
10. Mswahili, M.E.; Lee, M.-J.; Martin, G.L.; Kim, J.; Kim, P.; Choi, G.J.; Jeong, Y.-S. Cocrystal Prediction Using Machine Learning Models and Descriptors. Applied Sciences, 2021, 11, 1323. https://doi.org/10.3390/app11031323 |
|
11. Harigua-Souiai, E.; Heinhane, M.M.; Abdelkrim, Y.Z.; Souiai, O.; Abdeljaoued-Tej, I.; Guizani, I. Deep Learning Algorithms Achieved Satisfactory Predictions When Trained on a Novel Collection of Anticoronavirus Molecules. Frontiers in Genetics, 2021, 12:744170. https://doi.org/10.3389/fgene.2021.744170 |
|
12. Cheng Fang, Ye Wang, Richard Grater, Sudarshan Kapadnis, Cheryl Black, Patrick Trapa, and Simone Sciabola. "Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective" Journal of Chemical Information and Modeling 2023 63 (11), 3263-3274 https://doi.org/10.1021/acs.jcim.3c00160 |
|
|