Tox21-v4 — VIDRAFT Aether PharmaOS Toxicity Models

12-endpoint Tox21 toxicity prediction models, trained de novo by VIDRAFT Aether PharmaOS. Successor to VIDraft/tox21-v3-models, adding Mordred and Uni-Mol features plus a multitask DNN stacker.

Architecture

Per-task ensemble: XGBoost ×30 seeds + LightGBM ×20 seeds, with Optuna 100-trial hyperparameter search (5-fold CV per task) → 12 task .pkl files.
Multitask DNN stacker (PyTorch, 120 epochs) → dnn_final.pt / dnn_best.pt.
Features (10,469-dim): Morgan/FCFP fingerprints (8,390) + Mordred descriptors (1,567) + Uni-Mol embeddings (512).
Preprocessing: scaler.pkl, mordred_imputer.pkl. Config: v4_info.json.
Training set: 7,823 molecules (Tox21).
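
The feature layout listed above can be sanity-checked with a short sketch. It assumes the three blocks arrive as plain lists of floats and are concatenated in the order given in the list; the function name `assemble_features` and that ordering are illustrative assumptions, not confirmed details of this repo's pipeline. Scaling and imputation (scaler.pkl, mordred_imputer.pkl) are deliberately left out.

```python
# Block sizes as stated in this model card.
FEATURE_BLOCKS = {
    "fingerprints": 8390,  # Morgan/FCFP fingerprints
    "mordred": 1567,       # Mordred descriptors
    "unimol": 512,         # Uni-Mol embedding
}
EXPECTED_DIM = 10469


def assemble_features(blocks):
    """Concatenate feature blocks in a fixed order, validating each length.

    `blocks` maps block name -> list of floats. The concatenation order
    (fingerprints, mordred, unimol) is an assumption for illustration.
    """
    vector = []
    for name, size in FEATURE_BLOCKS.items():
        values = blocks[name]
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vector.extend(values)
    return vector


# Placeholder all-zero blocks, used only to check the dimensions.
dummy_blocks = {name: [0.0] * size for name, size in FEATURE_BLOCKS.items()}
features = assemble_features(dummy_blocks)
print(len(features) == EXPECTED_DIM)
```

A length check of this kind is cheap insurance at inference time, since a missing or truncated block would otherwise silently shift every downstream feature index.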

Endpoints (12)

NR-AhR · NR-AR · NR-AR-LBD · NR-Aromatase · NR-ER · NR-ER-LBD · NR-PPAR-gamma · SR-ARE · SR-ATAD5 · SR-HSE · SR-MMP · SR-p53

Performance — internal 5-fold cross-validation AUC (Optuna)

Endpoint	CV AUC	Endpoint	CV AUC
NR-AhR	0.910	SR-ARE	0.858
NR-AR	0.828	SR-ATAD5	0.884
NR-AR-LBD	0.878	SR-HSE	0.808
NR-Aromatase	0.862	SR-MMP	0.929
NR-ER	0.747	SR-p53	0.880
NR-ER-LBD	0.854	NR-PPAR-gamma	0.847
		Mean	≈ 0.857
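
The mean in the table is the unweighted average over the 12 endpoints. A minimal sketch that reproduces it from the values above (the variable names are illustrative):

```python
# Per-endpoint 5-fold CV AUCs, copied from the table above.
cv_auc = {
    "NR-AhR": 0.910, "NR-AR": 0.828, "NR-AR-LBD": 0.878,
    "NR-Aromatase": 0.862, "NR-ER": 0.747, "NR-ER-LBD": 0.854,
    "NR-PPAR-gamma": 0.847, "SR-ARE": 0.858, "SR-ATAD5": 0.884,
    "SR-HSE": 0.808, "SR-MMP": 0.929, "SR-p53": 0.880,
}

# Unweighted mean across endpoints.
mean_auc = sum(cv_auc.values()) / len(cv_auc)
weakest = min(cv_auc, key=cv_auc.get)
strongest = max(cv_auc, key=cv_auc.get)

print(f"mean CV AUC = {mean_auc:.3f}")  # mean CV AUC = 0.857
print(f"weakest = {weakest}, strongest = {strongest}")
```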

⚠️ Note on metrics. Full-data (training) AUCs of ~0.98–1.0 are inflated by data leakage, because the model is fit and scored on the same data; they are reference-only and not indicative of generalization. The table above reports proper 5-fold cross-validation AUC. A standardized external evaluation (e.g., the MoleculeNet held-out split) is pending. For reference, the predecessor v3 (RandomForest on 217 RDKit descriptors) measured ~0.779 on an external MoleculeNet split.

Performance — scaffold-split held-out (precise, literature-comparable)

Measured on a Bemis–Murcko scaffold split (80/20, leak-free) using the saved features + tuned XGBoost hyperparameters:

Mean scaffold-holdout AUC = 0.790 (12 tasks)

Endpoint	AUC	Endpoint	AUC
NR-AhR	0.878	SR-ARE	0.782
NR-AR	0.757	SR-ATAD5	0.767
NR-AR-LBD	0.845	SR-HSE	0.753
NR-Aromatase	0.758	SR-MMP	0.872
NR-ER	0.727	SR-p53	0.783
NR-ER-LBD	0.779	NR-PPAR-gamma	0.783

This is the realistic estimate of generalization to novel chemical scaffolds, and the figure to compare against literature (MoleculeNet GNN SOTA ≈ 0.83–0.85; DeepTox challenge-test 0.846). The random 5-fold CV above (0.857) is optimistic relative to the scaffold split. Net: v4 is a strong gradient-boosting baseline (~0.79), on par with or slightly above predecessor v3 (0.779 external), and below current graph-neural-network SOTA. (Only the XGBoost core is shown; the full XGB+LGB+DNN ensemble is comparable.)
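
The idea behind a scaffold split can be sketched in a few lines: molecules are grouped by scaffold, and each group is assigned wholly to train or test so that no scaffold appears on both sides. The sketch below assumes the Bemis–Murcko scaffold strings have already been computed elsewhere (for example with RDKit, which is not used here). The largest-groups-to-train rule is a common convention and an assumption; the exact assignment rule used for the numbers above is not stated in this card.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split molecule indices so that no scaffold appears in both sets.

    `scaffolds[i]` is the (precomputed) scaffold key of molecule i.
    Largest scaffold groups are placed in train first (an assumed rule).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_total = len(scaffolds)
    train_cutoff = n_total - int(round(test_fraction * n_total))

    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy input: ten molecules sharing three scaffolds (illustration only).
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2
train_idx, test_idx = scaffold_split(scaffolds)
print(train_idx, test_idx)
```

Because whole groups are moved together, the realized test fraction only approximates the requested 20% on real data.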

Companion result — cross-family ensemble (reproducible, beats v4 alone)

Combining v4 with three independently-trained, cross-family models (two SMILES transformers + a graph neural net), evaluated on the same Bemis–Murcko scaffold split (n=1,565 test):

Model (scaffold-test)	Mean AUC
v4 (XGBoost descriptors, this repo)	0.789–0.791
MolFormer-XL fine-tune (3-seed)	0.776
ChemBERTa-77M-MTR fine-tune (3-seed)	0.779
from-scratch GIN graph net (3-seed)	0.765
v4 + MolFormer + ChemBERTa (equal-weight)	0.803
v4 + MolFormer + ChemBERTa + GIN (val-weighted)	0.805

The val-weighted ensemble learns non-negative model weights on a held-out validation set (weights ≈ v4 2.0 : each NN 0.75; no test-set tuning), letting the weaker-but-diverse graph net contribute at low weight. Result: ~0.805, which is +0.014 over v4 alone (taking the upper v4 figure of 0.791) and near the realistic scaffold-split ceiling.
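
How such weights are applied can be sketched as a normalized weighted average of per-model probabilities. The sketch covers only the blending step, using the approximate weights quoted above; the fitting procedure on the validation set is not described in this card and is not reproduced. The probabilities, labels, and model keys below are made up for illustration, and `roc_auc` is a small pairwise implementation written for this example, not a function from stack_ensemble.py.

```python
def weighted_ensemble(model_probs, weights):
    """Weighted average of per-model probabilities for a single task.

    model_probs: dict name -> list of probabilities (same molecule order).
    weights: dict name -> non-negative weight.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights[name] for name in model_probs)
    n = len(next(iter(model_probs.values())))
    return [
        sum(weights[name] * probs[i] for name, probs in model_probs.items()) / total
        for i in range(n)
    ]


def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in size; fine for a toy case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Approximate weights reported above (v4 2.0, each neural model 0.75).
weights = {"v4": 2.0, "molformer": 0.75, "chemberta": 0.75, "gin": 0.75}

# Made-up probabilities for four molecules on one endpoint.
toy_probs = {
    "v4": [0.9, 0.2, 0.6, 0.1],
    "molformer": [0.7, 0.4, 0.3, 0.2],
    "chemberta": [0.8, 0.3, 0.5, 0.3],
    "gin": [0.6, 0.5, 0.4, 0.4],
}
toy_labels = [1, 0, 1, 0]

blended = weighted_ensemble(toy_probs, weights)
print([round(p, 3) for p in blended], roc_auc(toy_labels, blended))
```

Dividing by the weight sum keeps the blended output on the probability scale, so it can be thresholded in the same way as any single model's output.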

Honesty note. Each transformer/GNN alone is below v4 on this scaffold split; the gain comes purely from cross-family error decorrelation (descriptor GBDT + SMILES transformers + graph net). Literature "MolFormer Tox21 ≈ 0.847 / GNN SOTA 0.83–0.85" figures are random-split; on the harder scaffold split (generalization to novel chemotypes) ~0.80–0.81 is near the realistic ceiling. The single models reproduce ~0.85 on a random internal val (matching random-split literature), confirming the split is the cause of the gap.

Reproduction scripts (de novo, leak-free, identical split): tox21_v5_molformer.py, tox21_chemberta.py, tox21_gin.py, stack_ensemble.py.

Intended use

Computational toxicity triage and prioritization for drug and nutraceutical candidates (e.g., flagging high-risk endpoints before wet-lab testing). Not a substitute for experimental toxicity assays.
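
As an illustration of the triage step, the sketch below flags endpoints whose predicted probability meets a cutoff. The function name `flag_endpoints`, the 0.5 threshold, and the example probabilities are all hypothetical; an appropriate threshold depends on the endpoint and on how false positives and false negatives are weighed.

```python
def flag_endpoints(probabilities, threshold=0.5):
    """Return (endpoint, probability) pairs at or above the threshold,
    sorted from highest to lowest predicted risk."""
    flagged = [(ep, p) for ep, p in probabilities.items() if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-endpoint predictions for one candidate molecule.
candidate = {"NR-AhR": 0.81, "SR-MMP": 0.64, "NR-ER": 0.35, "SR-p53": 0.12}

high_risk = flag_endpoints(candidate)
print(high_risk)  # [('NR-AhR', 0.81), ('SR-MMP', 0.64)]
```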

Limitations

In-silico predictions only. NR-ER is the weakest endpoint (CV AUC 0.747, scaffold-holdout AUC 0.727).
Inference requires the same feature pipeline (fingerprints + Mordred + Uni-Mol embeddings).

VIDRAFT · Aether PharmaOS · https://www.vidraft.net
