Tox21-v4 — VIDRAFT Aether PharmaOS Toxicity Models
12-endpoint Tox21 toxicity prediction models, trained de novo by VIDRAFT Aether PharmaOS.
Successor to VIDraft/tox21-v3-models, adding Mordred + Uni-Mol features and a multitask DNN stacker.
Architecture
- Per-task ensemble: XGBoost ×30 seeds + LightGBM ×20 seeds, with Optuna 100-trial hyperparameter search (5-fold CV per task) → 12 per-task .pkl files.
- Multitask DNN stacker (PyTorch, 120 epochs) → dnn_final.pt / dnn_best.pt.
- Features (10,469-dim): Morgan/FCFP fingerprints (8,390) + Mordred descriptors (1,567) + Uni-Mol embeddings (512).
- Preprocessing: scaler.pkl, mordred_imputer.pkl. Config: v4_info.json.
- Training set: 7,823 molecules (Tox21).
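The feature layout listed above can be sketched as a simple concatenation check. This is a minimal illustration, not the repository's pipeline: the block order (fingerprints, then Mordred, then Uni-Mol), the names `FEATURE_BLOCKS` and `assemble_features`, and the zero-filled placeholder arrays are all assumptions; only the block widths come from this card.

```python
import numpy as np

# Block widths as stated in this model card; the ORDER is an assumption.
FEATURE_BLOCKS = {"fingerprints": 8390, "mordred": 1567, "unimol": 512}
EXPECTED_DIM = 10469


def assemble_features(blocks):
    """Concatenate per-molecule feature blocks into one (n, 10469) matrix."""
    parts = []
    n_rows = None
    for name, width in FEATURE_BLOCKS.items():
        arr = np.asarray(blocks[name], dtype=np.float32)
        if arr.ndim != 2 or arr.shape[1] != width:
            raise ValueError(f"{name}: expected shape (n, {width}), got {arr.shape}")
        if n_rows is None:
            n_rows = arr.shape[0]
        elif arr.shape[0] != n_rows:
            raise ValueError(f"{name}: row count {arr.shape[0]} != {n_rows}")
        parts.append(arr)
    X = np.concatenate(parts, axis=1)
    if X.shape[1] != EXPECTED_DIM:
        raise ValueError(f"unexpected feature width {X.shape[1]}")
    return X


# Placeholder zeros stand in for real featurizer output (hypothetical data).
demo_blocks = {
    name: np.zeros((3, width), dtype=np.float32)
    for name, width in FEATURE_BLOCKS.items()
}
X_demo = assemble_features(demo_blocks)
print(X_demo.shape)
```

A width check of this kind is cheap insurance at inference time, since a missing or reordered block would otherwise feed the models silently misaligned columns.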
Endpoints (12)
NR-AhR · NR-AR · NR-AR-LBD · NR-Aromatase · NR-ER · NR-ER-LBD · NR-PPAR-gamma · SR-ARE · SR-ATAD5 · SR-HSE · SR-MMP · SR-p53
Performance — internal 5-fold cross-validation AUC (Optuna)
| Endpoint | CV AUC | Endpoint | CV AUC |
|---|---|---|---|
| NR-AhR | 0.910 | SR-ARE | 0.858 |
| NR-AR | 0.828 | SR-ATAD5 | 0.884 |
| NR-AR-LBD | 0.878 | SR-HSE | 0.808 |
| NR-Aromatase | 0.862 | SR-MMP | 0.929 |
| NR-ER | 0.747 | SR-p53 | 0.880 |
| NR-ER-LBD | 0.854 | NR-PPAR-gamma | 0.847 |
| Mean | ≈ 0.857 | | |
⚠️ Note on metrics (honesty). Full-data (training) AUCs of ~0.98–1.0 are computed on the same data the model was fit on, so they are inflated by leakage and are for reference only; they do not indicate generalization. The table above reports proper 5-fold cross-validation AUC. A standardized external (e.g., MoleculeNet held-out) evaluation is pending. For reference, the predecessor v3 (RDKit 217-descriptor RandomForest) measured ~0.779 on an external MoleculeNet split.
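To illustrate why full-data AUC is uninformative, here is a toy sketch on synthetic data. It does not use the v4 models: the 1-nearest-neighbor "memorizer", the random features, and the signal-free random labels are all assumptions chosen to make the effect obvious.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def one_nn_predict(X_fit, y_fit, X_query):
    """Toy memorizing model: return the label of the nearest fitted point."""
    d2 = ((X_query[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    return y_fit[d2.argmin(axis=1)]


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = rng.integers(0, 2, size=400)  # labels carry no signal at all

# "Full-data" AUC: fit and score on the same samples.
train_auc = roc_auc(y, one_nn_predict(X, y, X))

# 5-fold CV AUC: every sample is scored by a model that never saw it.
folds = np.array_split(rng.permutation(len(y)), 5)
cv_aucs = []
for held_out in folds:
    fit_idx = np.setdiff1d(np.arange(len(y)), held_out)
    preds = one_nn_predict(X[fit_idx], y[fit_idx], X[held_out])
    cv_aucs.append(roc_auc(y[held_out], preds))
cv_auc = float(np.mean(cv_aucs))

print(f"full-data AUC = {train_auc:.3f}, 5-fold CV AUC = {cv_auc:.3f}")
```

The memorizer scores perfectly on the data it was fit on even though the labels are pure noise, while the cross-validated figure stays near chance. That is the same reason the ~0.98–1.0 training AUCs are treated as reference-only here.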
Performance — scaffold-split held-out (precise, literature-comparable)
Measured on a Bemis–Murcko scaffold split (80/20, leak-free) using the saved features + tuned XGBoost hyperparameters:
Mean scaffold-holdout AUC = 0.790 (12 tasks)
| Endpoint | AUC | Endpoint | AUC |
|---|---|---|---|
| NR-AhR | 0.878 | SR-ARE | 0.782 |
| NR-AR | 0.757 | SR-ATAD5 | 0.767 |
| NR-AR-LBD | 0.845 | SR-HSE | 0.753 |
| NR-Aromatase | 0.758 | SR-MMP | 0.872 |
| NR-ER | 0.727 | SR-p53 | 0.783 |
| NR-ER-LBD | 0.779 | NR-PPAR-gamma | 0.783 |
This is the realistic estimate of generalization to novel chemical scaffolds, and the figure to compare against the literature (MoleculeNet GNN SOTA ≈ 0.83–0.85; DeepTox challenge-test 0.846). The random 5-fold CV figure above (0.857) is optimistic relative to the scaffold split. In short, v4 is a strong gradient-boosting baseline (~0.79), on par with or slightly above predecessor v3 (0.779 external), and below current graph-neural-network SOTA. (Only the XGBoost core is shown; the full XGB+LGB+DNN ensemble is comparable.)
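A scaffold split of the kind described above can be sketched as follows. This is one common deterministic variant (largest scaffold groups go to train first), not necessarily the exact procedure used for the numbers in this card. The function name `scaffold_split` is hypothetical, and the scaffold strings are assumed to be precomputed, for example with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8):
    """Split molecule indices so that no scaffold appears in both train and test.

    `scaffolds` holds one precomputed Bemis-Murcko scaffold string per molecule
    (an assumption of this sketch; acyclic molecules typically map to "").
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index to keep the split deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    cutoff = frac_train * len(scaffolds)
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Toy scaffold labels for ten hypothetical molecules.
toy_scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] + [""]
train_idx, test_idx = scaffold_split(toy_scaffolds, frac_train=0.8)

train_scaffolds = {toy_scaffolds[i] for i in train_idx}
test_scaffolds = {toy_scaffolds[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_scaffolds & test_scaffolds)
```

Because whole scaffold groups are assigned to one side, every test molecule has a ring system the model never saw during training, which is why this split scores lower than random 5-fold CV.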
Companion result — cross-family ensemble (reproducible, beats v4 alone)
v4 was combined with three independently trained, cross-family models (two SMILES transformers and a graph neural net) and evaluated on the same Bemis–Murcko scaffold split (n = 1,565 test molecules):
| Model (scaffold-test) | Mean AUC |
|---|---|
| v4 (XGBoost descriptors, this repo) | 0.789–0.791 |
| MolFormer-XL fine-tune (3-seed) | 0.776 |
| ChemBERTa-77M-MTR fine-tune (3-seed) | 0.779 |
| from-scratch GIN graph net (3-seed) | 0.765 |
| v4 + MolFormer + ChemBERTa (equal-weight) | 0.803 |
| v4 + MolFormer + ChemBERTa + GIN (val-weighted) | 0.805 |
The val-weighted ensemble learns non-negative model weights on a held-out validation set (weights ≈ 2.0 for v4 and 0.75 for each neural model; no test-set tuning), which lets the weaker but diverse graph net contribute at low weight. Result: ~0.805, about +0.014 over v4 alone and near the realistic scaffold-split ceiling.
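Applying such weights amounts to a normalized weighted average of each model's predicted probabilities. The sketch below shows only that blending step, using the approximate weights quoted above; how the weights were fitted is not shown, and the function name `blend`, the model keys, and the constant placeholder predictions are assumptions for illustration.

```python
import numpy as np

# Approximate weights quoted in this card (v4 2.0, each neural model 0.75).
WEIGHTS = {"v4": 2.0, "molformer": 0.75, "chemberta": 0.75, "gin": 0.75}


def blend(preds, weights):
    """Weighted average of per-model probability arrays of shape (n_molecules, n_tasks)."""
    names = list(preds)
    w = np.array([weights[name] for name in names], dtype=float)
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()  # normalize so the blend stays in [0, 1]
    stacked = np.stack([np.asarray(preds[name], dtype=float) for name in names])
    return np.tensordot(w, stacked, axes=1)  # -> (n_molecules, n_tasks)


# Placeholder predictions for 2 molecules x 12 endpoints (hypothetical values).
demo_preds = {
    "v4": np.full((2, 12), 0.9),
    "molformer": np.full((2, 12), 0.1),
    "chemberta": np.full((2, 12), 0.1),
    "gin": np.full((2, 12), 0.1),
}
blended = blend(demo_preds, WEIGHTS)
print(blended.shape, round(float(blended[0, 0]), 4))
```

With these weights v4 carries 2.0 / 4.25 ≈ 47% of the vote and each neural model about 18%, so the three neural models together can still shift a prediction where they agree against v4.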
Honesty note. Each transformer/GNN alone is below v4 on this scaffold split; the gain comes purely from cross-family error decorrelation (descriptor GBDT + SMILES transformers + graph net). Literature figures such as "MolFormer Tox21 ≈ 0.847 / GNN SOTA 0.83–0.85" are random-split; on the harder scaffold split (generalization to novel chemotypes), ~0.80–0.81 is near the realistic ceiling. The single models reproduce ~0.85 on a random internal validation split (matching the random-split literature), confirming that the split is the cause of the gap.
Reproduction scripts (de novo, leak-free, identical split): tox21_v5_molformer.py, tox21_chemberta.py, tox21_gin.py, stack_ensemble.py.
Intended use
Computational toxicity triage / prioritization for drug and nutraceutical candidates (e.g., flagging high-risk endpoints before wet-lab testing). Not a substitute for experimental toxicity assays.
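As a minimal sketch of the triage step, the snippet below flags endpoints whose predicted probability crosses a cutoff. The 0.5 threshold, the function name `flag_endpoints`, and the example probabilities are all assumptions; a real cutoff would need to be calibrated per endpoint.

```python
ENDPOINTS = [
    "NR-AhR", "NR-AR", "NR-AR-LBD", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


def flag_endpoints(probs, threshold=0.5):
    """Return (endpoint, probability) pairs at or above `threshold`, highest first."""
    missing = [name for name in ENDPOINTS if name not in probs]
    if missing:
        raise ValueError(f"missing predictions for: {missing}")
    flagged = [(name, probs[name]) for name in ENDPOINTS if probs[name] >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-endpoint probabilities for one candidate molecule.
example = {name: 0.05 for name in ENDPOINTS}
example["SR-MMP"] = 0.81
example["NR-AhR"] = 0.62

flags = flag_endpoints(example)
print(flags)
```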
Limitations
- In-silico predictions only; the NR-ER endpoint is the weakest (CV AUC 0.747; scaffold-split AUC 0.727).
- Inference requires the same feature pipeline (fingerprints + Mordred + Uni-Mol embeddings).
VIDRAFT · Aether PharmaOS · https://www.vidraft.net