Tox21-v4 — VIDRAFT Aether PharmaOS Toxicity Models

Twelve-endpoint Tox21 toxicity prediction models, trained de novo by VIDRAFT Aether PharmaOS. Successor to VIDraft/tox21-v3-models; v4 adds Mordred and Uni-Mol features and a multitask DNN stacker.

Architecture

  • Per-task ensemble: XGBoost ×30 seeds + LightGBM ×20 seeds, with Optuna 100-trial hyperparameter search (5-fold CV per task) → 12 task .pkl files.
  • Multitask DNN stacker (PyTorch, 120 epochs) → dnn_final.pt / dnn_best.pt.
  • Features (10,469-dim): Morgan/FCFP fingerprints (8,390) + Mordred descriptors (1,567) + Uni-Mol embeddings (512).
  • Preprocessing: scaler.pkl, mordred_imputer.pkl. Config: v4_info.json.
  • Training set: 7,823 molecules (Tox21).
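The feature layout listed above can be sanity-checked before inference. The following is a minimal sketch, assuming the three blocks are concatenated in the order fingerprints, Mordred, Uni-Mol; that order, and the names `FEATURE_BLOCKS`, `block_slices`, and `check_row`, are illustrative assumptions and are not taken from v4_info.json.

```python
from itertools import accumulate

# Block sizes come from the feature list above. The concatenation ORDER is an
# assumption made for illustration; check v4_info.json for the real layout.
FEATURE_BLOCKS = {
    "fingerprints": 8390,  # Morgan/FCFP fingerprint bits
    "mordred": 1567,       # Mordred descriptors
    "unimol": 512,         # Uni-Mol embedding
}
EXPECTED_DIM = 10469


def block_slices(blocks):
    """Return {name: (start, stop)} offsets into the concatenated vector."""
    stops = list(accumulate(blocks.values()))
    starts = [0] + stops[:-1]
    return {name: (a, b) for name, a, b in zip(blocks, starts, stops)}


def check_row(row, blocks=FEATURE_BLOCKS, expected=EXPECTED_DIM):
    """Raise ValueError if a feature row does not have the expected width."""
    total = sum(blocks.values())
    if total != expected:
        raise ValueError(f"block sizes sum to {total}, expected {expected}")
    if len(row) != expected:
        raise ValueError(f"row has {len(row)} features, expected {expected}")
    return True


slices = block_slices(FEATURE_BLOCKS)
print(slices)
```

A width check like this catches the most common inference failure, a missing or truncated feature block, before the row reaches the scaler or the models.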

Endpoints (12)

NR-AhR · NR-AR · NR-AR-LBD · NR-Aromatase · NR-ER · NR-ER-LBD · NR-PPAR-gamma · SR-ARE · SR-ATAD5 · SR-HSE · SR-MMP · SR-p53

Performance — internal 5-fold cross-validation AUC (Optuna)

Endpoint     | CV AUC | Endpoint      | CV AUC
NR-AhR       | 0.910  | SR-ARE        | 0.858
NR-AR        | 0.828  | SR-ATAD5      | 0.884
NR-AR-LBD    | 0.878  | SR-HSE        | 0.808
NR-Aromatase | 0.862  | SR-MMP        | 0.929
NR-ER        | 0.747  | SR-p53        | 0.880
NR-ER-LBD    | 0.854  | NR-PPAR-gamma | 0.847

Mean ≈ 0.857
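As a quick arithmetic check, the reported mean can be recomputed from the twelve per-endpoint values in the table. This sketch uses only the numbers printed above; the variable names are illustrative.

```python
from statistics import mean

# Per-endpoint 5-fold CV AUCs, copied from the table above.
cv_auc = {
    "NR-AhR": 0.910, "NR-AR": 0.828, "NR-AR-LBD": 0.878,
    "NR-Aromatase": 0.862, "NR-ER": 0.747, "NR-ER-LBD": 0.854,
    "NR-PPAR-gamma": 0.847, "SR-ARE": 0.858, "SR-ATAD5": 0.884,
    "SR-HSE": 0.808, "SR-MMP": 0.929, "SR-p53": 0.880,
}

mean_cv = mean(cv_auc.values())           # unweighted mean over the 12 tasks
weakest = min(cv_auc, key=cv_auc.get)     # endpoint with the lowest CV AUC
strongest = max(cv_auc, key=cv_auc.get)   # endpoint with the highest CV AUC

print(f"mean={mean_cv:.3f} weakest={weakest} strongest={strongest}")
```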

⚠️ Note on metrics (honesty). Full-data (training) AUCs of ~0.98–1.0 are computed on the same molecules the model was fit on, so they are inflated by leakage and are reference-only, not indicative of generalization. The table above reports 5-fold cross-validation AUC. A standardized external evaluation (e.g., on a MoleculeNet held-out set) is pending. For reference, the predecessor v3 (RandomForest on 217 RDKit descriptors) measured ~0.779 on an external MoleculeNet split.

Performance — scaffold-split held-out (precise, literature-comparable)

Measured on a Bemis–Murcko scaffold split (80/20, leak-free) using the saved features + tuned XGBoost hyperparameters:

Mean scaffold-holdout AUC = 0.790 (12 tasks)

Endpoint     | AUC   | Endpoint      | AUC
NR-AhR       | 0.878 | SR-ARE        | 0.782
NR-AR        | 0.757 | SR-ATAD5      | 0.767
NR-AR-LBD    | 0.845 | SR-HSE        | 0.753
NR-Aromatase | 0.758 | SR-MMP        | 0.872
NR-ER        | 0.727 | SR-p53        | 0.783
NR-ER-LBD    | 0.779 | NR-PPAR-gamma | 0.783

This is the realistic estimate of generalization to novel chemical scaffolds, and the figure to compare against literature (MoleculeNet GNN SOTA ≈ 0.83–0.85; DeepTox challenge-test 0.846). The random 5-fold CV above (0.857) is optimistic relative to the scaffold split. Net: v4 is a strong gradient-boosting baseline (~0.79), on par with or slightly above the predecessor v3 (0.779 on an external MoleculeNet split), and below current graph-neural-network SOTA. (Results are for the XGBoost core; the full XGB+LGB+DNN ensemble is comparable.)
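For readers unfamiliar with scaffold splitting, the idea behind the 80/20 Bemis–Murcko split can be sketched as below. This is not the repository's split code: it assumes each molecule already has a scaffold key (for example a scaffold SMILES, which could be obtained with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`), and it uses a common greedy heuristic that fills the training set with the largest scaffold groups first. The function name `scaffold_split` and the toy keys are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_frac=0.2):
    """Split molecule indices so that no scaffold appears in both sets.

    `scaffolds` holds one scaffold key per molecule. Whole scaffold groups
    are assigned to train (largest first) until the train budget is used;
    groups that do not fit go to test.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = n - int(round(test_frac * n))
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy scaffold keys (made up for illustration): 10 molecules, 4 scaffolds.
toy = ["c1ccccc1"] * 4 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2 + ["C1CC1"]
train_idx, test_idx = scaffold_split(toy)
print(len(train_idx), len(test_idx))
```

Because whole scaffold groups move together, every test molecule has a ring system that the model never saw during training, which is why scaffold-split AUCs run lower than random-split AUCs.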

Companion result — cross-family ensemble (reproducible, beats v4 alone)

Combining v4 with three independently-trained, cross-family models (two SMILES transformers + a graph neural net), evaluated on the same Bemis–Murcko scaffold split (n=1,565 test):

Model (scaffold-test)                            | Mean AUC
v4 (XGBoost descriptors, this repo)              | 0.789–0.791
MolFormer-XL fine-tune (3-seed)                  | 0.776
ChemBERTa-77M-MTR fine-tune (3-seed)             | 0.779
from-scratch GIN graph net (3-seed)              | 0.765
v4 + MolFormer + ChemBERTa (equal-weight)        | 0.803
v4 + MolFormer + ChemBERTa + GIN (val-weighted)  | 0.805

The val-weighted ensemble learns non-negative model weights on a held-out validation set (approximately 2.0 for v4 and 0.75 for each neural model; no test-set tuning), letting the weaker-but-diverse graph net contribute at low weight. Result: ~0.805, which is +0.014 over v4 alone (0.789–0.791) and near the realistic scaffold-split ceiling.
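The blending step can be illustrated with a small, dependency-free sketch. It is not the code in stack_ensemble.py: the model card does not say how the weights are fitted, so a coarse grid search over non-negative weights is used here as a stand-in, and all predictions below are made-up toy numbers. The function names `roc_auc`, `blend`, and `fit_weights` are hypothetical.

```python
from itertools import product


def roc_auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ordered
    correctly, with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def blend(preds_by_model, weights):
    """Weighted average of per-model probabilities (weights must be >= 0)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    n = len(next(iter(preds_by_model.values())))
    return [
        sum(w * preds_by_model[m][i] for m, w in weights.items()) / total
        for i in range(n)
    ]


def fit_weights(val_labels, val_preds, grid=(0.0, 0.25, 0.5, 0.75, 1.0, 2.0)):
    """Stand-in fitting step: pick the grid point with the best validation AUC."""
    names = list(val_preds)
    best_w, best_auc = None, -1.0
    for combo in product(grid, repeat=len(names)):
        if sum(combo) == 0:
            continue  # all-zero weights are undefined
        w = dict(zip(names, combo))
        auc = roc_auc(val_labels, blend(val_preds, w))
        if auc > best_auc:
            best_w, best_auc = w, auc
    return best_w, best_auc


# Toy validation data for one endpoint (made up for illustration).
val_labels = [1, 0, 1, 0, 1, 0, 0, 1]
val_preds = {
    "v4":        [0.9, 0.2, 0.4, 0.5, 0.8, 0.1, 0.3, 0.6],
    "molformer": [0.7, 0.4, 0.6, 0.3, 0.5, 0.6, 0.2, 0.8],
    "chemberta": [0.6, 0.3, 0.7, 0.4, 0.4, 0.2, 0.5, 0.7],
    "gin":       [0.5, 0.5, 0.8, 0.2, 0.6, 0.4, 0.6, 0.5],
}
weights, val_auc = fit_weights(val_labels, val_preds)
print(weights, round(val_auc, 3))
```

Weights are chosen on validation data only and then frozen before scoring the test set, which mirrors the "no test-set tuning" constraint stated above.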

Honesty note. Each transformer/GNN alone is below v4 on this scaffold split; the gain comes purely from cross-family error decorrelation (descriptor GBDT + SMILES transformers + graph net). Literature "MolFormer Tox21 ≈ 0.847 / GNN SOTA 0.83–0.85" figures are random-split; on the harder scaffold split (generalization to novel chemotypes) ~0.80–0.81 is near the realistic ceiling. The single models reproduce ~0.85 on a random internal val (matching random-split literature), confirming the split is the cause of the gap.

Reproduction scripts (de novo, leak-free, identical split): tox21_v5_molformer.py, tox21_chemberta.py, tox21_gin.py, stack_ensemble.py.

Intended use

Computational toxicity triage and prioritization for drug and nutraceutical candidates (e.g., flagging high-risk endpoints before wet-lab testing). Not a substitute for experimental toxicity assays.
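A triage step of this kind might look like the sketch below, which ranks the endpoints whose predicted probability exceeds a cutoff. The cutoff of 0.5, the function name `flag_endpoints`, and the example probabilities are assumptions for illustration; the model card does not specify decision thresholds, and in practice they would be calibrated per endpoint.

```python
# The 12 Tox21 endpoints listed in this model card.
ENDPOINTS = [
    "NR-AhR", "NR-AR", "NR-AR-LBD", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


def flag_endpoints(probs, threshold=0.5):
    """Return endpoints with probability >= threshold, highest risk first.

    `probs` maps endpoint name -> predicted probability of activity.
    `threshold` is a hypothetical cutoff, not a value from the model card.
    """
    unknown = set(probs) - set(ENDPOINTS)
    if unknown:
        raise ValueError(f"unknown endpoints: {sorted(unknown)}")
    flagged = [e for e, p in probs.items() if p >= threshold]
    return sorted(flagged, key=lambda e: -probs[e])


# Made-up predictions for a single candidate molecule.
example = {"NR-AhR": 0.81, "SR-MMP": 0.64, "NR-ER": 0.35, "SR-p53": 0.12}
print(flag_endpoints(example))
```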

Limitations

  • In-silico predictions only; NR-ER is the weakest endpoint (CV AUC 0.747, scaffold-holdout AUC 0.727).
  • Inference requires the same feature pipeline (fingerprints + Mordred + Uni-Mol embeddings).

VIDRAFT · Aether PharmaOS · https://www.vidraft.net
