MillerBind-Open v1

An open-weight, fully reproducible reference model for protein-ligand binding-affinity prediction.

Trained from scratch on public RCSB PDB / BindingDB data, using only publicly disclosed mathematics (see 01_build_dataset.py and 02_featurize_and_train.py in this repository for the complete, runnable training recipe, from data collection through final weights).

Author / inventor: William T. L. Miller
License: CC BY-NC 4.0 (free for research and non-commercial use)
Paper: PAPER.md / PAPER.pdf (full technical report: methods, data pipeline, honest evaluation, limitations)


What this is

MillerBind-Open is the open-weight member of the MillerBind family of structure-based binding-affinity models. It implements the publicly-disclosed core of the MillerBind method (patent pending, US provisional application no. 64/102,152):

  1. HIN-12 atom classification: every atom is folded into one of 12 harmonic classes by its atomic number, HIN(Z) = 1 + ((Z - 1) mod 12).
  2. Contact histogram features: for every protein-ligand atom pair within 8 Å, the model accumulates a 12×12 raw contact-count histogram and a 12×12 distance-weighted contact histogram (297 features total, including geometry and phase-coherence summary statistics).
  3. Phase-coherence ratio: the publicly disclosed modular rule (|HIN_p − HIN_l| mod 3 == 0) is used to compute a constructive/destructive contact ratio.
  4. An ExtraTrees regressor, trained end-to-end on these raw features to predict a pKd-equivalent binding affinity.
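
The first three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in 02_featurize_and_train.py: the function names (hin, contact_features) are hypothetical, the 1/d distance weight is an assumption (the list above says only that the second histogram is distance-weighted), and the ratio is taken here as constructive contacts divided by destructive contacts, with a guard for a zero denominator.

```python
import numpy as np

CUTOFF = 8.0  # contact cutoff in angstroms, as stated above


def hin(z: int) -> int:
    """Fold an atomic number into one of 12 classes: 1 + ((Z - 1) mod 12)."""
    return 1 + ((z - 1) % 12)


def contact_features(p_z, p_xyz, l_z, l_xyz, cutoff=CUTOFF):
    """Return (raw 12x12 counts, weighted 12x12 histogram, coherence ratio)."""
    p_cls = np.array([hin(z) for z in p_z])
    l_cls = np.array([hin(z) for z in l_z])
    p_xyz = np.asarray(p_xyz, dtype=float)
    l_xyz = np.asarray(l_xyz, dtype=float)

    # Pairwise protein-ligand distances, shape (n_protein, n_ligand).
    dist = np.linalg.norm(p_xyz[:, None, :] - l_xyz[None, :, :], axis=-1)
    pi, li = np.nonzero(dist <= cutoff)

    raw = np.zeros((12, 12))
    weighted = np.zeros((12, 12))
    # np.add.at accumulates correctly when the same bin is hit repeatedly.
    np.add.at(raw, (p_cls[pi] - 1, l_cls[li] - 1), 1.0)
    # ASSUMPTION: 1/d weighting, floored to avoid division by zero.
    np.add.at(weighted, (p_cls[pi] - 1, l_cls[li] - 1),
              1.0 / np.maximum(dist[pi, li], 1e-6))

    # Phase-coherence rule from the text: |HIN_p - HIN_l| mod 3 == 0.
    constructive = np.abs(p_cls[pi] - l_cls[li]) % 3 == 0
    n_con = int(constructive.sum())
    n_des = int(constructive.size - n_con)
    ratio = n_con / max(n_des, 1)  # ASSUMPTION: ratio form and zero guard
    return raw, weighted, ratio


# Toy complex: protein C, N, O against ligand C, O (atomic numbers).
raw, weighted, ratio = contact_features(
    [6, 7, 8], [[0, 0, 0], [3, 0, 0], [20, 0, 0]],
    [6, 8], [[0, 0, 4], [3, 0, 4]],
)
print(raw.sum(), weighted.sum(), ratio)
```

In the toy complex the third protein atom lies beyond the cutoff, so only four pairs are counted, and only the carbon-carbon pair satisfies the modular rule.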

This is a real, working, independently trained model, not a stub, demo, or placeholder. Run it on any protein-ligand complex and get a genuine prediction (see How to Use below).

What this is NOT

This is not MillerBind's private production model (v9 / v12), which powers the hosted platform and is validated at state-of-the-art accuracy on CASF-2016 and the TDC BindingDB_Kd leaderboard. The production models use a calibrated 12×12 atom-pair compatibility matrix, a calibrated 9×9 residue-pair matrix, calibrated HTOE energy-boost constants, and a calibrated XGBoost+ExtraTrees ensemble blend. All of these are trade secrets; they are not included here, not derived here, and were not used to build this repository in any way.

MillerBind-Open instead lets the regressor learn pairwise interaction importance directly from a small public training set, with no precomputed compatibility matrix at all. It is intentionally simpler, and its accuracy (reported below) is correspondingly weaker than that of the production system.

How to Use

Installation

git clone https://huggingface.co/williamTLmiller/millerbind-open-v1
cd millerbind-open-v1
pip install -r requirements.txt

Command line

python predict.py --complex your_complex.pdb --ligand-resname LIG

Or with separate protein/ligand files:

python predict.py --protein protein.pdb --ligand ligand.pdb --ligand-resname LIG

Example (using the bundled example_4dkl.pdb: PDB 4DKL, the μ-opioid receptor bound to the antagonist BF0, a structure with no public binding-affinity annotation, so it is genuinely held out of this model's training data):

python predict.py --complex example_4dkl.pdb --ligand-resname BF0
Protein atoms: 3476
Ligand atoms:  33

==========================================
MillerBind-Open v1 Prediction
==========================================
Predicted pAffinity (pKd-equivalent): 6.72
Predicted Kd-equivalent:              189.95 nM
Affinity:                             Moderate (uM)
==========================================
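
The two numeric lines in this output are related through the pAffinity definition used for the training labels (pAffinity = -log10 of the value in mol/L). The sketch below shows that conversion. It assumes predict.py applies this standard inversion internally, and the helper names are hypothetical.

```python
import math


def p_affinity_to_kd_nm(p_affinity: float) -> float:
    """Invert pAffinity = -log10(Kd in mol/L) and express Kd in nM."""
    return 10.0 ** (9.0 - p_affinity)


def kd_nm_to_p_affinity(kd_nm: float) -> float:
    """Forward direction: nM -> mol/L -> negative log10."""
    return -math.log10(kd_nm * 1e-9)


# The printed 189.95 nM corresponds to an unrounded pAffinity near 6.7214,
# which displays as 6.72 at two decimals.
print(round(kd_nm_to_p_affinity(189.95), 4))
print(round(p_affinity_to_kd_nm(6.72), 1))
```

Converting the rounded 6.72 back gives roughly 190.5 nM rather than 189.95 nM; the gap comes from display rounding of pAffinity, not from the model.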

Python API

from predict import parse_protein, parse_ligand, predict

p_atoms, p_coords = parse_protein("protein.pdb")
l_atoms, l_coords = parse_ligand("ligand.pdb", resname="LIG")
p_affinity, kd_nm = predict(p_atoms, p_coords, l_atoms, l_coords)
print(f"pAffinity={p_affinity:.2f}  Kd={kd_nm:.1f} nM")

Training data

621 protein-ligand complexes were assembled live from RCSB PDB's own public rcsb_binding_affinity annotations (sourced from BindingDB and queried via RCSB's public GraphQL API; see 01_build_dataset.py). After the contact filter described under Training procedure, 617 remain. For each entry:

  • A primary binding measurement was selected (preferring Kd, then Ki, then IC50/EC50; median taken when multiple measurements of the same type exist) and converted to a pKd-equivalent: pAffinity = -log10(value_in_M).
  • The actual 3D structure was downloaded directly from RCSB PDB (public domain).
  • Protein atoms (ATOM records) and the specific annotated ligand (HETATM records matching the annotation's component ID) were parsed.
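
The label-selection rule in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the code in 01_build_dataset.py: the input format (a list of (type, value in nM) tuples), the function name, and the choice to rank IC50 ahead of EC50 are all hypothetical. The text above says only that Kd is preferred, then Ki, then IC50/EC50.

```python
import math
from statistics import median

# ASSUMPTION: IC50 is ranked ahead of EC50; the source groups them together.
PREFERENCE = ("Kd", "Ki", "IC50", "EC50")


def primary_p_affinity(measurements):
    """Pick the preferred measurement type, take its median, return pAffinity.

    `measurements` is a hypothetical list of (type, value_in_nM) tuples.
    Returns (type, pAffinity), or None if no usable measurement exists.
    """
    for kind in PREFERENCE:
        values = [v for t, v in measurements if t == kind and v > 0]
        if values:
            value_in_m = median(values) * 1e-9  # nM -> mol/L
            return kind, -math.log10(value_in_m)
    return None


# Three Ki values and one IC50: Ki wins, and its median (100 nM) is used.
example = [("IC50", 50.0), ("Ki", 10.0), ("Ki", 1000.0), ("Ki", 100.0)]
print(primary_p_affinity(example))
```

Taking the median within one measurement type keeps a single outlying assay from dominating the label, at the cost of discarding the lower-priority types entirely.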

This is not a redistribution of any third-party curated dataset (e.g. PDBbind): every data point is fetched live from RCSB's own public APIs by the included scripts, so anyone can reproduce the entire pipeline without special access.

Final affinity range: pAffinity 2.19–11.15 (a ~9-log-unit span, i.e. a genuinely diverse set of strong, moderate, and weak binders), mean 6.82, std 1.58.

Training procedure

See 01_build_dataset.py (data collection) and 02_featurize_and_train.py (featurization + training) for the complete, runnable recipe. Summary:

  • 80/20 random train/test split (random_state=42), 493 train / 124 test.
  • sklearn.ensemble.ExtraTreesRegressor: 400 trees, max_depth=12, min_samples_leaf=3, max_features="sqrt".
  • The released model.joblib is refit on the full 617-complex dataset (after dropping a handful of complexes with fewer than 3 contacts); metrics below are from the held-out split before that final refit.
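
The split and regressor settings listed above map directly onto scikit-learn calls. The sketch below uses random placeholder features and labels, not the real dataset, so the fitted model is meaningless; it only shows the configuration. The value test_size=0.2 is inferred from "80/20", and the estimator's random_state is an assumption, since the summary gives a seed only for the split.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((617, 297))        # placeholder: 617 complexes x 297 features
y = rng.normal(6.82, 1.58, 617)   # placeholder labels with the reported mean/std

# 80/20 random split with the documented seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))

model = ExtraTreesRegressor(
    n_estimators=400,
    max_depth=12,
    min_samples_leaf=3,
    max_features="sqrt",
    random_state=42,  # ASSUMPTION: estimator seed is not stated in the summary
)
model.fit(X_train, y_train)
held_out_pred = model.predict(X_test)  # held-out metrics would be computed here

# Final refit on all complexes, as described for the released model.joblib.
model.fit(X, y)
```

With 617 rows, scikit-learn rounds the test fraction up, which yields the 493 / 124 split quoted above.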

Evaluation (held-out test set, n=124)

Metric              Value
Pearson R           0.623
Spearman ρ          0.593
MAE (pAffinity)     0.999
RMSE (pAffinity)    1.238
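
For readers reproducing these numbers, the four metrics can be computed from arrays of true and predicted pAffinity values as below. This is a NumPy-only sketch with hypothetical function names and toy inputs; the repository's own evaluation code may use SciPy or scikit-learn helpers instead.

```python
import numpy as np


def average_ranks(x):
    """Rank values from 1..n, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def regression_metrics(y_true, y_pred):
    """Pearson R, Spearman rho, MAE, and RMSE for paired predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "pearson_r": float(np.corrcoef(y_true, y_pred)[0, 1]),
        # Spearman rho is the Pearson correlation of the ranks.
        "spearman_rho": float(
            np.corrcoef(average_ranks(y_true), average_ranks(y_pred))[0, 1]
        ),
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }


# Toy values only; these are not the model's held-out predictions.
metrics = regression_metrics([5.0, 6.0, 7.0, 8.0], [5.5, 5.9, 7.4, 7.6])
print(metrics)
```

MAE and RMSE are in pAffinity units, so an MAE near 1.0 means predictions are off by about one order of magnitude in Kd on average.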

For context: AutoDock Vina scores R≈0.60 on the CASF-2016 benchmark. This open model's R≈0.62 on its own (different, smaller, more diverse) held-out set is a believable result for a from-scratch baseline with ~500 training examples and no calibrated chemistry knowledge. As expected for this simpler design, it is nowhere near the private production models' published CASF-2016 results (R=0.890 for v9, R=0.938 for v12). See metrics.json for the full held-out PDB ID list used in this evaluation.

This is not a claim of state-of-the-art performance. It is an honest, reproducible, small-data reference baseline.

Intended use

  • Research, education, and benchmarking of structure-based binding-affinity methods.
  • A transparent, fully-reproducible reference implementation of the publicly-disclosed MillerBind feature-engineering approach (HIN-12 folding, contact histograms, phase-coherence).
  • A starting point for further open research: the training scripts are included specifically so the community can extend, retrain on a larger public corpus, or build on this baseline.

Not intended for: clinical use, FDA submissions, or any decision with real-world health or safety consequences. Predictions are not validated for production drug-discovery decisions; for that, see the commercially licensed production models.

Limitations

  • Trained on only 617 complexes, which is small by modern ML standards. Expect higher variance and lower accuracy than models trained on the full PDBbind corpus (18,000+ complexes).
  • No explicit water modeling, no pose generation/docking, no learned compatibility matrix: the model only sees raw geometric contact histograms.
  • Ligand selection in --complex mode without --ligand-resname uses a simple heuristic (first non-solvent/non-ion HETATM group) and may pick the wrong group in complexes with multiple bound ligands; specify --ligand-resname explicitly for reliable results.
  • No uncertainty quantification beyond the held-out metrics above.
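
The ligand-selection pitfall in the third bullet is easier to see in code. The sketch below is a guess at what a "first non-solvent/non-ion HETATM group" heuristic looks like, not the logic in predict.py: the function names and the exclusion list are hypothetical. It reads residue name, chain, and residue number from the standard fixed-column PDB positions.

```python
# ASSUMPTION: an illustrative exclusion list; predict.py may use a different one.
SOLVENT_OR_ION = {"HOH", "WAT", "DOD", "NA", "CL", "K", "MG", "CA", "ZN", "SO4", "PO4"}


def first_ligand_group(pdb_lines, resname=None):
    """Return (resname, chain, resseq) of the chosen HETATM group, or None.

    With `resname` given, the first group with that name wins; otherwise the
    first group that is not solvent or a common ion is returned.
    """
    for line in pdb_lines:
        if not line.startswith("HETATM"):
            continue
        name = line[17:20].strip()    # residue name, columns 18-20
        chain = line[21:22]           # chain ID, column 22
        resseq = line[22:26].strip()  # residue number, columns 23-26
        if resname is not None:
            if name == resname:
                return name, chain, resseq
        elif name not in SOLVENT_OR_ION:
            return name, chain, resseq
    return None


def hetatm(serial, atom, resname, chain, resseq):
    """Build only the identifying columns of a HETATM record (demo helper)."""
    return f"HETATM{serial:>5} {atom:<4} {resname:>3} {chain}{resseq:>4}"


lines = [
    hetatm(1, "O", "HOH", "A", 501),   # water, skipped
    hetatm(2, "C1", "GOL", "A", 601),  # glycerol additive
    hetatm(3, "C1", "LIG", "A", 701),  # the intended ligand
]
print(first_ligand_group(lines))
print(first_ligand_group(lines, resname="LIG"))
```

In this toy input the heuristic returns the glycerol additive (GOL) rather than the intended ligand, which is the failure mode that passing --ligand-resname avoids.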

Relationship to the patent

This implementation is consistent with and disclosed by US provisional patent application no. 64/102,152 ("System and method for predictive modeling of structured systems using a modular harmonic fold map, compatibility-matrix interaction scoring, phase-coherence noise filtering, and multi-scale modular-shell feature decomposition"), filed by William T. L. Miller. The patent's "learned end-to-end" embodiment (as opposed to the calibrated-matrix embodiment used by the private production models) is what this repository implements and releases as open weights.

Commercial use & the production models

For commercial use, higher accuracy, or access to the state-of-the-art production models (CASF-2016 R=0.938, TDC BindingDB_Kd leaderboard results), contact the inventor, William T. L. Miller, directly.

Citation

@software{miller2026millerbindopen,
  author    = {Miller, William T. L.},
  title     = {{MillerBind-Open}: an open-weight reference model for
               protein-ligand binding-affinity prediction},
  year      = {2026},
  license   = {CC-BY-NC-4.0}
}

Running the tests

pip install -r requirements.txt pytest
pytest tests/ -v

9 tests cover the HIN-12 fold map, feature computation, end-to-end prediction, numerical stability, error handling, and a guard test that fails the build if any private module is ever accidentally imported.

Repository layout

millerbind-open-v1/
├── README.md                    # this file (model card)
├── LICENSE.md                   # CC BY-NC 4.0
├── CITATION.cff                 # machine-readable citation
├── CHANGELOG.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── requirements.txt
├── model.joblib                 # trained ExtraTreesRegressor (open weights)
├── feature_names.json           # ordered list of the 297 input features
├── metrics.json                 # held-out metrics + train/test PDB ID split
├── predict.py                   # standalone CLI + Python API
├── example_4dkl.pdb             # sample complex used in the How to Use example
├── 01_build_dataset.py          # reproducible data collection (RCSB/BindingDB)
├── 02_featurize_and_train.py    # reproducible featurization + training
└── tests/
    └── test_predict.py

Files in this repository

File                         Description
model.joblib                 Trained ExtraTreesRegressor (open weights)
feature_names.json           Ordered list of the 297 input feature names
metrics.json                 Held-out evaluation metrics + train/test PDB ID split
predict.py                   Standalone predictor, no dependency on any private code
01_build_dataset.py          Reproducible data-collection script (RCSB/BindingDB)
02_featurize_and_train.py    Reproducible feature engineering + training script
tests/test_predict.py        Test suite (run with pytest tests/ -v)
requirements.txt             Python dependencies
LICENSE.md                   CC BY-NC 4.0 license text
CITATION.cff                 Machine-readable citation metadata
CHANGELOG.md                 Release history
CONTRIBUTING.md              How to contribute
CODE_OF_CONDUCT.md           Community guidelines

The architecture is public so anyone can read it, audit it, retrain it, or build on it. The production calibration and the production models' trained weights remain private under a separate commercial license.
