Built for drug discovery, but I think the architecture can cut LLM hallucination — feedback wanted from AI/agent builders

by williamTLmiller - opened 4 days ago

TL;DR: this repo is a small, honest, fully-reproducible drug-discovery model — but the architecture behind it is a general routing primitive, and I think the same idea has a real shot at reducing LLM/agent hallucination. Looking for pushback from people building agents/routers.

What's actually in this repo

MillerBind-Open v1 predicts protein-ligand binding affinity. It folds every atom into one of 12 "harmonic classes" by atomic number (HIN(Z) = 1 + ((Z-1) mod 12)), builds a raw contact-histogram between protein and ligand atoms, and lets a small tree ensemble learn the interaction patterns end-to-end — no calibrated chemistry knowledge baked in. Trained from scratch on 621 complexes pulled live from RCSB's own public BindingDB data. Full data + training pipeline included, so you can audit or retrain it yourself.
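
To make the featurization concrete, here is a minimal sketch of the fold map and the raw contact histogram described above. It is not the repo's code. The 4.5 Å distance cutoff, the `(Z, xyz)` atom tuple, and the function names `hin` and `contact_histogram` are assumptions for illustration. Only the formula HIN(Z) = 1 + ((Z-1) mod 12) comes from the post.

```python
import math
from typing import List, Sequence, Tuple

N_CLASSES = 12
# Hypothetical atom representation: (atomic number Z, xyz coordinates in angstroms).
Atom = Tuple[int, Tuple[float, float, float]]


def hin(z: int) -> int:
    """Fold atomic number Z into one of 12 classes: HIN(Z) = 1 + ((Z-1) mod 12)."""
    if z < 1:
        raise ValueError("atomic number must be >= 1")
    return 1 + ((z - 1) % N_CLASSES)


def contact_histogram(
    protein: Sequence[Atom],
    ligand: Sequence[Atom],
    cutoff: float = 4.5,  # assumed contact distance, not taken from the repo
) -> List[List[int]]:
    """Count protein-ligand atom pairs within `cutoff`, binned by (protein class, ligand class)."""
    hist = [[0] * N_CLASSES for _ in range(N_CLASSES)]
    for z_p, xyz_p in protein:
        for z_l, xyz_l in ligand:
            if math.dist(xyz_p, xyz_l) <= cutoff:
                hist[hin(z_p) - 1][hin(z_l) - 1] += 1
    return hist


# Toy complex: C, O and a distant S on the protein side; N and C on the ligand side.
protein = [(6, (0.0, 0.0, 0.0)), (8, (1.0, 0.0, 0.0)), (16, (20.0, 0.0, 0.0))]
ligand = [(7, (0.0, 3.0, 0.0)), (6, (2.0, 2.0, 0.0))]

hist = contact_histogram(protein, ligand)
# Flatten the 12x12 matrix into a 144-dimensional feature vector for a tree ensemble.
features = [count for row in hist for count in row]
```

One consequence of the fold is visible directly in the formula: elements 12 apart share a class, so Z=4 and Z=16 both map to class 4. The flattened `features` vector is the kind of input a small tree ensemble could be trained on.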

It's intentionally modest (R ≈ 0.62 on held-out complexes) because it's the open reference implementation, not the production model.

Why I think this matters beyond chemistry

The fold-map + compatibility-scoring + phase-coherence approach isn't chemistry-specific. The same construction (patent-pending, US provisional 64/102,152) generalizes to:

nuclear binding-energy prediction (competitive with established physics models)
battery state-of-health prediction (#1 on a public benchmark)
legal-outcome and math-domain classification

And the part I think AI/agent builders will find interesting: the same structure recurses into a matrix-of-matrices router — every "cell" is both a node and a doorway into a child matrix with its own rules. A query gets mapped into a harmonic class, routed through progressively narrower sub-matrices (task type → subfield → evidence/tools → allowed answer form → verify/abstain), and the model is structurally restricted to its active sub-world unless confidence/evidence gates authorize a transition.

My hypothesis: that gating is a structural mechanism for reducing topic drift and hallucination in LLM/agent pipelines — not a benchmarked claim yet, but a concrete, implementable one. A flat router can jump from "battery chemistry" to "medical advice" in one bad step; a gated, nested router can't, unless it earns the transition.
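
As a rough illustration of the gating idea, the sketch below shows a nested router in which a query can only descend into children of its active cell, and only when a confidence score clears that child's gate. This is not the patent-pending construction or any released implementation. The `Cell` class, the `route` function, the gate thresholds, and the confidence scores are all hypothetical. In a real pipeline the scores would presumably come from a classifier or an LLM.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cell:
    """A node that is also a doorway into a child matrix of narrower cells."""
    name: str
    gate: float = 0.5  # hypothetical minimum confidence required to enter this cell
    children: Dict[str, "Cell"] = field(default_factory=dict)

    def add(self, child: "Cell") -> "Cell":
        self.children[child.name] = child
        return child


def route(root: Cell, proposals: List[Dict[str, float]]) -> List[str]:
    """Descend one level per proposal; abstain when no child passes its gate.

    Each proposal maps candidate cell names to confidence scores. Names that are
    not children of the active cell are ignored, so the router cannot jump into
    a different sub-world in a single step.
    """
    path = [root.name]
    active = root
    for scores in proposals:
        allowed = {
            name: score
            for name, score in scores.items()
            if name in active.children and score >= active.children[name].gate
        }
        if not allowed:
            path.append("ABSTAIN")
            break
        best = max(allowed, key=lambda name: allowed[name])
        active = active.children[best]
        path.append(best)
    return path


root = Cell("root")
science = root.add(Cell("science", gate=0.6))
medical = root.add(Cell("medical", gate=0.9))
science.add(Cell("battery_chemistry", gate=0.6))
medical.add(Cell("medical_advice", gate=0.9))

# Stays inside the science sub-world.
ok = route(root, [{"science": 0.8, "medical": 0.3}, {"battery_chemistry": 0.7}])
# A confident proposal for a cell outside the active sub-world is rejected.
drift = route(root, [{"science": 0.8}, {"medical_advice": 0.95}])
```

In this toy setup `ok` is `["root", "science", "battery_chemistry"]`, while `drift` ends in `"ABSTAIN"` even though the out-of-scope proposal carries a 0.95 score. The sketch omits an explicit "earned transition" back up the tree. Adding one would mean defining what evidence authorizes leaving the active cell, which is the part the hypothesis still needs to pin down.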

What I'm asking

This specific repo is the biology reference model, not the AI-routing implementation. Before I build and release a minimal AI-routing reference implementation of the same idea:

Does this resonate with anyone building agent routers / RAG pipelines / multi-model orchestration today?
Where do you think this breaks? (Latency from nested routing? Cold-start cost of building the gates? Something else?)
Would a small open reference implementation (router only, no domain-specific weights) be useful to you?

Happy to go deep on the math or the failure modes. Genuinely want the pushback before I build the next piece.

— William T. L. Miller
