Starling oral-bioavailability transfer model

Given two molecules (SMILES + study metadata) and molecule A's measured oral bioavailability, this model predicts whether oral-bioavailability behavior transfers from A to B, i.e. whether the two molecules behave similarly under the given study context. It is self-contained: the frozen encoders are bundled with the trained head, so it runs end-to-end on raw inputs.

Architecture

Per molecule (siamese: the same encoders and projections are applied to A and B; only the head is position-aware):

  • Molecule encoder (frozen): ibm-research/MoLFormer-XL-both-10pct (MoLFormer-XL). SMILES → mean-pooled token embedding → 768-d, then a 2-layer MLP (768→1024→768).
  • Metadata encoder (frozen): sentence-transformers/all-MiniLM-L6-v2 (MiniLM). Each of the 7 metadata fields is embedded separately (mean-pooled, L2-normalized) → 384-d, then passed through a learned per-field projection → 64-d (7×64 = 448-d total). A missing/empty field uses a learned per-field "missing" embedding instead of the text embedding, so absent metadata is handled gracefully and stays distinct from any real value.
  • Per molecule = [mol_mlp (768) | metadata (448)] = 1216-d.
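
The widths listed above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch; the constant names are illustrative and are not identifiers from the model's code.

```python
# Width bookkeeping for the per-molecule representation described above.
# All constant names are illustrative, not taken from the model code.
MOL_MLP_OUT = 768   # output width of the 2-layer MLP on the MoLFormer embedding
N_FIELDS = 7        # number of metadata fields
FIELD_PROJ = 64     # width of each per-field projection

metadata_width = N_FIELDS * FIELD_PROJ             # 7 x 64
per_molecule_width = MOL_MLP_OUT + metadata_width  # molecule part + metadata part

print(metadata_width)      # 448
print(per_molecule_width)  # 1216
```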

Pair head:

  • Concatenate [z_A, z_B] (2×1216 = 2432-d) with molecule A's bioavailability scalar (value_A / 100) → 2433-d input.
  • A pre-norm residual SwiGLU MLP (32 blocks, width 1024, FFN 4096) → one logit.
  • sigmoid(logit) = P(transfer). ~407M trainable params; encoders frozen.
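
As a sanity check on the figures above, the trainable parameter count can be estimated from the stated widths. The sketch below rests on assumptions the card does not confirm: each SwiGLU block has three weight matrices (gate, up, down), biases and normalization parameters are ignored, and the molecule MLP and per-field projections are counted as trainable.

```python
# Rough trainable-parameter estimate under the assumptions named in the lead-in.
# None of these names come from the model code.
PER_MOLECULE = 1216
WIDTH, FFN, BLOCKS = 1024, 4096, 32

pair_input = 2 * PER_MOLECULE + 1     # [z_A, z_B] plus the value_A / 100 scalar

swiglu_block = 3 * WIDTH * FFN        # assumed: gate + up + down matrices
trunk = BLOCKS * swiglu_block
input_proj = pair_input * WIDTH       # assumed: one linear layer into the trunk
output_layer = WIDTH * 1              # assumed: one linear layer to the logit
mol_mlp = 768 * 1024 + 1024 * 768     # 2-layer MLP, 768 -> 1024 -> 768
field_proj = 7 * 384 * 64             # per-field 384 -> 64 projections

total = trunk + input_proj + output_layer + mol_mlp + field_proj
print(pair_input)                     # 2433
print(round(total / 1e6, 1))          # 406.9
```

Under these assumptions the estimate lands close to the ~407M quoted above, with the residual trunk accounting for almost all of it.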

Metadata fields (order matters)

molecule_name, species_or_population, dose, oral_exposure_mode, qualifying_conditions, comparator, extra_details

Pass a dict per molecule keyed by these names. To mark a field as missing, omit its key or pass None/""; the model then uses its learned per-field "missing" embedding.
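
A small pre-processing helper can make the field order and the missing-value rule explicit before calling the model. The function below is hypothetical and not part of the model's API. Rejecting unknown keys is a choice made for this sketch, so that a typo such as "species" does not silently become a missing field.

```python
# Canonical field order, as listed in the model card.
FIELDS = [
    "molecule_name", "species_or_population", "dose", "oral_exposure_mode",
    "qualifying_conditions", "comparator", "extra_details",
]

def normalize_metadata(meta):
    """Return the 7 field values in canonical order; None marks a missing field.

    Hypothetical helper for illustration only.
    """
    unknown = set(meta) - set(FIELDS)
    if unknown:
        raise KeyError(f"unknown metadata fields: {sorted(unknown)}")
    values = []
    for name in FIELDS:
        value = meta.get(name)
        # Omitted keys, None, and "" all count as missing.
        values.append(None if value in (None, "") else value)
    return values

print(normalize_metadata({"species_or_population": "human", "dose": ""}))
# [None, 'human', None, None, None, None, None]
```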

Usage

import torch
from transformers import AutoModel

# trust_remote_code=True is needed because the architecture is defined in the model repo.
m = AutoModel.from_pretrained("jiosephlee/starling-transfer-ssv2-srcval", trust_remote_code=True).eval()

with torch.no_grad():                                # inference only, no gradients needed
    out = m(
        smiles_a=["CC(=O)Oc1ccccc1C(=O)O"],          # molecule A (bioavailability known)
        smiles_b=["CCO"],                            # molecule B (candidate)
        metadata_a=[{"species_or_population": "human", "dose": "325 mg", "oral_exposure_mode": "tablet"}],
        metadata_b=[{"species_or_population": "human"}],   # missing fields are fine
        source_value=[68.0],                         # molecule A's RAW oral_bioavailability_value (e.g. percent)
    )
p_transfer = out.logits.sigmoid()                    # batched: pass parallel lists for many pairs

source_value is molecule A's raw oral_bioavailability_value (e.g. 68.0 for 68%); the model divides it by 100 internally, so do not pre-scale it. All inputs are batched lists of equal length.
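
When pairs are stored as records, the parallel lists can be assembled with a small helper. The record layout and the function name below are assumptions made for this sketch. Only the keyword names mirror the call shown in the usage example.

```python
def build_batch(pairs):
    """Turn a list of pair records into equal-length parallel lists.

    Each record is assumed to be a dict with the keys smiles_a, smiles_b,
    source_value and, optionally, metadata_a / metadata_b.
    """
    batch = {"smiles_a": [], "smiles_b": [], "metadata_a": [],
             "metadata_b": [], "source_value": []}
    for pair in pairs:
        for key in batch:
            if key.startswith("metadata"):
                # Absent metadata becomes an empty dict, i.e. all fields missing.
                batch[key].append(pair.get(key) or {})
            else:
                batch[key].append(pair[key])  # required; raises KeyError if absent
    # Keep source_value raw (e.g. 68.0 for 68%); no division by 100 here.
    batch["source_value"] = [float(v) for v in batch["source_value"]]
    return batch

pairs = [
    {"smiles_a": "CC(=O)Oc1ccccc1C(=O)O", "smiles_b": "CCO",
     "metadata_a": {"species_or_population": "human"}, "source_value": 68},
    {"smiles_a": "CCO", "smiles_b": "CCN", "source_value": 12.5},
]
batch = build_batch(pairs)
print(len(batch["smiles_a"]), batch["source_value"])  # 2 [68.0, 12.5]
```

Because the keys match the keyword names in the usage example, the result could then be passed as m(**batch).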

Training & performance

Trained on the same_species_v2 oral-bioavailability transfer split (~338M molecule pairs). The frozen embeddings are precomputed once and the head is trained on top. The label is a thresholded |value_A - value_B|, so the model uses A's known value as an anchor and in effect learns to estimate B's bioavailability relative to that anchor from B's structure and metadata.
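
The labeling rule can be written out as a one-line function. The card does not state the threshold, whether the boundary is inclusive, or which side counts as the positive class. The sketch assumes "within the threshold" means transfer, and the value 20.0 is made up purely for illustration.

```python
def transfer_label(value_a, value_b, threshold):
    """Return 1 if the two bioavailability values differ by at most `threshold`.

    Assumptions for this sketch: the positive class means "similar", and the
    boundary is inclusive. Neither detail is confirmed by the model card.
    """
    return int(abs(value_a - value_b) <= threshold)

EXAMPLE_THRESHOLD = 20.0  # hypothetical value, in the same units as the raw values

print(transfer_label(68.0, 60.0, EXAMPLE_THRESHOLD))  # 1
print(transfer_label(68.0, 10.0, EXAMPLE_THRESHOLD))  # 0
```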

  • same_species_v2 validation: AUROC ~0.87, accuracy ~0.83, macro-F1 ~0.79
  • tianang (cross-dataset) validation: AUROC ~0.95, accuracy ~0.91, macro-F1 ~0.89 (test: AUROC ~0.95)

Checkpoint: Safetensors, 0.5B params, F32 tensors.