Starling oral-bioavailability transfer model

Given two molecules (SMILES + study metadata) and molecule A's measured oral bioavailability, this model predicts whether oral-bioavailability behavior transfers from A to B — i.e. whether the two molecules behave similarly under the given study context. It is self-contained: the frozen encoders are bundled with the trained head, so it runs end-to-end on raw inputs.

Architecture

Per molecule (siamese — the same encoders + projections are applied to A and B; only the head is position-aware):

Molecule encoder: ibm-research/MoLFormer-XL-both-10pct (MoLFormer-XL), frozen: SMILES → mean-pooled token embedding → 768-d, then a 2-layer MLP (768→1024→768).
Metadata encoder: sentence-transformers/all-MiniLM-L6-v2 (MiniLM), frozen: each of the 7 metadata fields is embedded separately (mean-pooled, L2-normalized) → 384-d, then a learned per-field projection → 64-d (7×64 = 448-d total). A missing/empty field uses a learned per-field "missing" embedding instead of the text embedding, so absent metadata is handled gracefully and kept distinct from any real value.
Per molecule = [mol_mlp (768) | metadata (448)] = 1216-d.
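
The per-molecule layout above can be illustrated with a toy, dependency-free sketch. It assumes the learned "missing" embedding replaces the field's projected 64-d vector (the card does not say whether the substitution happens before or after the projection). The function assemble_molecule_vector and the constant placeholder vectors are hypothetical stand-ins for the real encoder outputs, not the model's actual code.

```python
# Field order as listed in this card; the order fixes each field's slot.
FIELDS = [
    "molecule_name", "species_or_population", "dose", "oral_exposure_mode",
    "qualifying_conditions", "comparator", "extra_details",
]
MOL_DIM, FIELD_DIM = 768, 64


def assemble_molecule_vector(mol_vec, field_vecs, missing_vecs):
    """Concatenate [mol_mlp output | 7 projected fields] in the fixed field order.

    field_vecs maps a field name to its projected 64-d vector; a field that is
    absent (or None) falls back to that field's own "missing" vector.
    """
    assert len(mol_vec) == MOL_DIM
    out = list(mol_vec)
    for name in FIELDS:
        vec = field_vecs.get(name)
        if vec is None:
            vec = missing_vecs[name]  # assumption: substitution in 64-d space
        assert len(vec) == FIELD_DIM
        out.extend(vec)
    return out


# Toy placeholders: zeros for the molecule, -1 for "missing", +1 for a real field.
mol_vec = [0.0] * MOL_DIM
missing_vecs = {name: [-1.0] * FIELD_DIM for name in FIELDS}
field_vecs = {"species_or_population": [1.0] * FIELD_DIM}

z = assemble_molecule_vector(mol_vec, field_vecs, missing_vecs)
print(len(z))  # 768 + 7 * 64 = 1216
```

Because each field owns a fixed 64-d slot, a missing field never shifts the position of the others, which is why the field order matters.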

Pair head:

Concatenate [z_A, z_B] (2×1216) + molecule A's bioavailability scalar (value_A / 100) → 2433-d input.
A pre-norm residual SwiGLU MLP (32 blocks, width 1024, FFN 4096) → one logit.
sigmoid(logit) = P(transfer). ~407M trainable params; encoders frozen.
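
As a sanity check on the dimensions and the ~407M figure, the arithmetic can be reproduced in a few lines. This is a rough estimate under stated assumptions rather than the model's exact layout: linear layers are counted without biases, normalization parameters and the "missing" embeddings are ignored, the 2433-d input is assumed to be mapped to width 1024 by a single linear layer, and each SwiGLU block is assumed to hold three matrices (gate, up, down).

```python
# Sizes taken from the architecture description in this card.
D_MOL, D_MOL_HIDDEN = 768, 1024
N_FIELDS, D_FIELD_IN, D_FIELD_OUT = 7, 384, 64
WIDTH, FFN, N_BLOCKS = 1024, 4096, 32

d_molecule = D_MOL + N_FIELDS * D_FIELD_OUT  # [mol_mlp | metadata]
d_pair_in = 2 * d_molecule + 1               # [z_A, z_B] + value_A / 100

# Weight-only counts; biases and norm parameters are ignored (assumption).
params = {
    "mol_mlp": D_MOL * D_MOL_HIDDEN + D_MOL_HIDDEN * D_MOL,
    "field_projections": N_FIELDS * D_FIELD_IN * D_FIELD_OUT,
    "head_input_projection": d_pair_in * WIDTH,   # assumed single linear layer
    "swiglu_blocks": N_BLOCKS * 3 * WIDTH * FFN,  # assumed gate + up + down
    "output_logit": WIDTH,
}
total = sum(params.values())

print(d_molecule, d_pair_in, f"{total / 1e6:.1f}M")  # 1216 2433 406.9M
```

Under these assumptions the 32 SwiGLU blocks account for about 402.7M of the total, so nearly all trainable capacity sits in the pair head.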

Metadata fields (order matters)

molecule_name, species_or_population, dose, oral_exposure_mode, qualifying_conditions, comparator, extra_details

Pass a dict per molecule keyed by these names. Omit a key, or pass None/"", for a missing field — the model then uses its learned per-field "missing" embedding.
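
A small helper can make the missing-field convention explicit before calling the model. The sketch below is hypothetical (normalize_metadata is not part of the released code): it fills in all 7 keys in the documented order and maps None/"" to None. Rejecting unknown keys is a choice of this helper to catch typos, not documented model behavior.

```python
FIELDS = (
    "molecule_name", "species_or_population", "dose", "oral_exposure_mode",
    "qualifying_conditions", "comparator", "extra_details",
)


def normalize_metadata(meta):
    """Return a dict with all 7 keys in model order; missing values become None."""
    unknown = set(meta) - set(FIELDS)
    if unknown:
        # Helper-level guard against misspelled keys (not a model requirement).
        raise KeyError(f"unknown metadata fields: {sorted(unknown)}")
    clean = {}
    for name in FIELDS:
        value = meta.get(name)
        clean[name] = None if value in (None, "") else value
    return clean


meta = normalize_metadata({"dose": "325 mg", "comparator": ""})
print(meta["dose"], meta["comparator"], meta["molecule_name"])  # 325 mg None None
```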

Usage

from transformers import AutoModel
m = AutoModel.from_pretrained("jiosephlee/starling-transfer-ssv2-srcval", trust_remote_code=True).eval()

out = m(
    smiles_a=["CC(=O)Oc1ccccc1C(=O)O"],          # molecule A (bioavailability known)
    smiles_b=["CCO"],                            # molecule B (candidate)
    metadata_a=[{"species_or_population": "human", "dose": "325 mg", "oral_exposure_mode": "tablet"}],
    metadata_b=[{"species_or_population": "human"}],   # missing fields are fine
    source_value=[68.0],                         # molecule A's RAW oral_bioavailability_value (e.g. percent)
)
p_transfer = out.logits.sigmoid()                # batched: pass parallel lists for many pairs

source_value is molecule A's raw oral_bioavailability_value (e.g. a percentage such as 68.0); the model divides it by 100 internally, so do not pre-scale it. All inputs are batched lists of equal length.
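
For many pairs, the parallel lists can be built from per-pair records. The following is a minimal sketch with hypothetical helper and record-key names (build_batch, value_a); it also assumes an empty dict is an acceptable way to say "all metadata missing", which follows from the omit-a-key rule but is not stated explicitly.

```python
def build_batch(pairs):
    """Turn per-pair records into the parallel lists used in the usage snippet."""
    batch = {
        "smiles_a": [], "smiles_b": [],
        "metadata_a": [], "metadata_b": [],
        "source_value": [],
    }
    for p in pairs:
        batch["smiles_a"].append(p["smiles_a"])
        batch["smiles_b"].append(p["smiles_b"])
        # Assumption: {} means every metadata field is missing.
        batch["metadata_a"].append(p.get("metadata_a") or {})
        batch["metadata_b"].append(p.get("metadata_b") or {})
        # Raw value (e.g. percent); the model divides by 100 itself.
        batch["source_value"].append(float(p["value_a"]))
    return batch


pairs = [
    {"smiles_a": "CC(=O)Oc1ccccc1C(=O)O", "smiles_b": "CCO",
     "metadata_a": {"species_or_population": "human", "dose": "325 mg"},
     "metadata_b": {"species_or_population": "human"}, "value_a": 68},
    {"smiles_a": "CCO", "smiles_b": "CC(=O)Oc1ccccc1C(=O)O", "value_a": 80.0},
]
batch = build_batch(pairs)
print({k: len(v) for k, v in batch.items()})  # every list has length 2
# out = m(**batch)  # with m loaded as in the usage snippet above
```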

Training & performance

Trained on the same_species_v2 oral-bioavailability transfer split (~338M molecule pairs); the frozen embeddings are precomputed once and the head is trained on top. The binary label is a thresholded |value_A - value_B|, so the model uses A's known value as an anchor and must implicitly estimate B's bioavailability from its structure + metadata to decide whether the two values are close.
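
The labeling rule can be written out as a one-line function. This is only an illustration: the card does not state the threshold or which side counts as positive, so both the 20-point threshold in the example and the "within threshold means transfer" direction are assumptions, and transfer_label is a hypothetical name.

```python
def transfer_label(value_a, value_b, threshold):
    """Return 1 if the two bioavailability values differ by at most `threshold`.

    Assumption: a small difference is the positive ("transfers") class.
    """
    return int(abs(value_a - value_b) <= threshold)


THRESHOLD = 20.0  # purely illustrative; the real threshold is not given here

print(transfer_label(68.0, 75.0, THRESHOLD))  # 1: values are 7 points apart
print(transfer_label(68.0, 10.0, THRESHOLD))  # 0: values are 58 points apart
```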

same_species_v2 validation: AUROC ~0.87, accuracy ~0.83, macro-F1 ~0.79
tianang (cross-dataset) validation: AUROC ~0.95, accuracy ~0.91, macro-F1 ~0.89 (test: AUROC ~0.95)
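
For readers who want to compute the same three metrics on their own predictions, here is a dependency-free sketch on toy data. It is not the evaluation code behind the numbers above; the 0.5 decision threshold used for accuracy and macro-F1 is an assumption, and the pairwise AUROC is written for clarity rather than speed.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    # Unweighted mean of the per-class F1 scores for classes 0 and 1.
    return (f1_for_class(y_true, y_pred, 0) + f1_for_class(y_true, y_pred, 1)) / 2


def auroc(y_true, scores):
    # Probability that a random positive outscores a random negative (ties = 0.5).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and P(transfer) scores, not real model outputs.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

print(round(auroc(y_true, scores), 3),
      round(accuracy(y_true, y_pred), 3),
      round(macro_f1(y_true, y_pred), 3))  # 0.667 0.667 0.667
```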

Safetensors: model size 0.5B params; tensor type F32