sctherapy-artifacts

Trained checkpoints, derived data, and evaluation results for the sctherapy_pytorch project — predicting drug response (% inhibition) from gene expression, benchmarked on a 12-patient AML cohort from the scTherapy paper.

Code: Tino3141/sctherapy_pytorch (FT-Transformer, GRANDE, LightGBM baseline), which holds the training/eval code, scripts, and configs that produced these artifacts. The artifacts themselves are self-contained in this repo: checkpoints, the full training set, both eval cohorts, and results are all here. (Training data is also mirrored at Tino3141/lincs-pharmaco-training / Tino3141/pharmaco_flat.)

The Python snippets below (from eval.model_registry import ..., from src.model.lgbm import ...) assume you're running inside a clone of the GitHub repo; this repo holds the artifacts, not the code.

Checkpoints

Neural (`checkpoints/neural/`) — the exact checkpoints used in the AML eval

Pulled from Weights & Biases (cpinkl/pmlr-sctherapy). These are the exact per-epoch checkpoints behind the per-seed predictions in results/aml_evals/, not the final best-val (epoch-10) snapshots. The eval used an earlier epoch (6–9) for each run; the filename encodes that epoch, and each .pt's stored epoch field matches it. The sweep is complete: 3 seeds {42, 43, 44} × 4 variants = 12 files.

File	Arch	Head	Seed	Epoch used	W&B run	val_rmse @ epoch
`ft_transformer_hill__s42__epoch7__run-mwwp1nej.pt`	FT-Transformer	Hill	42	7	mwwp1nej	10.87
`ft_transformer_hill__s43__epoch7__run-k36em957.pt`	FT-Transformer	Hill	43	7	k36em957	10.81
`ft_transformer_hill__s44__epoch7__run-dnyzw2jk.pt`	FT-Transformer	Hill	44	7	dnyzw2jk	10.72
`ft_transformer_scalar__s42__epoch7__run-lv6eowft.pt`	FT-Transformer	Scalar	42	7	lv6eowft	11.01
`ft_transformer_scalar__s43__epoch9__run-1lx7r1vp.pt`	FT-Transformer	Scalar	43	9	1lx7r1vp	10.91
`ft_transformer_scalar__s44__epoch8__run-5mtffjfa.pt`	FT-Transformer	Scalar	44	8	5mtffjfa	10.74
`grande_hill__s42__epoch8__run-fhszu55u.pt`	GRANDE	Hill	42	8	fhszu55u	11.47
`grande_hill__s43__epoch8__run-0vsn6313.pt`	GRANDE	Hill	43	8	0vsn6313	11.56
`grande_hill__s44__epoch6__run-xutm14sz.pt`	GRANDE	Hill	44	6	xutm14sz	11.72
`grande_scalar__s42__epoch7__run-izgn7e8j.pt`	GRANDE	Scalar	42	7	izgn7e8j	30.78
`grande_scalar__s43__epoch6__run-iqngqwll.pt`	GRANDE	Scalar	43	6	iqngqwll	31.00
`grande_scalar__s44__epoch8__run-k2bepvf1.pt`	GRANDE	Scalar	44	8	k2bepvf1	30.45
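
Because the filenames encode architecture, head, seed, epoch, and W&B run id, they can be split mechanically. The following is a minimal sketch: the regular expression is inferred from the twelve filenames above, and the mapping to registry keys (`ft_hill`, `grande_scalar`, ...) is an assumption based on the key names listed in the Usage section, not something taken from the project's code.

```python
import re
from typing import NamedTuple

# Pattern inferred from the filenames in the table above (illustrative only).
_CKPT_RE = re.compile(
    r"^(?P<arch>ft_transformer|grande)_(?P<head>hill|scalar)"
    r"__s(?P<seed>\d+)__epoch(?P<epoch>\d+)__run-(?P<run_id>[0-9a-z]+)\.pt$"
)

# Assumed mapping from the filename's arch prefix to the registry key prefix.
_REGISTRY_PREFIX = {"ft_transformer": "ft", "grande": "grande"}


class CheckpointName(NamedTuple):
    arch: str
    head: str
    seed: int
    epoch: int
    run_id: str

    @property
    def registry_key(self) -> str:
        """Registry key such as 'ft_hill' (assumed naming, see lead-in)."""
        return f"{_REGISTRY_PREFIX[self.arch]}_{self.head}"


def parse_checkpoint_name(filename: str) -> CheckpointName:
    """Split an archived checkpoint filename into its encoded fields."""
    match = _CKPT_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised checkpoint filename: {filename!r}")
    return CheckpointName(
        arch=match["arch"],
        head=match["head"],
        seed=int(match["seed"]),
        epoch=int(match["epoch"]),
        run_id=match["run_id"],
    )


info = parse_checkpoint_name("ft_transformer_scalar__s43__epoch9__run-1lx7r1vp.pt")
print(info, info.registry_key)
```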

Each .pt is a dict with model_state_dict, epoch, val_rmse. Load with the project's eval/model_registry.py (state-dict introspection rebuilds the architecture — no separate config needed), or via build_model after reading the shapes.
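
As a quick sanity check before loading, the stored `epoch` field can be compared with the epoch in the filename. This is a sketch only: `check_stored_epoch` is a hypothetical helper, and the plain dict below stands in for what `torch.load(path, map_location="cpu")` would return (its `epoch` and `val_rmse` are the table values for run k36em957; the empty state dict is a placeholder).

```python
import re
from typing import Any, Mapping

REQUIRED_KEYS = {"model_state_dict", "epoch", "val_rmse"}


def check_stored_epoch(filename: str, checkpoint: Mapping[str, Any]) -> int:
    """Return the epoch if the filename and the stored field agree, else raise."""
    match = re.search(r"__epoch(\d+)__", filename)
    if match is None:
        raise ValueError(f"no epoch encoded in filename: {filename!r}")
    expected = int(match.group(1))

    missing = REQUIRED_KEYS - set(checkpoint)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")

    stored = int(checkpoint["epoch"])
    if stored != expected:
        raise ValueError(f"filename says epoch {expected}, checkpoint says {stored}")
    return stored


# Stand-in for a loaded checkpoint; in practice this would come from torch.load.
checkpoint = {"model_state_dict": {}, "epoch": 7, "val_rmse": 10.81}
epoch = check_stored_epoch(
    "ft_transformer_hill__s43__epoch7__run-k36em957.pt", checkpoint
)
print(epoch)
```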

Why these epochs (not best-val): the evaluated checkpoint was chosen using both the mean and the standard deviation of validation performance, favouring a stable epoch over the single lowest-mean snapshot. The selected epoch (6–9) therefore gives up a marginally lower mean val_rmse in exchange for lower variance.
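
The exact selection rule is not spelled out here, so the following is only one possible reading of a mean-plus-variance criterion: score each epoch as `mean + k * std` over repeated validation measurements and take the minimum. The weight `k`, the choice of population standard deviation, and every number in `toy_curves` are made up for illustration and are not taken from the actual runs.

```python
from statistics import mean, pstdev
from typing import Dict, List


def pick_stable_epoch(val_rmse_by_epoch: Dict[int, List[float]], k: float = 1.0) -> int:
    """Pick the epoch minimising mean + k * std of its validation RMSE values."""

    def score(epoch: int) -> float:
        values = val_rmse_by_epoch[epoch]
        return mean(values) + k * pstdev(values)

    return min(val_rmse_by_epoch, key=score)


# Hypothetical repeated validation measurements per epoch (NOT real results).
toy_curves = {
    7: [10.90, 10.80, 10.85],   # slightly higher mean, very stable
    10: [10.30, 10.90, 11.00],  # lowest mean, but noisy
}

lowest_mean_epoch = min(toy_curves, key=lambda e: mean(toy_curves[e]))
stable_epoch = pick_stable_epoch(toy_curves, k=1.0)
print(lowest_mean_epoch, stable_epoch)
```

With these toy values the lowest-mean epoch and the stability-aware choice differ, which is the trade-off the paragraph above describes.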

Recovering other snapshots: the absolute best-val epoch (usually epoch 10) scores marginally lower mean val_rmse (FT-hill ~10.5–10.6, GRANDE-hill ~11.3–11.6). To pull it instead, fetch cpinkl/pmlr-sctherapy/model-<runid>:best (alias best → epoch 10). Any other epoch: model-<runid>:vN where v0=epoch1 … v9=epoch10.
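
The alias scheme above (`best`, or `vN` with v0 = epoch 1) can be turned into an artifact reference with a small helper. This is a sketch: `artifact_ref` is a hypothetical function name, and the 1–10 epoch range is assumed from the v0–v9 mapping stated above.

```python
from typing import Optional

WANDB_PROJECT = "cpinkl/pmlr-sctherapy"


def artifact_ref(run_id: str, epoch: Optional[int] = None) -> str:
    """Build the W&B artifact reference for a run.

    epoch=None selects the `best` alias; otherwise epoch N maps to version v(N-1).
    """
    if epoch is None:
        alias = "best"
    elif 1 <= epoch <= 10:
        alias = f"v{epoch - 1}"
    else:
        raise ValueError("epoch must be between 1 and 10")
    return f"{WANDB_PROJECT}/model-{run_id}:{alias}"


ref_epoch7 = artifact_ref("k36em957", epoch=7)
ref_best = artifact_ref("k36em957")
print(ref_epoch7)
print(ref_best)

# With the wandb client installed and access to the project, the reference
# could then be fetched with something like:
#   wandb.Api().artifact(ref_epoch7).download()
```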

LightGBM (`checkpoints/lgbm/`)

Path	Notes
`cell_seed42_ts0.8_nb1000_es50/model.txt`	Cell-split baseline (held-out cell lines), 1000 rounds, ES@50. Headline LGBM used in the registry. Val RMSE ≈ 6.96, AUC ≈ 0.70 on the 12-patient benchmark.
`lgbm_baseline_model.txt`	Alternative local LGBM.

Usage

The neural .pt files are PyTorch checkpoint dicts ({model_state_dict, epoch, val_rmse}): weights plus two metadata fields, with no architecture definition. The GitHub repo's eval/model_registry.py re-infers the architecture from the tensor shapes, so loading needs nothing but the checkpoint and a registry key (ft_scalar, ft_hill, grande_scalar, grande_hill, or lgbm; the filename tells you which).

```python
import torch, numpy as np
from huggingface_hub import hf_hub_download
from eval.model_registry import REGISTRY, load_predictor

# Download one of the archived checkpoints (best FT-hill, seed 43, epoch 7)
ckpt = hf_hub_download(
    "Tino3141/sctherapy-artifacts",
    "checkpoints/neural/ft_transformer_hill__s43__epoch7__run-k36em957.pt",
)

pred = load_predictor(
    REGISTRY["ft_hill"],                 # key must match the checkpoint's arch/head
    device=torch.device("cpu"),
    checkpoint_override=ckpt,
)

# Inputs: gene z-scores (N, 978), ECFP4 bits (N, 1024), dose in µM (N,)
gene  = np.zeros((2, 978), dtype=np.float32)
ecfp4 = np.zeros((2, 1024), dtype=np.float32)
dose  = np.array([1.0, 10.0], dtype=np.float32)

y = pred.predict(gene, ecfp4, dose)      # → predicted % inhibition, shape (N,)
print(y)
```

LightGBM (model.txt) loads directly:

```python
from src.model.lgbm import LGBMDrugPredictor

# Path is relative to checkpoints/lgbm/ in this repo
lgbm = LGBMDrugPredictor.load("cell_seed42_ts0.8_nb1000_es50/model.txt")
# feature layout: np.hstack([gene(978), ecfp4(1024), dose(1)])
```
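
The flat feature layout noted above can be assembled and shape-checked as follows. This is a sketch: `assemble_features` is a hypothetical helper, and it passes the dose through unchanged because any dose transform the predictor may apply is not documented here.

```python
import numpy as np

N_GENES, N_BITS = 978, 1024


def assemble_features(gene, ecfp4, dose) -> np.ndarray:
    """Stack [gene(978), ecfp4(1024), dose(1)] into one (N, 2003) float32 matrix."""
    gene = np.asarray(gene, dtype=np.float32)
    ecfp4 = np.asarray(ecfp4, dtype=np.float32)
    dose = np.asarray(dose, dtype=np.float32).reshape(-1, 1)

    if gene.ndim != 2 or gene.shape[1] != N_GENES:
        raise ValueError(f"gene must have shape (N, {N_GENES}), got {gene.shape}")
    if ecfp4.ndim != 2 or ecfp4.shape[1] != N_BITS:
        raise ValueError(f"ecfp4 must have shape (N, {N_BITS}), got {ecfp4.shape}")
    if not (gene.shape[0] == ecfp4.shape[0] == dose.shape[0]):
        raise ValueError("gene, ecfp4 and dose must have the same number of rows")

    return np.hstack([gene, ecfp4, dose])


# Two placeholder samples, mirroring the all-zero inputs used in the neural example.
features = assemble_features(
    np.zeros((2, N_GENES)), np.zeros((2, N_BITS)), [1.0, 10.0]
)
print(features.shape)
```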

The AML patient inputs are in eval/aml_eval/model_inputs/, and the ground-truth DSS to score against is in eval/aml_eval/ground_truth/. The separate multi-cancer held-out set is in eval/zenodo_eval/. derived/gene_names.json gives the gene column order, and derived/lincs_gene_stats.npz holds the per-gene (mean, std) used to z-score new expression data. To download everything at once: hf download Tino3141/sctherapy-artifacts --local-dir ./sctherapy-artifacts.

Derived data (`derived/`)

Small generated files needed to reproduce inference (not the raw sources):

File	What
`gene_names.json`	978 LINCS landmark gene symbols, in feature order
`feature_names.json`	Full flat-feature names `[genes
`lincs_gene_stats.npz`	Per-gene LINCS (mean, std) for z-scoring patient inputs
`matched_samples.parquet`	LINCS×PharmacoDB matched `(cell, drug)` rows
`training_data.parquet`	Full training set — 3.24M rows `[sig_id, cell_iname, smiles, dose, inhibition, gene_expression(978), ecfp4(1024)]` (1.5 GB)

The full training parquet is included here so the repo is self-contained; it is also mirrored at Tino3141/lincs-pharmaco-training / Tino3141/pharmaco_flat.
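
To z-score new expression data with `gene_names.json` and `lincs_gene_stats.npz`, columns first have to be put into the reference gene order. A minimal sketch follows, using toy gene names and statistics in place of the real files; `align_and_zscore` is a hypothetical helper, the key names inside the `.npz` are not documented here, and raising on missing genes is a design choice of this sketch rather than the project's behaviour.

```python
import numpy as np


def align_and_zscore(expr, input_genes, reference_genes, mean, std, eps=1e-8):
    """Reorder columns to the reference gene order, then apply (x - mean) / std."""
    position = {gene: i for i, gene in enumerate(input_genes)}
    missing = [gene for gene in reference_genes if gene not in position]
    if missing:
        raise KeyError(f"input is missing {len(missing)} reference genes, e.g. {missing[:3]}")

    columns = [position[gene] for gene in reference_genes]
    aligned = np.asarray(expr, dtype=np.float64)[:, columns]
    # Guard against zero standard deviation with a small floor.
    safe_std = np.maximum(np.asarray(std, dtype=np.float64), eps)
    return ((aligned - np.asarray(mean, dtype=np.float64)) / safe_std).astype(np.float32)


# Toy stand-ins: the real reference order has 978 genes (gene_names.json).
reference_genes = ["GENE_A", "GENE_B", "GENE_C"]
toy_mean = np.array([1.0, 1.0, 1.0])
toy_std = np.array([1.0, 2.0, 4.0])

# One sample whose columns arrive in a different order.
input_genes = ["GENE_C", "GENE_A", "GENE_B"]
expr = np.array([[3.0, 1.0, 2.0]])

z = align_and_zscore(expr, input_genes, reference_genes, toy_mean, toy_std)
print(z)
```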

Eval datasets (`eval/`) — two separate cohorts

`eval/aml_eval/` — the 12-patient AML benchmark

The primary patient benchmark (scTherapy paper; consolidated from the public Tino3141/aml12-drug-response-eval dataset). Patients patient1 … patient12.

Path	What
`ground_truth/dss.csv` / `.parquet`	Ground-truth Drug Sensitivity Scores per (patient, drug) — the eval target
`ground_truth/supplementary_scTherapy.xlsx`	scTherapy supplementary Excel the DSS were extracted from
`model_inputs/full_inputs.parquet`	Assembled (patient × drug) model inputs
`model_inputs/patient_deg_vectors.parquet`	978-gene log2FC vector per patient
`model_inputs/drug_fingerprints.parquet`	1024-bit ECFP4 per drug
`model_inputs/example_full_input.parquet`	Small worked example of the input layout
`degs/`	Differential-expression gene sets per patient (raw + filtered + meta)

`eval/zenodo_eval/` — multi-cancer held-out set (NOT AML)

A separate single-cell drug-response set from public GEO/Zenodo studies — HNSCC (SCC47, JHU006, HN120/137), lung (PC9, H1975), CLL, HCC, breast (FCIBC02), etc., across 10 drugs. Each pseudo-bulk carries a binary sensitive/resistant label. Built from the ../tino/*.pt source (named after the author).

Path	What
`expression_zscore.parquet`	Dose-expanded expression, LINCS z-score (built by `scripts/build_zenodo_zscore.py`)
`expression_logfc.parquet`	Dose-expanded expression, LINCS log2FC (built by `scripts/build_zenodo_logfc.py`)
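
Since each pseudo-bulk has a binary sensitive/resistant label, a ranking metric such as ROC AUC is a natural way to score predictions on this cohort. The following is a dependency-free sketch of the pairwise (Mann-Whitney) definition of AUC. It assumes that a higher score means "more likely sensitive" and that labels are coded 1 = sensitive, 0 = resistant; neither convention is confirmed here, and this is not the project's evaluation code.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy scores and labels (illustrative, not from the Zenodo cohort).
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(auc)
```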

`eval/reference/` — shared lookups

Path	What
`lincs_landmark_genes.csv`	The 978 LINCS landmark genes
`drug_name_to_smiles.json`	Drug-name → canonical SMILES map

Each folder has its own README.md. Raw AML patient .RDS files (~5 GB) are not here — re-downloadable from Zenodo 13340927.

Results (`results/`)

Per-drug / per-patient prediction CSVs, metric tables, and plots from the headline runs (exploratory/superseded variants were pruned):

Dir	Cohort	What
`aml_evals/`	AML	per-arch/head/seed AML predictions — the main neural-vs-LGBM comparison (filenames encode the epoch used)
`aml_lgbm_seed42/`	AML	LGBM baseline AML eval (summary, per-drug, per-patient AUC)
`zenodo_cluster/`	Zenodo	run with all 5 models in one `aggregated.csv` + ROC plots
`zenodo_logfc_all_final_final/`	Zenodo	final full Zenodo-cohort run
`zenodo_logfc_within_sweep/`	Zenodo	within-file hyperparameter sweep grid
`split_comparison/`	—	random vs cell vs drug vs cell_drug split analysis (the data-leakage finding)
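
The difference between a random split and a cell split comes down to whether whole groups are held out. The sketch below shows a group-wise split keyed on `cell_iname` (the column name from the training parquet); `group_split`, the 0.8 train fraction, and the toy rows are illustrative assumptions, not the project's splitting code.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, str]


def group_split(
    rows: List[Row], key: str, train_frac: float = 0.8, seed: int = 42
) -> Tuple[List[Row], List[Row]]:
    """Split rows so that every value of `key` lands entirely in train or in val."""
    groups = sorted({row[key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_train = max(1, int(round(train_frac * len(groups))))
    train_groups = set(groups[:n_train])

    train = [row for row in rows if row[key] in train_groups]
    val = [row for row in rows if row[key] not in train_groups]
    return train, val


# Toy rows: 5 hypothetical cell lines x 2 hypothetical drugs.
rows = [
    {"cell_iname": f"CELL_{c}", "drug": f"DRUG_{d}"}
    for c in range(5)
    for d in range(2)
]

train, val = group_split(rows, key="cell_iname")
train_cells = {row["cell_iname"] for row in train}
val_cells = {row["cell_iname"] for row in val}
print(len(train), len(val), train_cells & val_cells)
```

Passing `key="drug"` would give the corresponding drug split; a purely random row-level split offers no such guarantee, so the same cell line can appear on both sides.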
