SomaScan → Alzheimer's-phenotype models: classification (distilled TabM students, TabPFN v2)

33 lightweight, standalone models that predict discrete Alzheimer's-disease–related phenotypes (e.g. medication use, APOE genotype, vascular pathology, sex) from SomaScan plasma proteomics. The full list is in phenotypes.tsv.

These accompany the manuscript "Fusing depth and breadth: Plasma proteomics scales deep Alzheimer's phenotyping to global cohorts." Continuous phenotypes (cognition, neuropathology burden, demographics, longitudinal change) live in the companion repo somascan-ad-regression-tabpfn-v2.

  • Task: 33 phenotypes (26 binary + 7 multiclass with 3–6 classes).
  • Input: SomaScan plasma proteomics, with aptamers named by SeqId (e.g. seq.10001.7); see input_schema.csv (7,287 aptamers). Each model uses its own ~500 selected aptamers.
  • Artifact: one meta/<id>.json + one TabM .pt (~22 MB) per phenotype.
  • Pipeline: st-tabpfn-pipeline (not needed at inference).
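As a rough illustration of the SeqId naming mentioned above, the sketch below separates aptamer columns from other columns in a header. The regular expression is an assumption inferred from the single example seq.10001.7, and split_columns and the sample header are hypothetical names, not part of predict.py.

```python
import re

# Assumed pattern, inferred only from the example "seq.10001.7".
SEQID_PATTERN = re.compile(r"^seq\.\d+\.\d+$")


def split_columns(columns):
    """Partition column names into (aptamer columns, other columns)."""
    aptamers = [c for c in columns if SEQID_PATTERN.match(c)]
    others = [c for c in columns if not SEQID_PATTERN.match(c)]
    return aptamers, others


# Illustrative header; real files carry thousands of aptamer columns.
header = ["sample_id", "seq.10001.7", "age", "seq.20002.13"]
aptamers, others = split_columns(header)
print(aptamers)
print(others)
```

The authoritative list of accepted names remains input_schema.csv; a pattern check like this is only a quick sanity test before running the CLI.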

Quickstart

pip install torch tabm rtdl_num_embeddings numpy pandas

# predicted label for ALL 33 phenotypes (sample_id + one column per phenotype)
python predict.py --input plasma_somascan.tsv --output predictions.csv

# also emit per-class probabilities  (columns "<name>::<classlabel>")
python predict.py --input plasma.tsv --output out.csv --proba

# specific phenotypes (name or target_id); list what's available
python predict.py --input plasma.tsv --output out.csv --target dcfdx,apoe_genotype
python predict.py --list
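The --proba output names its probability columns "<name>::<classlabel>". A minimal sketch of grouping those columns by phenotype follows; group_proba_columns is a hypothetical helper, and the class labels in the example header are made up for illustration.

```python
def group_proba_columns(header):
    """Map phenotype name -> list of (class label, column name)."""
    grouped = {}
    for col in header:
        # Split on the first "::"; columns without it are not probabilities.
        name, sep, label = col.partition("::")
        if sep:
            grouped.setdefault(name, []).append((label, col))
    return grouped


# Illustrative header only; the class labels here are invented.
header = ["sample_id", "dcfdx", "dcfdx::1", "dcfdx::2", "apoe_genotype::33"]
grouped = group_proba_columns(header)
for phenotype, columns in grouped.items():
    print(phenotype, columns)
```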

predict.py depends on public packages only. Input is a CSV/TSV with a sample-id column (sample_id / SAMPLE.ID, optional) plus SomaScan aptamer columns matched by name; column order is irrelevant and extra columns are ignored. The models were trained on complete data and ship no imputer, so provide complete aptamer values (NaN cells are rejected).
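The input rules above (match by name, ignore extras, reject missing values) can be checked before running the CLI. The following is a sketch using only the standard library; load_features is a hypothetical helper, not the loader inside predict.py, and treating empty or non-numeric cells as missing is an assumption.

```python
import csv
import io
import math


def load_features(text, required, delimiter="\t"):
    """Return {sample_id: [values in `required` order]}; raise on any gap."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    missing = [name for name in required if name not in fieldnames]
    if missing:
        raise ValueError(f"missing aptamer columns: {missing}")
    out = {}
    for i, row in enumerate(reader):
        # The sample-id column is optional; fall back to a row index.
        sample = row.get("sample_id") or row.get("SAMPLE.ID") or f"row{i}"
        values = []
        for name in required:
            cell = (row[name] or "").strip()
            try:
                value = float(cell)
            except ValueError:
                value = math.nan  # assumption: unparseable cells count as missing
            if math.isnan(value):
                raise ValueError(f"{sample}: no usable value for {name}")
            values.append(value)
        out[sample] = values
    return out


# Column order differs from `required`, and "note" is an extra column.
tsv = "sample_id\tseq.2.1\tnote\tseq.1.1\nS1\t3.5\tx\t1.25\n"
features = load_features(tsv, ["seq.1.1", "seq.2.1"])
print(features)
```

Failing early with the offending sample and aptamer name is usually more helpful than discovering a rejected file partway through a batch.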

Repo layout

path                          what
predict.py                    standalone loader / CLI
phenotypes.tsv                index: target_id, name, paper category + description, metric, n_classes, CV teacher/student
meta/<id>.json                per-model: selected feature names, arch, task, classes, weights path
models/<id>_student_tabm.pt   TabM student weights (+ embedding bins)
input_schema.csv              the 7,287 SomaScan SeqIds the panel can supply
parity_table.csv              per-phenotype teacher-vs-student CV detail

How the models were built

  1. Feature pre-filter (capacity reduction, not selection): CatBoost ranks aptamers and the top 500 are passed to the teacher, solely to fit TabPFN's input limit (matches the manuscript: --method CatBoost, --selected_k 500).
  2. Teacher: TabPFN v2 classifier (tabpfn-v2-classifier.ckpt), a foundation model.
  3. Distillation → student: a TabM network (parameter-efficient MLP ensemble + piecewise-linear embeddings) is trained to reproduce the teacher's class probabilities on a GMM-augmented transfer set (gmm:10000). At inference, the per-member sigmoid (binary) or softmax (multiclass) outputs are averaged across the TabM ensemble; the classes field in each meta/<id>.json maps the output index back to the original label.
  4. Deployable model: the teacher is dropped; only the student + metadata ship. No feature scaling.
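The inference-time averaging described in step 3 can be sketched in plain Python. This is an illustration under stated assumptions, not the code in predict.py: it assumes a binary head emits one logit per ensemble member for the second entry of classes, and that a multiclass head emits one logit vector per member. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    """Softmax with the max subtracted for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_binary(member_logits, classes):
    """Average per-member sigmoids, then map the argmax to a label."""
    p1 = sum(sigmoid(z) for z in member_logits) / len(member_logits)
    probs = [1.0 - p1, p1]
    best = max(range(2), key=lambda k: probs[k])
    return classes[best], probs


def predict_multiclass(member_logits, classes):
    """Average per-member softmax vectors, then map the argmax to a label."""
    per_member = [softmax(row) for row in member_logits]
    n = len(per_member)
    probs = [sum(m[k] for m in per_member) / n for k in range(len(classes))]
    best = max(range(len(classes)), key=lambda k: probs[k])
    return classes[best], probs


label_b, probs_b = predict_binary([2.0, 1.0], ["no", "yes"])
label_m, probs_m = predict_multiclass([[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]], ["a", "b", "c"])
print(label_b, probs_b)
print(label_m, probs_m)
```

Averaging probabilities rather than logits keeps each member's output on the same scale the student was trained to match, which is what the description above implies.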

Performance (held-out cross-validation)

Evaluated on the manuscript's exact per-phenotype, person-grouped 10-fold partition (recovered and re-applied), so these numbers are directly comparable to the paper. The student is distilled separately within each fold.
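To illustrate what "person-grouped" means here, the sketch below assigns folds so that every sample from one person lands in the same fold. It does not reproduce the manuscript's recovered partition; grouped_folds, the seed, and the round-robin assignment are assumptions chosen for simplicity.

```python
import random


def grouped_folds(person_ids, n_folds=10, seed=0):
    """Return one fold index per sample; samples of a person share a fold."""
    persons = sorted(set(person_ids))
    random.Random(seed).shuffle(persons)
    # Round-robin over shuffled persons keeps fold sizes roughly balanced.
    fold_of_person = {p: i % n_folds for i, p in enumerate(persons)}
    return [fold_of_person[p] for p in person_ids]


# Three people with repeated visits, split into three folds.
ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
folds = grouped_folds(ids, n_folds=3)
print(list(zip(ids, folds)))
```

Grouping by person prevents repeated samples from the same individual from appearing in both the training and held-out portions of a fold.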

task type       metric          n phenotypes   teacher (mean)   student (mean)   student ≥ teacher (% of phenotypes)
Binary          ROC-AUC         26             0.658            0.662            85%
Multiclass      macro OVR-AUC   6              0.654            0.656            67%
APOE genotype   accuracy        1              0.981            0.982            n/a

Per-phenotype numbers are in parity_table.csv.
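For readers unfamiliar with the multiclass metric, here is a small sketch of macro one-vs-rest AUC: the unweighted mean of per-class ROC-AUCs, each computed with one class as positive and the rest as negative. It is a plain-Python illustration, not the evaluation code behind the reported numbers; counting ties as 0.5 is an assumption, and the example data are invented.

```python
def roc_auc(labels, scores):
    """AUC as the probability a positive outscores a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, proba, classes):
    """Unweighted mean of one-vs-rest AUCs over all classes."""
    aucs = []
    for k, c in enumerate(classes):
        labels = [1 if y == c else 0 for y in y_true]
        aucs.append(roc_auc(labels, [row[k] for row in proba]))
    return sum(aucs) / len(aucs)


# Invented toy data: four samples, three classes.
y_true = ["a", "b", "c", "a"]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.1, 0.6, 0.3],
]
score = macro_ovr_auc(y_true, proba, ["a", "b", "c"])
print(round(score, 4))
```

The pairwise form is quadratic in sample count, which is fine for a sketch; a rank-based implementation would be preferable on full cohorts.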

Intended use & limitations

  • Research use only. Not a diagnostic device; predictions are population-level estimates.
  • Platform-specific: trained on SomaScan aptamer measurements; not transferable to other proteomic platforms without recalibration.

License

Student weights are distilled derivatives of a TabPFN v2 teacher. The teacher (Prior-Labs/TabPFN-v2-clf) is released under the Prior Labs License, which is Apache 2.0 with an additional attribution requirement (https://priorlabs.ai/tabpfn-license/). Because that license is Apache-2.0-based, these derivative students may be redistributed and used provided the Prior Labs / TabPFN attribution is preserved.

Citation

If you use these models, cite the manuscript (citation TBD), TabPFN (Hollmann et al., Nature 2025), and TabM (Gorishniy et al., 2024).
