SomaScan → Alzheimer's-phenotype models: classification (distilled TabM students, TabPFN v2)
33 lightweight, standalone models that predict discrete Alzheimer's-disease-related
phenotypes (e.g. medication use, APOE genotype, vascular pathology, sex) from SomaScan
plasma proteomics. The full list is in `phenotypes.tsv`.
These accompany the manuscript "Fusing depth and breadth: Plasma proteomics scales deep
Alzheimer's phenotyping to global cohorts." Continuous phenotypes (cognition, neuropathology burden,
demographics, longitudinal change) live in the companion repo somascan-ad-regression-tabpfn-v2.
- Task: 33 phenotypes (26 binary + 7 multiclass with 3-6 classes).
- Input: SomaScan plasma proteomics, with aptamers named by SeqId (e.g. `seq.10001.7`); see `input_schema.csv` (7,287 aptamers). Each model uses its own ~500 selected aptamers.
- Artifact: one `meta/<id>.json` + one TabM `.pt` (~22 MB) per phenotype.
- Pipeline: `st-tabpfn-pipeline` (not needed at inference).
Quickstart
pip install torch tabm rtdl_num_embeddings numpy pandas
# predicted label for ALL 33 phenotypes (sample_id + one column per phenotype)
python predict.py --input plasma_somascan.tsv --output predictions.csv
# also emit per-class probabilities (columns "<name>::<classlabel>")
python predict.py --input plasma.tsv --output out.csv --proba
# specific phenotypes (name or target_id); list what's available
python predict.py --input plasma.tsv --output out.csv --target dcfdx,apoe_genotype
python predict.py --list
`predict.py` depends on public packages only. Input is a CSV/TSV with a sample-id column
(`sample_id` / `SAMPLE.ID`, optional) plus SomaScan aptamer columns matched by name; column
order is irrelevant and extra columns are ignored. The models were trained on complete data and
ship no imputer, so provide complete aptamer values (NaN cells are rejected).
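A pre-flight check along the following lines can catch missing aptamer columns or incomplete cells before `predict.py` is called. This is a minimal sketch using only the Python standard library. The function name `check_input` and the `required` list are hypothetical (in practice the list would come from a model's selected feature names), the SeqIds other than `seq.10001.7` are made up for the toy table, and the sketch does not reproduce `predict.py`'s own validation.

```python
import csv
import io
import math


def check_input(handle, required, delimiter="\t"):
    """Return (missing_columns, bad_cells) for a delimited SomaScan table.

    `required` is a hypothetical list of aptamer column names,
    e.g. the features one model was trained on.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [name for name in required if name not in header]
    present = [name for name in required if name in header]

    bad_cells = []  # (row_index, column) pairs that are empty, non-numeric, or NaN
    for i, row in enumerate(reader):
        for name in present:
            raw = (row[name] or "").strip()
            try:
                value = float(raw)
            except ValueError:
                bad_cells.append((i, name))
                continue
            if math.isnan(value):
                bad_cells.append((i, name))
    return missing, bad_cells


# Toy table: columns are in a different order than `required`, and one extra column is present.
demo = io.StringIO(
    "sample_id\tseq.10001.7\textra\tseq.10000.28\n"
    "S1\t1.5\tx\t2.0\n"
    "S2\tNaN\ty\t3.1\n"
)
missing, bad_cells = check_input(demo, ["seq.10000.28", "seq.10001.7", "seq.9999.1"])
print("missing columns:", missing)
print("incomplete cells:", bad_cells)
```

Columns are looked up by name, so order and extra columns do not matter, mirroring the matching behaviour described above.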
Repo layout
| path | what |
|---|---|
| `predict.py` | standalone loader / CLI |
| `phenotypes.tsv` | index: target_id, name, paper category + description, metric, n_classes, CV teacher/student |
| `meta/<id>.json` | per-model: selected feature names, arch, task, classes, weights path |
| `models/<id>_student_tabm.pt` | TabM student weights (+ embedding bins) |
| `input_schema.csv` | the 7,287 SomaScan SeqIds the panel can supply |
| `parity_table.csv` | per-phenotype teacher-vs-student CV detail |
How the models were built
- Feature pre-filter (capacity reduction, not selection): CatBoost ranks aptamers and the top 500 are passed to the teacher, solely to fit TabPFN's input limit (matches the manuscript: `--method CatBoost`, `--selected_k 500`).
- Teacher: TabPFN v2 classifier (`tabpfn-v2-classifier.ckpt`), a foundation model.
- Distillation → student: a TabM network (parameter-efficient MLP ensemble + piecewise-linear embeddings) is trained to reproduce the teacher's class probabilities on a GMM-augmented transfer set (`gmm:10000`). At inference, per-member sigmoid (binary) / softmax (multiclass) outputs are averaged across the TabM ensemble; `classes` in each `meta/<id>.json` maps the output back to the original label.
- Deployable model: the teacher is dropped; only the student + metadata ship. No feature scaling.
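The inference rule described above (per-member sigmoid or softmax, averaged across the ensemble, then mapped back through `classes`) can be sketched in plain Python. The function names, the toy logits, and the class labels are illustrative assumptions, as is the convention that a binary model emits one logit per member with the positive class listed second. This is not the code in `predict.py`, and the layout of the real TabM output may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one member's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_predict(member_logits, classes):
    """Average per-member probabilities, then pick the most likely class.

    member_logits: one entry per ensemble member; a float (binary, assumed
        single-logit output) or a list of floats (multiclass).
    classes: original labels in output order (stand-in for the `classes` field).
    """
    k = len(member_logits)
    if len(classes) == 2 and not isinstance(member_logits[0], (list, tuple)):
        p1 = sum(sigmoid(z) for z in member_logits) / k
        probs = [1.0 - p1, p1]
    else:
        per_member = [softmax(z) for z in member_logits]
        probs = [sum(p[c] for p in per_member) / k for c in range(len(classes))]
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs


# Toy multiclass case: three ensemble members, three classes.
label, probs = ensemble_predict(
    [[2.0, 0.1, -1.0], [1.5, 0.3, -0.5], [0.2, 0.4, 0.0]],
    ["class_a", "class_b", "class_c"],
)
# Toy binary case: three ensemble members, one logit each.
bin_label, bin_probs = ensemble_predict([2.0, 1.0, -0.5], ["no", "yes"])
print(label, probs)
print(bin_label, bin_probs)
```

Note that the probabilities are averaged after the nonlinearity; applying sigmoid or softmax to averaged logits would generally give a different result.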
Performance (held-out cross-validation)
Evaluated on the manuscript's exact per-phenotype, person-grouped 10-fold partition (recovered and re-applied), so these numbers are directly comparable to the paper. The student is distilled separately within each fold.
| task type | metric | n | teacher (mean) | student (mean) | student ≥ teacher |
|---|---|---|---|---|---|
| Binary | ROC-AUC | 26 | 0.658 | 0.662 | 85% |
| Multiclass | macro OVR-AUC | 6 | 0.654 | 0.656 | 67% |
| APOE genotype | accuracy | 1 | 0.981 | 0.982 | n/a |
Per-phenotype numbers are in parity_table.csv.
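For readers unfamiliar with the multiclass metric, macro one-vs-rest (OVR) AUC treats each class in turn as the positive class, computes a binary ROC-AUC from that class's predicted probability, and takes the unweighted mean over classes. The sketch below is a plain-Python illustration on toy data. It is an assumption that the reported values follow this standard definition (the one implemented by scikit-learn's `roc_auc_score` with `multi_class="ovr"` and `average="macro"`), and the helper names are hypothetical.

```python
def binary_auc(scores, positives):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(probs, labels, classes):
    """Unweighted mean of per-class one-vs-rest AUCs."""
    aucs = []
    for c, cls in enumerate(classes):
        scores = [row[c] for row in probs]
        positives = [y == cls for y in labels]
        aucs.append(binary_auc(scores, positives))
    return sum(aucs) / len(aucs)


# Toy data: six samples, three classes, one probability row per sample.
classes = ["a", "b", "c"]
labels = ["a", "a", "b", "b", "c", "c"]
probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]
score = macro_ovr_auc(probs, labels, classes)
print(score)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based implementation is preferable for large cohorts.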
Intended use & limitations
- Research use only. Not a diagnostic device; predictions are population-level estimates.
- Platform-specific: trained on SomaScan aptamer measurements; not transferable to other proteomic platforms without recalibration.
License
Student weights are distilled derivatives of a TabPFN v2 teacher. The teacher
(Prior-Labs/TabPFN-v2-clf) is released under the Prior Labs License: Apache 2.0 with an
additional attribution requirement (https://priorlabs.ai/tabpfn-license/). Because that
license is Apache-2.0-based, these derivative students may be redistributed and used provided the
Prior Labs / TabPFN attribution is preserved.
Citation
If you use these models, cite the manuscript (citation TBD), TabPFN (Hollmann et al., Nature 2025), and TabM (Gorishniy et al., 2024).