SomaScan Inflammation Aging Clock (85 proteins)

A lightweight plasma-protein aging clock that predicts chronological age from 85 unique inflammatory proteins measured by SomaScan (125 aptamers / SomaScan features). The model is a TabM† student distilled from a TabPFN v2 teacher, so it runs at inference without any TabPFN dependency (small, DUA-friendly artifacts).

Trained on the ROSMAP cohort, panel-matched to the Olink Inflammation companion model.

Cohort                 ROSMAP
Platform               SomaScan
Aptamers (features)    125
Unique proteins        85
N (persons)            1,611 (1,313)
Teacher R²             0.304
Student R²             0.302
Gap (mean ± sd)        0.002 ± 0.006

All R² values come from 10-fold person-grouped CV. Student fidelity RMSE ≈ 0.070 in scaled-y units. The TabM† student reproduces its TabPFN v2 teacher to within ~0.002 R², which is smaller than the fold-to-fold standard deviation of the gap (0.006).
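Person-grouped CV means that every row belonging to one person is kept in a single fold, so repeat measurements cannot leak between training and test splits. The sketch below illustrates that idea with the standard library only. It is not the code used to produce the numbers above: the function names, the round-robin assignment, and the toy person IDs are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def person_grouped_folds(person_ids, n_folds=10, seed=0):
    """Give each row a fold index so that all rows of a person share one fold."""
    persons = sorted(set(person_ids))
    random.Random(seed).shuffle(persons)
    # Round-robin over shuffled persons (an assumed, simple assignment rule).
    fold_of_person = {p: i % n_folds for i, p in enumerate(persons)}
    return [fold_of_person[p] for p in person_ids]


def persons_spanning_folds(person_ids, folds):
    """Return persons whose rows fall in more than one fold (should be empty)."""
    seen = defaultdict(set)
    for person, fold in zip(person_ids, folds):
        seen[person].add(fold)
    return sorted(p for p, fs in seen.items() if len(fs) > 1)


# Toy data: 25 hypothetical persons, a few of them with repeat visits.
person_ids = [f"P{i:02d}" for i in range(25)] + ["P00", "P03", "P03", "P17"]
folds = person_grouped_folds(person_ids)
leaks = persons_spanning_folds(person_ids, folds)
```

An empty `leaks` list corresponds to the check that no person spans more than one fold.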

Note on counts: SomaScan measures some proteins with more than one aptamer. The model consumes 125 aptamer-level features (the feature_name list in meta.json), which map to 85 unique proteins.

Files

predict.py                       standalone inference script
meta.json                        feature order, y-scaler, arch config
models/
  T0001_model.pkl                slim model (~4 KB)
  T0001_student_tabm.pt          TabM† weights (~8 MB) + quantile bins
results/, *_results.csv          aggregate CV metrics summaries

Target T0001 = age_at_visit (chronological age at the visit).

Usage

predict.py runs the model with only public dependencies:

pip install torch tabm rtdl_num_embeddings numpy pandas

python predict.py --input proteins.csv --output ages.csv

--input is a CSV/TSV with one row per sample and one column per aptamer feature, named exactly as in the meta.json feature_name list (125 SomaScan features). Column order does not matter, and an optional sample_id column is carried through. The output has two columns: sample_id and predicted_age. The model expects complete values; no imputer is included. The script raises an error if any required feature is missing and warns about unused input columns.
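To make the input rules concrete, here is a minimal sketch of the same checks using only the standard library. It is not the code in predict.py: the function name `load_feature_rows`, the two toy feature names, and the comma-separated input are assumptions. In practice the feature list would be read from meta.json.

```python
import csv
import io
import warnings


def load_feature_rows(csv_text, feature_names, id_column="sample_id"):
    """Return (sample_ids, rows), with each row ordered like feature_names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []

    # Every required feature must be present; column order is irrelevant.
    missing = [f for f in feature_names if f not in columns]
    if missing:
        raise ValueError(f"missing required features: {missing}")

    # Extra columns are ignored, but reported.
    unused = [c for c in columns if c not in feature_names and c != id_column]
    if unused:
        warnings.warn(f"unused input columns: {unused}")

    sample_ids, rows = [], []
    for i, record in enumerate(reader):
        values = [record[f] for f in feature_names]
        # No imputation: an empty or absent cell is an error.
        if any(v is None or v.strip() == "" for v in values):
            raise ValueError(f"row {i} has empty values; complete data expected")
        sample_ids.append(record.get(id_column, str(i)))
        rows.append([float(v) for v in values])
    return sample_ids, rows


# Hypothetical feature names, used only for this toy example.
toy_features = ["apt_A", "apt_B"]
toy_csv = "sample_id,apt_B,apt_A\ns1,2.0,1.0\n"
ids, rows = load_feature_rows(toy_csv, toy_features)
```

Here the columns arrive as apt_B, apt_A, but each row is returned in the order given by `toy_features`.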

Method

  • Teacher: TabPFN v2.
  • Student: TabM† distilled on a GMM-augmented transfer set (target 10,000 rows) with a 5-quantile pinball regression loss.
  • CV: 10-fold person-grouped (verified no person spans >1 fold).
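For readers unfamiliar with the pinball (quantile) loss named in the Student bullet, a minimal pure-Python sketch follows. The five quantile levels in `TAUS` are an assumption, since the levels used in training are not stated here, and the function is an illustration rather than the training code.

```python
# Assumed quantile levels; the levels used for the actual student are not stated.
TAUS = (0.1, 0.3, 0.5, 0.7, 0.9)


def pinball_loss(y_true, y_pred_quantiles, taus=TAUS):
    """Mean pinball loss over samples and quantile levels.

    y_true: list of targets.
    y_pred_quantiles: one list per sample, holding one prediction per tau.
    """
    total, count = 0.0, 0
    for y, preds in zip(y_true, y_pred_quantiles):
        for tau, q in zip(taus, preds):
            diff = y - q
            # Under-prediction is weighted by tau, over-prediction by (1 - tau).
            total += max(tau * diff, (tau - 1.0) * diff)
            count += 1
    return total / count


# A perfect prediction at every quantile level gives zero loss.
perfect = pinball_loss([1.0], [[1.0] * len(TAUS)])
```

Because the two error directions are weighted differently, minimising this loss at level tau pushes the prediction toward the tau-th conditional quantile of the target.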

Inference contract

  • Order the features by meta.json → feature_name.
  • Rebuild the TabM model with piecewise-linear numeric embeddings and load the weights.
  • For each of the 32 ensemble members, reduce the 5 predicted quantiles to a point estimate (trapezoidal mean), then average the point estimates over the members.
  • Invert the y-scaler to recover age.
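The aggregation part of the contract (quantiles to point estimate, average over ensemble members, undo the scaling) can be sketched without torch or tabm. This is an illustration under stated assumptions, not the code in predict.py: the quantile levels in `TAUS`, the mean/standard-deviation form of the y-scaler, and the toy scaler values are all hypothetical. The real scaler parameters live in meta.json.

```python
from statistics import fmean

# Assumed quantile levels; the actual levels are not stated in this card.
TAUS = (0.1, 0.3, 0.5, 0.7, 0.9)


def trapezoidal_mean(quantile_values, taus=TAUS):
    """Trapezoid-rule average of the quantile curve over [taus[0], taus[-1]]."""
    area = 0.0
    for i in range(len(taus) - 1):
        width = taus[i + 1] - taus[i]
        area += 0.5 * (quantile_values[i] + quantile_values[i + 1]) * width
    return area / (taus[-1] - taus[0])


def predict_age(member_quantiles, y_mean, y_std):
    """member_quantiles: one list of scaled quantile predictions per member.

    Assumes the y-scaler is a standardisation (subtract mean, divide by std).
    """
    scaled = fmean(trapezoidal_mean(q) for q in member_quantiles)
    return scaled * y_std + y_mean


# Toy input: 32 members that all predict 0 in scaled units, with
# hypothetical scaler values, so the result is just the assumed mean.
toy_members = [[0.0] * len(TAUS) for _ in range(32)]
toy_age = predict_age(toy_members, y_mean=80.0, y_std=7.0)
```

With evenly spaced levels the trapezoidal mean gives the two outer quantiles half the weight of the inner ones, which is one reason it can differ from a plain average of the five values.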

Citation

Inflammatory aging clock for plasma inflammatory proteins (ROSMAP). Manuscript in preparation. See also the Olink companion model: inflammatory-aging-clock/olink-inflammation-92.
