SomaScan Inflammation Aging Clock (85 proteins)
A lightweight plasma-protein aging clock that predicts chronological age from 85 unique inflammatory proteins measured by SomaScan (125 aptamers / SomaScan features). The model is a TabM† student distilled from a TabPFN v2 teacher, so it runs at inference without any TabPFN dependency (small, DUA-friendly artifacts).
Trained on the ROSMAP cohort, panel-matched to the Olink Inflammation companion model.
| Cohort | Platform | Aptamers (features) | Unique proteins | N (persons) | Teacher R² | Student R² | Gap (mean ± sd) |
|---|---|---|---|---|---|---|---|
| ROSMAP | SomaScan | 125 | 85 | 1,611 (1,313) | 0.304 | 0.302 | 0.002 ± 0.006 |
All R² values come from 10-fold person-grouped CV; the student's fidelity RMSE against the teacher is ≈ 0.070 in scaled-y units. The TabM† student reproduces its TabPFN v2 teacher to within ~0.002 R², which is inside the fold-to-fold noise (sd 0.006).
Note on counts: SomaScan measures some proteins with more than one aptamer. The model consumes 125 aptamer-level features (the meta.json → feature_name list), which map to 85 unique proteins.
Files
- predict.py: standalone inference script
- meta.json: feature order, y-scaler, arch config
- models/
  - T0001_model.pkl: slim model (~4 KB)
  - T0001_student_tabm.pt: TabM† weights (~8 MB) + quantile bins
- results/, *_results.csv: aggregate CV metrics summaries
Target T0001 = age_at_visit (chronological age at the visit).
Usage
predict.py runs the model with only public dependencies:
pip install torch tabm rtdl_num_embeddings numpy pandas
python predict.py --input proteins.csv --output ages.csv
--input is a CSV/TSV with one row per sample and one column per aptamer feature,
named exactly as in meta.json → feature_name (125 SomaScan features). Column
order does not matter; an optional sample_id column is carried through. The
output columns are sample_id and predicted_age. The model expects complete
values; no imputer is applied. The script raises an error if any required
feature is missing and warns about unused input columns.
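The input contract described above can be sketched in a few lines. This is an illustration only, not the code in predict.py: the function name align_features, the toy feature names, and the fallback of using the row index when sample_id is absent are all assumptions made for the example. It uses only the standard library.

```python
import warnings


def align_features(rows, feature_names, id_column="sample_id"):
    """Reorder each input row to the model's feature order.

    rows: list of dicts (for example from csv.DictReader), one per sample.
    feature_names: ordered list, as stored under feature_name in meta.json.
    Returns (sample_ids, matrix); each matrix row follows feature_names.
    """
    if not rows:
        raise ValueError("input has no rows")
    columns = set(rows[0])
    missing = [f for f in feature_names if f not in columns]
    if missing:
        raise ValueError(f"missing required features: {missing}")
    unused = sorted(columns - set(feature_names) - {id_column})
    if unused:
        warnings.warn(f"ignoring unused input columns: {unused}")

    sample_ids, matrix = [], []
    for i, row in enumerate(rows):
        # Assumption: fall back to the row index when no sample_id is given.
        sample_ids.append(row.get(id_column, str(i)))
        values = [row[f] for f in feature_names]
        if any(v is None or str(v).strip() == "" for v in values):
            raise ValueError(f"row {i} has empty values; no imputer is applied")
        matrix.append([float(v) for v in values])
    return sample_ids, matrix


# Toy example with hypothetical feature names (the real list has 125 entries).
features = ["apt_A", "apt_B"]
rows = [{"sample_id": "s1", "apt_B": "2.0", "apt_A": "1.0"}]
ids, X = align_features(rows, features)
```

Selecting columns by name rather than by position is what makes the input column order irrelevant.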
Method
- Teacher: TabPFN v2.
- Student: TabM† distilled on a GMM-augmented transfer set (target 10,000 rows) with a 5-quantile pinball regression loss.
- CV: 10-fold person-grouped (verified no person spans >1 fold).
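Two of the ingredients listed above lend themselves to a short sketch: the pinball (quantile) loss and the check that no person spans more than one fold. Both are written here in plain Python under stated assumptions. The source says only that five quantiles are used, so the levels below are assumed, and the function names are hypothetical rather than taken from the training code.

```python
from collections import defaultdict


def pinball_loss(y_true, y_pred_quantiles, levels):
    """Mean pinball loss over samples and quantile levels.

    y_true: list of n targets.
    y_pred_quantiles: n lists, each holding one prediction per level.
    levels: quantile levels in (0, 1).
    """
    total = 0.0
    for y, preds in zip(y_true, y_pred_quantiles):
        for q, pred in zip(levels, preds):
            diff = y - pred
            # Under-prediction is weighted by q, over-prediction by (1 - q).
            total += max(q * diff, (q - 1.0) * diff)
    return total / (len(y_true) * len(levels))


def is_person_grouped(person_ids, fold_ids):
    """Return True if every person's samples fall in exactly one fold."""
    folds_seen = defaultdict(set)
    for person, fold in zip(person_ids, fold_ids):
        folds_seen[person].add(fold)
    return all(len(folds) == 1 for folds in folds_seen.values())


# Assumed quantile levels; the source does not list them.
LEVELS = (0.1, 0.3, 0.5, 0.7, 0.9)
loss = pinball_loss([1.0], [[0.0, 0.0, 0.0, 0.0, 0.0]], LEVELS)

# Toy fold assignment: person "p1" has two visits, both in fold 0.
grouped_ok = is_person_grouped(["p1", "p1", "p2"], [0, 0, 1])
grouped_bad = is_person_grouped(["p1", "p1", "p2"], [0, 1, 1])
```

Grouping by person matters here because the cohort has more samples than persons, so repeated visits from one person in both train and test folds would inflate R².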
Inference contract
Order the features by meta.json → feature_name; rebuild the TabM model with
piecewise-linear numeric embeddings and load the weights; for each of the 32
ensemble members, reduce the 5 predicted quantiles to a point estimate
(trapezoidal mean), then average these estimates across members; invert the
y-scaler to recover age.
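The aggregation steps of this contract can be sketched as follows. Several details are assumptions, because the source does not spell them out: the quantile levels, the reading of "trapezoidal mean" as a trapezoid-rule average of the quantile function over the covered levels, and a standard-scaler form for the y-scaler (age = scaled × scale + mean). The function names and the scaler values are made up for the example; the real values live in meta.json.

```python
def trapezoidal_mean(quantile_values, levels):
    """Trapezoid-rule average of the quantile function over [levels[0], levels[-1]]."""
    area = 0.0
    for i in range(len(levels) - 1):
        width = levels[i + 1] - levels[i]
        area += 0.5 * (quantile_values[i] + quantile_values[i + 1]) * width
    return area / (levels[-1] - levels[0])


def predict_age(member_quantiles, levels, y_mean, y_scale):
    """Combine ensemble outputs into one age for a single sample.

    member_quantiles: one list of quantile predictions (scaled-y) per member.
    """
    per_member = [trapezoidal_mean(q, levels) for q in member_quantiles]
    scaled = sum(per_member) / len(per_member)
    # Assumed standard-scaler inverse: age = scaled * scale + mean.
    return scaled * y_scale + y_mean


LEVELS = (0.1, 0.3, 0.5, 0.7, 0.9)            # assumed levels
members = [[-1.0, -0.5, 0.0, 0.5, 1.0]] * 32  # toy outputs for 32 members
age = predict_age(members, LEVELS, y_mean=80.0, y_scale=7.0)  # toy scaler values
```

With evenly spaced levels, the trapezoid rule gives the two outer quantiles half the weight of the inner ones, so the point estimate is less sensitive to the tails than a plain average of the five quantiles.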
Citation
Inflammatory aging clock for plasma inflammatory proteins (ROSMAP). Manuscript in
preparation. See also the Olink companion model:
inflammatory-aging-clock/olink-inflammation-92.