pevo-msa-mlm-19way (model Hub)
Checkpoints and configs for phylogeny-aware MSA masked language models (L7 / N10 registry sweep, 179 runs).
| Resource | Location |
|---|---|
| Training & eval code | github.com/jasperyeoh/pevo-msa-primate-genomics |
| Data & thesis tables | jasperyeoh2/pevo-msa-grch38-19way |
| Figure/table → checkpoint map | GitHub EXPERIMENT_MAP.md |
Dataset vs model — which Hub?
| Need | Use |
|---|---|
| Zarr training corpus, MAF, VEP CSV/parquet, Feng summaries | Dataset Hub |
| checkpoint-best/, config.json, registry sweep weights | This model Hub |
Quick download
pip install -U huggingface_hub
# Paper main models (examples)
for id in 22 126 206 207 210; do
  hf download jasperyeoh2/pevo-msa-mlm-19way --repo-type model --local-dir models_hf \
    --include "base_models/full_training_${id}/**"
done
# Full registry sweep (very large)
hf download jasperyeoh2/pevo-msa-mlm-19way --repo-type model --local-dir models_hf
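The same selective download can also be driven from Python. The following is a minimal sketch, assuming `huggingface_hub` is installed and that its `snapshot_download` include filters behave like the CLI `--include` globs; the helper names `include_patterns` and `download_paper_models` are illustrative and not part of this repo.

```python
from pathlib import Path

REPO_ID = "jasperyeoh2/pevo-msa-mlm-19way"  # repo id from the commands above
PAPER_IDS = (22, 126, 206, 207, 210)        # same ids as the shell loop


def include_patterns(ids=PAPER_IDS):
    """Build one glob per run, mirroring the --include argument of the shell loop."""
    return [f"base_models/full_training_{run_id}/**" for run_id in ids]


def download_paper_models(local_dir="models_hf", ids=PAPER_IDS):
    """Fetch only the selected runs (hypothetical helper; needs network access)."""
    # Imported lazily so include_patterns() stays usable without the library.
    from huggingface_hub import snapshot_download

    return Path(
        snapshot_download(
            repo_id=REPO_ID,
            repo_type="model",
            local_dir=local_dir,
            allow_patterns=include_patterns(ids),
        )
    )
```

Calling `download_paper_models()` should leave the five runs under `models_hf/base_models/`, matching the shell loop; omitting `allow_patterns` would correspond to the full registry sweep download.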
Eval path convention:
models_hf/base_models/full_training_206/checkpoint-best/
Match config in Git: phylo_msa1/configs/full_training_206.json.
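To keep the checkpoint path and the Git config in step, both can be derived from the run id. This is a small sketch of the convention above; the function names and the `suffix` argument (which, if used, includes its leading underscore) are assumptions made for illustration.

```python
from pathlib import Path


def checkpoint_dir(run_id, root="models_hf", suffix=""):
    """Local checkpoint-best/ directory for a run under the download root."""
    return Path(root) / "base_models" / f"full_training_{run_id}{suffix}" / "checkpoint-best"


def git_config_path(run_id, repo_root="."):
    """Matching config file inside a clone of the GitHub repo."""
    return Path(repo_root) / "phylo_msa1" / "configs" / f"full_training_{run_id}.json"


def missing_inputs(run_id, root="models_hf", repo_root="."):
    """Return the expected paths that do not exist yet, so a run can fail early."""
    expected = [checkpoint_dir(run_id, root), git_config_path(run_id, repo_root)]
    return [p for p in expected if not p.exists()]
```

For example, `checkpoint_dir(206)` yields `models_hf/base_models/full_training_206/checkpoint-best`, and `missing_inputs(206)` lists whichever of the two paths still needs to be downloaded or cloned.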
Repository layout
base_models/
full_training_<id>[_<suffix>]/
checkpoint-best/ ← use for VEP / downstream scoring
config.json
training_args.bin
eval_results.json (when uploaded)
configs/ ← mirrors phylo_msa1/configs/*.json
inventory/ ← upload manifests (CSV)
model.safetensors ← convenience single-model export at repo root
~179 runs under base_models/full_training_*.
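After a partial download it can help to check which runs are present locally. The sketch below assumes only the `base_models/full_training_<id>[_<suffix>]/checkpoint-best/` naming shown in the layout; `list_runs` is a hypothetical helper, not a script shipped with the repo.

```python
import re
from pathlib import Path

# full_training_<id> with an optional _<suffix>, as in the layout above.
RUN_DIR = re.compile(r"^full_training_(\d+)(?:_(.+))?$")


def list_runs(root="models_hf"):
    """Return sorted (run_id, suffix, has_checkpoint_best) tuples found locally."""
    base = Path(root) / "base_models"
    if not base.is_dir():
        return []
    runs = []
    for entry in base.iterdir():
        match = RUN_DIR.match(entry.name)
        if entry.is_dir() and match:
            has_best = (entry / "checkpoint-best").is_dir()
            runs.append((int(match.group(1)), match.group(2) or "", has_best))
    return sorted(runs)
```

Comparing `len(list_runs())` with the run count stated above gives a quick sanity check on a full sweep download.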
Paper models → checkpoint paths
| Model | HF path | Git config | Main thesis use |
|---|---|---|---|
| Exp.22 (L7) | base_models/full_training_22/checkpoint-best/ | phylo_msa1/configs/full_training_22.json | tab:vep_runs, gnomAD panel |
| Exp.126 (L7) | base_models/full_training_126/checkpoint-best/ | .../full_training_126.json | Feng L7 window sweep |
| Exp.206 (N10) | base_models/full_training_206/checkpoint-best/ | .../full_training_206.json | strict-v2 main result, gnomAD |
| Exp.207 (N10) | base_models/full_training_207/checkpoint-best/ | .../full_training_207.json | leaderboard, gnomAD, matched-width parent |
| Exp.210 (N10) | base_models/full_training_210/checkpoint-best/ | .../full_training_210.json | gnomAD panel |
| GPN-MSA | root model.safetensors | n/a | baseline in VEP / gnomAD |
| Matched-width arms | full_training_207_{human_only,matched_7way,full_19way}_*/checkpoint-best/ | ablation configs under downstream_tasks/matched_width_ablation/ | May 2026 strict-v2 ablation |
VEP metrics CSVs and per-variant parquets live on the dataset Hub (thesis_repro/phylo_msa1_outputs/), not here.
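The matched-width row in the table uses brace shorthand, which Python's `fnmatch` treats as literal characters. Assuming the Hub include filters follow `fnmatch` semantics, and assuming the arm directories sit under `base_models/` like the other runs, the shorthand can be expanded into one pattern per arm as sketched here (helper names are illustrative).

```python
from fnmatch import fnmatch

# Arm names copied from the "Matched-width arms" table row.
ARMS = ("human_only", "matched_7way", "full_19way")


def matched_width_patterns(parent_id=207, arms=ARMS, prefix="base_models"):
    """Expand the brace shorthand into one glob per arm.

    Assumption: the arm directories live under base_models/ like other runs.
    """
    return [
        f"{prefix}/full_training_{parent_id}_{arm}_*/checkpoint-best/**"
        for arm in arms
    ]


def matches_any(path, patterns):
    """True if a repo-relative file path matches at least one glob."""
    return any(fnmatch(path, pattern) for pattern in patterns)
```

The resulting list can be passed as `allow_patterns` to `huggingface_hub.snapshot_download`; `matches_any` is only there to check a pattern set against known file paths before starting a download.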
Scoring / VEP reproduction
- git clone + pip install -r phylo_msa1/requirements.txt
- Download checkpoint(s) from this repo (table above)
- Optional: hf download .../pevo-msa-grch38-19way --include 'thesis_repro/**' for cohort parquets
- Run scoring scripts under phylo_msa1/scripts/ pointing at checkpoint-best/
To only rebuild published tables without re-scoring: use dataset Hub thesis_repro/ + GitHub thesis_manu_v2_6/scripts/ (Tier A in EXPERIMENT_MAP.md).
Retrain from scratch
- Dataset Hub: download & extract MSAASR Zarr bundles
- This repo: not required until you export new checkpoints
- Git: phylo_msa1/scripts/ + configs/full_training_*.json
Pin Git commit, dataset revision, and model revision for exact metric match.
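One way to make the three pins explicit is to record them in a small file and pass the model revision to the download call. This is a sketch only: the file name, the key names, and both helpers are assumptions, and the revision strings must be replaced with the real commit hashes of your run.

```python
import json
from pathlib import Path

REPO_ID = "jasperyeoh2/pevo-msa-mlm-19way"


def write_pins(path, git_commit, dataset_revision, model_revision):
    """Save the three revisions side by side so a result can be traced later."""
    pins = {
        "git_commit": git_commit,              # commit of the GitHub code repo
        "dataset_revision": dataset_revision,  # revision of the dataset Hub repo
        "model_revision": model_revision,      # revision of this model Hub repo
    }
    Path(path).write_text(json.dumps(pins, indent=2) + "\n")
    return pins


def download_pinned(pins, local_dir="models_hf", allow_patterns=None):
    """Download this model repo at the pinned revision (needs network access)."""
    # Imported lazily so write_pins() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="model",
        revision=pins["model_revision"],
        local_dir=local_dir,
        allow_patterns=allow_patterns,
    )
```

Keeping the pin file next to the scoring outputs makes it easier to tell later whether a metric difference comes from the code, the data, or the checkpoint.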
License
MIT (model card). Respect licenses for gpn/ and phylo_msa1/ dependencies.