pevo-msa-mlm-19way (model Hub)

Checkpoints and configs for phylogeny-aware MSA masked language models (L7 / N10 registry sweep, 179 runs).

Training & eval code: github.com/jasperyeoh/pevo-msa-primate-genomics
Data & thesis tables: jasperyeoh2/pevo-msa-grch38-19way
Figure/table → checkpoint map: EXPERIMENT_MAP.md (GitHub)

Dataset vs model — which Hub?

| Need | Use |
| --- | --- |
| Zarr training corpus, MAF, VEP CSV/parquet, Feng summaries | Dataset Hub |
| checkpoint-best/, config.json, registry sweep weights | This model Hub |

Quick download

pip install -U huggingface_hub

# Paper main models (examples)
for id in 22 126 206 207 210; do
  hf download jasperyeoh2/pevo-msa-mlm-19way --repo-type model --local-dir models_hf \
    --include "base_models/full_training_${id}/**"
done

# Full registry sweep (very large)
hf download jasperyeoh2/pevo-msa-mlm-19way --repo-type model --local-dir models_hf

Eval path convention:

models_hf/base_models/full_training_206/checkpoint-best/

The matching config in the Git repository is phylo_msa1/configs/full_training_206.json.
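The path convention above can be wrapped in two small shell helpers. This is a minimal sketch: the function names `ckpt_dir` and `config_path` and the `MODELS_DIR` variable are hypothetical, and only the directory layout comes from this card.

```bash
# Hypothetical helpers; only the path layout is taken from this model card.
MODELS_DIR="${MODELS_DIR:-models_hf}"

# Print the local checkpoint directory for a run id (for example 206).
ckpt_dir() {
  printf '%s/base_models/full_training_%s/checkpoint-best/\n' "$MODELS_DIR" "$1"
}

# Print the matching config path inside the Git checkout.
config_path() {
  printf 'phylo_msa1/configs/full_training_%s.json\n' "$1"
}

ckpt_dir 206
config_path 206
```

Keeping the run id as the only argument makes it harder to pair a checkpoint with the wrong config when scoring several runs.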


Repository layout

base_models/
  full_training_<id>[_<suffix>]/
    checkpoint-best/       ← use for VEP / downstream scoring
    config.json
    training_args.bin
    eval_results.json      (when uploaded)
configs/                   ← mirrors phylo_msa1/configs/*.json
inventory/                 ← upload manifests (CSV)
model.safetensors          ← convenience single-model export at repo root

About 179 runs live under base_models/full_training_*.
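After a partial or full download, it can help to check which run directories actually contain a checkpoint-best/ folder. The sketch below assumes the layout shown above and a local directory named models_hf; the function name `list_runs` is hypothetical.

```bash
MODELS_DIR="${MODELS_DIR:-models_hf}"

# List run directories under <dir>/base_models/, flagging any that lack checkpoint-best/.
list_runs() {
  for run in "$1"/base_models/full_training_*/; do
    [ -d "$run" ] || continue              # the glob matched nothing
    name=$(basename "$run")
    if [ -d "${run}checkpoint-best" ]; then
      printf '%s\tok\n' "$name"
    else
      printf '%s\tmissing checkpoint-best\n' "$name"
    fi
  done
}

list_runs "$MODELS_DIR"
```

Piping the output through `grep -c 'ok$'` gives a quick count to compare against the expected number of runs.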


Paper models → checkpoint paths

| Model | HF path | Git config | Main thesis use |
| --- | --- | --- | --- |
| Exp.22 (L7) | base_models/full_training_22/checkpoint-best/ | phylo_msa1/configs/full_training_22.json | tab:vep_runs, gnomAD panel |
| Exp.126 (L7) | base_models/full_training_126/checkpoint-best/ | .../full_training_126.json | Feng L7 window sweep |
| Exp.206 (N10) | base_models/full_training_206/checkpoint-best/ | .../full_training_206.json | strict-v2 main result, gnomAD |
| Exp.207 (N10) | base_models/full_training_207/checkpoint-best/ | .../full_training_207.json | leaderboard, gnomAD, matched-width parent |
| Exp.210 (N10) | base_models/full_training_210/checkpoint-best/ | .../full_training_210.json | gnomAD panel |
| GPN-MSA | root model.safetensors | | baseline in VEP / gnomAD |
| Matched-width arms | full_training_207_{human_only,matched_7way,full_19way}_*/checkpoint-best/ | ablation configs under downstream_tasks/matched_width_ablation/ | May 2026 strict-v2 ablation |

VEP metrics CSVs and per-variant parquets live on the dataset Hub (thesis_repro/phylo_msa1_outputs/), not here.
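The matched-width row in the table above uses brace shorthand for three arms. A quoted `--include` pattern is not brace-expanded by the shell, so this sketch loops over the arm names explicitly and leaves the trailing `_*` suffix as a glob, since the card does not spell it out. The helper name `arm_pattern` is hypothetical, and the `base_models/` prefix is assumed from the repository layout.

```bash
REPO="jasperyeoh2/pevo-msa-mlm-19way"

# Print the --include pattern for one matched-width arm.
arm_pattern() {
  printf 'base_models/full_training_207_%s_*/**\n' "$1"
}

for arm in human_only matched_7way full_19way; do
  # Dry run: remove the leading "echo" to actually download.
  echo hf download "$REPO" --repo-type model --local-dir models_hf \
    --include "$(arm_pattern "$arm")"
done
```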


Scoring / VEP reproduction

  1. git clone + pip install -r phylo_msa1/requirements.txt
  2. Download checkpoint(s) from this repo (table above)
  3. Optional: hf download .../pevo-msa-grch38-19way --include 'thesis_repro/**' for cohort parquets
  4. Run scoring scripts under phylo_msa1/scripts/ pointing at checkpoint-best/
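
The steps above might look like the following dry run. It is a sketch under assumptions: the `run` wrapper, the `DRY_RUN` variable, and the `data_hf` directory name are hypothetical, the clone is assumed to land in the default directory name, and run 206 is used only as an example. Note that `--repo-type dataset` is needed when downloading from the dataset Hub, because `hf download` defaults to model repositories.

```bash
# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: code and Python dependencies.
run git clone https://github.com/jasperyeoh/pevo-msa-primate-genomics
run pip install -r pevo-msa-primate-genomics/phylo_msa1/requirements.txt

# Step 2: one checkpoint from this model repo.
run hf download jasperyeoh2/pevo-msa-mlm-19way --repo-type model \
  --local-dir models_hf --include "base_models/full_training_206/**"

# Step 3 (optional): cohort parquets from the dataset repo.
run hf download jasperyeoh2/pevo-msa-grch38-19way --repo-type dataset \
  --local-dir data_hf --include "thesis_repro/**"

# Step 4: run a scoring script from phylo_msa1/scripts/ and point it at
# models_hf/base_models/full_training_206/checkpoint-best/.
```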

To rebuild the published tables without re-scoring, use the dataset Hub thesis_repro/ directory together with the GitHub thesis_manu_v2_6/scripts/ (Tier A in EXPERIMENT_MAP.md).


Retrain from scratch

  1. Dataset Hub: download & extract MSAASR Zarr bundles
  2. This repo: not required until you export new checkpoints
  3. Git: phylo_msa1/scripts/ + configs/full_training_*.json

Pin the Git commit, the dataset revision, and the model revision to reproduce the reported metrics exactly.
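
One way to pin all three is sketched below. The values are placeholders, not real revisions, and the function name `pin_commands` and the `data_hf` directory are hypothetical. `git rev-parse HEAD` prints the commit of an existing checkout, and `--revision` accepts a branch, tag, or commit hash.

```bash
# Placeholder pins (hypothetical values): replace with real commit hashes or tags.
GIT_SHA="${GIT_SHA:-<git-commit>}"
MODEL_REV="${MODEL_REV:-<model-revision>}"
DATASET_REV="${DATASET_REV:-<dataset-revision>}"

# Print the three pinned commands; remove "echo" to run them.
pin_commands() {
  echo git -C pevo-msa-primate-genomics checkout "$GIT_SHA"
  echo hf download jasperyeoh2/pevo-msa-mlm-19way --repo-type model \
    --revision "$MODEL_REV" --local-dir models_hf
  echo hf download jasperyeoh2/pevo-msa-grch38-19way --repo-type dataset \
    --revision "$DATASET_REV" --local-dir data_hf
}

pin_commands
```

Recording the three values next to any metrics file makes a later mismatch easier to trace.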


License

MIT (model card). Respect licenses for gpn/ and phylo_msa1/ dependencies.
