pevo-msa-mlm-19way (model Hub)

Checkpoints and configs for phylogeny-aware MSA masked language models (L7 / N10 registry sweep, ~179 runs).


Resource	Location
Training & eval code	github.com/jasperyeoh/pevo-msa-primate-genomics
Data & thesis tables	jasperyeoh2/pevo-msa-grch38-19way (dataset Hub)
Figure/table → checkpoint map	`EXPERIMENT_MAP.md` in the GitHub repo

Dataset vs model — which Hub?

Need	Use
Zarr training corpus, MAF, VEP CSV/parquet, Feng summaries	Dataset Hub
`checkpoint-best/`, `config.json`, registry sweep weights	This model Hub

Quick download

pip install -U huggingface_hub

# Main paper models (examples)
for id in 22 126 206 207 210; do
  hf download jasperyeoh2/pevo-msa-mlm-19way --repo-type model --local-dir models_hf \
    --include "base_models/full_training_${id}/**"
done

# Full registry sweep (very large)
hf download jasperyeoh2/pevo-msa-mlm-19way --repo-type model --local-dir models_hf
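
For scripted pipelines, the same selective download can be sketched in Python with `huggingface_hub.snapshot_download`, whose `allow_patterns` argument plays the role of `--include`. The helper names `run_patterns` and `download_runs` are illustrative and not part of the project code; the run ids are the paper models used in the loop above.

```python
from typing import Iterable, List

REPO_ID = "jasperyeoh2/pevo-msa-mlm-19way"
PAPER_RUN_IDS = (22, 126, 206, 207, 210)


def run_patterns(run_ids: Iterable[int]) -> List[str]:
    """Build one glob pattern per run, mirroring the --include value above."""
    return [f"base_models/full_training_{run_id}/**" for run_id in run_ids]


def download_runs(run_ids: Iterable[int] = PAPER_RUN_IDS,
                  local_dir: str = "models_hf") -> str:
    """Download only the selected run directories; return the local path."""
    # Imported lazily so run_patterns() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=run_patterns(run_ids),
    )
```

Calling `download_runs([206])` would fetch a single run into `models_hf/`; calling it with no arguments fetches the five paper models in one request instead of five CLI invocations.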

Eval path convention:

models_hf/base_models/full_training_206/checkpoint-best/

The matching config in the Git repo is phylo_msa1/configs/full_training_206.json.
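
A small helper can keep the checkpoint path and its Git config in sync. This is a minimal sketch: the function names are hypothetical, and it assumes every run follows the same naming pattern shown here for run 206.

```python
from pathlib import Path


def checkpoint_dir(run_id: int, suffix: str = "",
                   local_dir: str = "models_hf") -> Path:
    """Local path of checkpoint-best/ for a run (optional _<suffix>)."""
    name = f"full_training_{run_id}" + (f"_{suffix}" if suffix else "")
    return Path(local_dir) / "base_models" / name / "checkpoint-best"


def git_config_path(run_id: int,
                    config_dir: str = "phylo_msa1/configs") -> Path:
    """Path of the run's JSON config inside the Git checkout (assumed pattern)."""
    return Path(config_dir) / f"full_training_{run_id}.json"
```

For example, `checkpoint_dir(206)` and `git_config_path(206)` reproduce the two paths quoted above.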

Repository layout

base_models/
  full_training_<id>[_<suffix>]/
    checkpoint-best/       ← use for VEP / downstream scoring
    config.json
    training_args.bin
    eval_results.json      (when uploaded)
configs/                   ← mirrors phylo_msa1/configs/*.json
inventory/                 ← upload manifests (CSV)
model.safetensors          ← convenience single-model export at repo root

There are ~179 runs under base_models/full_training_*.
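
After a partial download it can help to check which run directories are complete. The sketch below walks a local copy and reports missing entries per run; `missing_entries` is a hypothetical helper, and it treats `eval_results.json` as optional because the layout marks it "when uploaded".

```python
from pathlib import Path
from typing import Dict, List

# Entries the layout lists for every run (eval_results.json is optional).
EXPECTED = ("checkpoint-best", "config.json", "training_args.bin")


def missing_entries(local_dir: str = "models_hf") -> Dict[str, List[str]]:
    """Map each local run directory name to the expected entries it lacks."""
    base = Path(local_dir) / "base_models"
    report: Dict[str, List[str]] = {}
    for run_dir in sorted(base.glob("full_training_*")):
        if run_dir.is_dir():
            report[run_dir.name] = [
                name for name in EXPECTED if not (run_dir / name).exists()
            ]
    return report
```

A run that maps to an empty list has everything the layout lists; an empty report means no run directories were found under `local_dir`.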

Paper models → checkpoint paths

Model	HF path	Git config	Main thesis use
Exp.22 (L7)	`base_models/full_training_22/checkpoint-best/`	`phylo_msa1/configs/full_training_22.json`	`tab:vep_runs`, gnomAD panel
Exp.126 (L7)	`base_models/full_training_126/checkpoint-best/`	`.../full_training_126.json`	Feng L7 window sweep
Exp.206 (N10)	`base_models/full_training_206/checkpoint-best/`	`.../full_training_206.json`	strict-v2 main result, gnomAD
Exp.207 (N10)	`base_models/full_training_207/checkpoint-best/`	`.../full_training_207.json`	leaderboard, gnomAD, matched-width parent
Exp.210 (N10)	`base_models/full_training_210/checkpoint-best/`	`.../full_training_210.json`	gnomAD panel
GPN-MSA	root `model.safetensors`	—	baseline in VEP / gnomAD
Matched-width arms	`full_training_207_{human_only,matched_7way,full_19way}_*/checkpoint-best/`	ablation configs under `downstream_tasks/matched_width_ablation/`	May 2026 strict-v2 ablation

VEP metrics CSVs and per-variant parquets live on the dataset Hub (thesis_repro/phylo_msa1_outputs/), not here.

Scoring / VEP reproduction

Clone the GitHub repo, then run pip install -r phylo_msa1/requirements.txt
Download the checkpoint(s) you need from this repo (see the table above)
Optional: hf download .../pevo-msa-grch38-19way --repo-type dataset --include 'thesis_repro/**' to fetch the cohort parquets
Run the scoring scripts under phylo_msa1/scripts/, pointing them at checkpoint-best/

To rebuild the published tables without re-scoring, use the dataset Hub thesis_repro/ folder together with the GitHub thesis_manu_v2_6/scripts/ (Tier A in EXPERIMENT_MAP.md).

Retrain from scratch

Dataset Hub: download & extract MSAASR Zarr bundles
This repo: not required until you export new checkpoints
Git: phylo_msa1/scripts/ + configs/full_training_*.json

Pin the Git commit, dataset revision, and model revision to reproduce the metrics exactly.
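
One way to make the three pins explicit is to record them in a small JSON file and pass the Hub revisions to `snapshot_download` through its `revision` argument. This is a sketch only: the helper names and the pin-file layout are assumptions, and the revision strings are placeholders you would replace with real commit hashes.

```python
import json
from typing import Dict


def pinned_sources(model_revision: str,
                   dataset_revision: str) -> Dict[str, Dict[str, str]]:
    """Keyword arguments for snapshot_download, one set per Hub repo."""
    return {
        "model": {
            "repo_id": "jasperyeoh2/pevo-msa-mlm-19way",
            "repo_type": "model",
            "revision": model_revision,
        },
        "dataset": {
            "repo_id": "jasperyeoh2/pevo-msa-grch38-19way",
            "repo_type": "dataset",
            "revision": dataset_revision,
        },
    }


def write_pin_file(path: str, git_commit: str,
                   model_revision: str, dataset_revision: str) -> None:
    """Store all three pins next to the results they produced."""
    pins = {
        "git_commit": git_commit,
        "hub": pinned_sources(model_revision, dataset_revision),
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(pins, handle, indent=2)
```

With `huggingface_hub` installed, `snapshot_download(**pinned_sources(m, d)["model"], local_dir="models_hf")` would then fetch exactly the pinned model revision.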

License

MIT (model card). Respect the licenses of the gpn/ and phylo_msa1/ dependencies.

Safetensors: model size 19.4M params, tensor type F32
