
# T3 metrics: quick reference

This is the one-pager to read when looking at a T3 results file. Authoritative implementation: `scripts/eval_t3_oracle.py`. Design rationale (why these metrics and not heuristic overlap): `t3_evaluation_design.md`.

## Per-row binary outcomes

For each test row the scorer emits these flags:

| Flag | Definition | Notes |
| --- | --- | --- |
| `within_budget` | `hamming(pred, ref) <= row.metadata.edit_budget` | Row's own assigned budget, typically 10 bp absolute. |
| `length_preserved` | `len(pred) == len(ref)` | Reference is usually 500 bp. |
| `target_motif_present` | IUPAC regex match for `row.metadata.target_motif` in `pred` | Forward + reverse-complement scan. |
| `objective_success` | Depends on `row.metadata.edit_type` (see below). | The headline "did this edit do its job" flag. |
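A minimal sketch of how the first three flags could be computed. This is illustrative only; the authoritative logic lives in `scripts/eval_t3_oracle.py`, and the helper names and length-mismatch handling here are assumptions (the IUPAC table itself is the standard one):

```python
import re

# IUPAC ambiguity codes -> regex character classes (standard table).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
# Complement map that also handles ambiguity codes (R<->Y, K<->M, B<->V, D<->H).
COMP = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def iupac_regex(motif: str) -> str:
    return "".join(IUPAC[base] for base in motif)

def revcomp(seq: str) -> str:
    return seq.translate(COMP)[::-1]

def hamming(a: str, b: str) -> int:
    # Positionwise mismatches; counting a length mismatch as extra edits is
    # an assumption here -- length is also tracked via length_preserved.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def row_flags(pred: str, ref: str, edit_budget: int, target_motif: str) -> dict:
    motif_hit = (re.search(iupac_regex(target_motif), pred) is not None
                 or re.search(iupac_regex(revcomp(target_motif)), pred) is not None)
    return {
        "within_budget": hamming(pred, ref) <= edit_budget,
        "length_preserved": len(pred) == len(ref),
        "target_motif_present": motif_hit,  # forward + reverse-complement scan
    }
```

(`objective_success` is dispatched on `edit_type` and omitted here.)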

`objective_success` per `edit_type`:

| edit_type | objective_success true iff … |
| --- | --- |
| `activity_boost` | `pred_activity_src > ref_activity_src` (oracle activity in source cell type went up) |
| `cell_type_transfer` | `(pred_tgt - pred_src) - (ref_tgt - ref_src) > 0` (relative shift toward target cell type increased) |
| `promoter_retarget` | `target_motif_present` (the new TF motif landed in the sequence) |
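The dispatch in the table above, as a hedged sketch (the argument names are assumptions; the real scorer reads these values from the oracle outputs and row metadata):

```python
def objective_success(edit_type: str, pred_src: float, pred_tgt: float,
                      ref_src: float, ref_tgt: float, motif_present: bool) -> bool:
    # Mirrors the per-edit_type definitions in the table above.
    if edit_type == "activity_boost":
        return pred_src > ref_src
    if edit_type == "cell_type_transfer":
        return (pred_tgt - pred_src) - (ref_tgt - ref_src) > 0
    if edit_type == "promoter_retarget":
        return motif_present
    raise ValueError(f"unknown edit_type: {edit_type!r}")
```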

## Per-row continuous values

| Field | What it measures |
| --- | --- |
| `edit_distance` | Absolute Hamming distance between `pred` and `ref`, in bp. |
| `edit_distance_pct` | `edit_distance / len(ref)`: fraction of bases changed. |
| `pred_activity_src`, `pred_activity_tgt` | Oracle activity scores for `pred` in source / target cell type. |
| `ref_activity_src`, `ref_activity_tgt` | Oracle activity scores for `ref` in source / target cell type. |
| `activity_delta_src` | `pred_activity_src - ref_activity_src` (used for `activity_boost`). |
| `activity_relative_shift` | `(pred_tgt - pred_src) - (ref_tgt - ref_src)` (used for `cell_type_transfer`). |
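These continuous fields follow directly from the definitions; a sketch (treating a length difference as extra edit distance is an assumption of this sketch, not something the table specifies):

```python
def continuous_fields(pred: str, ref: str, pred_src: float, pred_tgt: float,
                      ref_src: float, ref_tgt: float) -> dict:
    # Hamming distance in bp, with any length difference counted as extra edits.
    edit_distance = (sum(a != b for a, b in zip(pred, ref))
                     + abs(len(pred) - len(ref)))
    return {
        "edit_distance": edit_distance,
        "edit_distance_pct": edit_distance / len(ref),
        "activity_delta_src": pred_src - ref_src,
        "activity_relative_shift": (pred_tgt - pred_src) - (ref_tgt - ref_src),
    }
```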

## Aggregate metrics (paper-table column candidates)

`mean_*` for each binary flag is the obvious "fraction of rows where the flag fired". The interesting non-trivial aggregates are:

### `in_budget_at_5pct` / `in_budget_at_10pct` / `in_budget_at_20pct`

This is the "percentage of edit distance" you asked about.

Definition: fraction of rows where `edit_distance` ≤ X% of `len(ref)`, ignoring the row's own `edit_budget`. This lets us compare across rows with different assigned budgets.

For a 500 bp reference enhancer:

| Threshold | Bp budget | Interpretation |
| --- | --- | --- |
| `in_budget_at_5pct` | ≤ 25 bp | Minimal, near-surgical edit. |
| `in_budget_at_10pct` | ≤ 50 bp | Moderate edit. |
| `in_budget_at_20pct` | ≤ 100 bp | Substantial edit (model is rewriting). |

So `in_budget_at_5pct = 0.85` means 85% of the model's edits change ≤ 5% of the sequence, i.e. the model is making minimal, focused changes.
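The aggregate itself is a one-liner over the per-row values; a sketch assuming each row carries its `edit_distance` and reference length (field names here are illustrative):

```python
def in_budget_at_pct(rows: list, pct: float) -> float:
    # Fraction of rows whose edit_distance is <= pct% of the reference length,
    # deliberately ignoring each row's own edit_budget.
    hits = [row["edit_distance"] <= (pct / 100) * row["ref_len"] for row in rows]
    return sum(hits) / len(hits)
```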

Why three thresholds? Different downstream applications care about different "small": a SNP-style retarget is OK with 5%, a CRE-style rewrite might allow 20%. Reporting all three lets reviewers pick the threshold that matches their bias. Paper precedent: Lin et al., NeurIPS 2024.

`within_budget` (no `_at_pct` suffix) is distinct: it uses the row's own assigned `edit_budget` (typically 10 bp absolute). `within_budget` and `in_budget_at_5pct` can disagree:

- `within_budget=False`, `in_budget_at_5pct=True`: the edit was 12 bp; the row's budget was 10 (fail), but 5% of 500 = 25 bp (pass).
- `within_budget=True`, `in_budget_at_5pct=False`: only happens when the row's budget exceeds 5% of `len(ref)`; rare for our prod data.

### `kmer6_diversity`

Fraction of unique 6-mers across the cohort's predicted sequences (pooled across rows). Catches the "model collapsed to one motif" failure mode.
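One plausible reading of this definition is unique 6-mers over total 6-mer windows, pooled across all predictions; the exact denominator used by the scorer may differ, so treat this as a sketch:

```python
def kmer_diversity(seqs: list, k: int = 6) -> float:
    # Unique k-mers observed / total k-mer windows across all sequences.
    seen = set()
    total = 0
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
            total += 1
    return len(seen) / total if total else 0.0
```

A model that collapsed to repeating one motif scores near 0; fully diverse output scores near 1.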

### `transfer_specificity` (cell_type_transfer rows only)

Fraction of `cell_type_transfer` rows where the prediction is both:

- more active in target than in source, and
- more active in target than the reference was in target.

Both required because activating the target alone could still fail the "transfer" intent (the original could already have been more active in target).
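Both conditions from the bullets above, in one hedged sketch (field names match the per-row continuous values; the function itself is illustrative):

```python
def transfer_specificity(rows: list):
    xfer = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    if not xfer:
        return None
    ok = [r["pred_activity_tgt"] > r["pred_activity_src"]      # shifted toward target
          and r["pred_activity_tgt"] > r["ref_activity_tgt"]   # beats the reference there
          for r in xfer]
    return sum(ok) / len(ok)
```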

### `mean_target_motif_pwm_present`, `pwm_n_observed`

Optional supplementary check using a real PWM scan against `--meme-file`. Falls back to `None` when the MEME database isn't on disk. Confirms the IUPAC regex match isn't a false positive on a chance A/C/G/T match.

## Per-cell breakdown

`per_cell_type` repeats the aggregates above with rows bucketed by `row.metadata.cell_type` (Ex / In / OPC / Ast / Oli / Mic / End). Lets us see whether the model is uniformly OK across cell types or biased toward the over-represented Ex cell type.
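The bucketing is straightforward; a sketch assuming each row dict exposes its `cell_type` and the binary flag of interest:

```python
from collections import defaultdict

def per_cell_mean(rows: list, flag: str = "objective_success") -> dict:
    # Group rows by cell type, then take the mean of the flag per bucket.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["cell_type"]].append(bool(row[flag]))
    return {cell: sum(vals) / len(vals) for cell, vals in buckets.items()}
```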

## RFT-specific multi-turn metadata

`scripts/rft_t3.py` (with `--rounds R > 1`) adds these fields to each output row's metadata:

| Field | Meaning |
| --- | --- |
| `rft_rounds_used` | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
| `rft_total_candidates` | Total candidates the model produced for this row across all rounds. |
| `rft_winner_round` | Round index (0-based) that produced the chosen candidate. |
| `rft_winner_margin` | The objective margin of the chosen candidate (per edit_type: `activity_delta_src` for boost, `activity_relative_shift` for transfer, 1.0 for retarget). |
| `rft_winner_edit_distance` | Hamming distance of the chosen candidate to `ref` (bp). |
| `rft_winner_edit_distance_pct` | The same, as a fraction of `len(ref)`. |
| `rft_source` | `"candidate"` if the oracle picked a winner; `"heuristic_fallback"` if no candidate satisfied all constraints and we kept the heuristic gold. |

Read these to track:

- keep-rate: fraction of rows with `rft_source == "candidate"`
- mean rounds-to-success: `mean(rft_rounds_used | rft_source == "candidate")`
- margin distribution: histogram of `rft_winner_margin`
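The three summaries can be read off the per-row metadata dicts like so (the function and its return keys are illustrative, not part of the scripts):

```python
def rft_summary(metadatas: list) -> dict:
    # metadatas: one dict per row, carrying the rft_* fields above.
    kept = [m for m in metadatas if m.get("rft_source") == "candidate"]
    return {
        "keep_rate": len(kept) / len(metadatas),
        "mean_rounds_to_success": (sum(m["rft_rounds_used"] for m in kept) / len(kept)
                                   if kept else None),
        "winner_margins": [m.get("rft_winner_margin") for m in kept],
    }
```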

## Where to look

| File | What's in it |
| --- | --- |
| `runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json` | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
| `runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json` | Same metrics for the zero-shot LLM baseline. |
| `runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl` | Post-RFT training JSONL. Inspect `metadata.rft_*` fields per row to see how many rounds each row needed and the winner margins. |