
# T3 metrics: quick reference

This is the one-pager to read when looking at a T3 results file. Authoritative implementation: `scripts/eval_t3_oracle.py`. Design rationale (why these metrics and not heuristic overlap): `t3_evaluation_design.md`.

## Per-row binary outcomes

For each test row the scorer emits these flags:

| Flag | Definition | Notes |
| --- | --- | --- |
| `within_budget` | `hamming(pred, ref) <= row.metadata.edit_budget` | Row's own assigned budget, typically 10 bp absolute. |
| `length_preserved` | `len(pred) == len(ref)` | Reference is usually 500 bp. |
| `target_motif_present` | IUPAC regex match for `row.metadata.target_motif` in `pred` | Forward + reverse-complement scan. |
| `objective_success` | Depends on `row.metadata.edit_type` (see below). | The headline "did this edit do its job" flag. |
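A minimal sketch of how the first three flags could be computed. This is illustrative only; the authoritative logic lives in `scripts/eval_t3_oracle.py`, and the helper names and length-mismatch handling here are assumptions (the IUPAC table itself is the standard one):

```python
import re

# IUPAC ambiguity codes -> regex character classes (standard table).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
# Complement map that also handles ambiguity codes (R<->Y, K<->M, B<->V, D<->H).
COMP = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def iupac_regex(motif: str) -> str:
    return "".join(IUPAC[base] for base in motif)

def revcomp(seq: str) -> str:
    return seq.translate(COMP)[::-1]

def hamming(a: str, b: str) -> int:
    # Positionwise mismatches; counting a length mismatch as extra edits is
    # an assumption here -- length is also tracked via length_preserved.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def row_flags(pred: str, ref: str, edit_budget: int, target_motif: str) -> dict:
    motif_hit = (re.search(iupac_regex(target_motif), pred) is not None
                 or re.search(iupac_regex(revcomp(target_motif)), pred) is not None)
    return {
        "within_budget": hamming(pred, ref) <= edit_budget,
        "length_preserved": len(pred) == len(ref),
        "target_motif_present": motif_hit,  # forward + reverse-complement scan
    }
```

(`objective_success` is dispatched on `edit_type` and omitted here.)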

`objective_success` per `edit_type`:

| edit_type | objective_success true iff … |
| --- | --- |
| `activity_boost` | `pred_activity_src > ref_activity_src` (oracle activity in source cell type went up) |
| `cell_type_transfer` | `(pred_tgt - pred_src) - (ref_tgt - ref_src) > 0` (relative shift toward target cell type increased) |
| `promoter_retarget` | `target_motif_present` (the new TF motif landed in the sequence) |
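The dispatch in the table above, as a hedged sketch (the argument names are assumptions; the real scorer reads these values from the oracle outputs and row metadata):

```python
def objective_success(edit_type: str, pred_src: float, pred_tgt: float,
                      ref_src: float, ref_tgt: float, motif_present: bool) -> bool:
    # Mirrors the per-edit_type definitions in the table above.
    if edit_type == "activity_boost":
        return pred_src > ref_src
    if edit_type == "cell_type_transfer":
        return (pred_tgt - pred_src) - (ref_tgt - ref_src) > 0
    if edit_type == "promoter_retarget":
        return motif_present
    raise ValueError(f"unknown edit_type: {edit_type!r}")
```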

## Per-row continuous values

| Field | What it measures |
| --- | --- |
| `edit_distance` | Absolute Hamming distance between `pred` and `ref`, in bp. |
| `edit_distance_pct` | `edit_distance / len(ref)`: fraction of bases changed. |
| `pred_activity_src`, `pred_activity_tgt` | Oracle activity scores for `pred` in source / target cell type. |
| `ref_activity_src`, `ref_activity_tgt` | Oracle activity scores for `ref` in source / target cell type. |
| `activity_delta_src` | `pred_activity_src - ref_activity_src` (used for `activity_boost`). |
| `activity_relative_shift` | `(pred_tgt - pred_src) - (ref_tgt - ref_src)` (used for `cell_type_transfer`). |
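These continuous fields follow directly from the definitions; a sketch (treating a length difference as extra edit distance is an assumption of this sketch, not something the table specifies):

```python
def continuous_fields(pred: str, ref: str, pred_src: float, pred_tgt: float,
                      ref_src: float, ref_tgt: float) -> dict:
    # Hamming distance in bp, with any length difference counted as extra edits.
    edit_distance = (sum(a != b for a, b in zip(pred, ref))
                     + abs(len(pred) - len(ref)))
    return {
        "edit_distance": edit_distance,
        "edit_distance_pct": edit_distance / len(ref),
        "activity_delta_src": pred_src - ref_src,
        "activity_relative_shift": (pred_tgt - pred_src) - (ref_tgt - ref_src),
    }
```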

## Aggregate metrics (paper-table column candidates)

`mean_*` for each binary flag is the obvious "fraction of rows where the flag fired". The interesting non-trivial aggregates are:

### `in_budget_at_5pct` / `in_budget_at_10pct` / `in_budget_at_20pct`

This is the "percentage of edit distance" you asked about.

Definition: fraction of rows where `edit_distance` ≤ X% of `len(ref)`, ignoring the row's own `edit_budget`. This lets us compare across rows with different assigned budgets.

For a 500 bp reference enhancer:

| Threshold | Bp budget | Interpretation |
| --- | --- | --- |
| `in_budget_at_5pct` | ≤ 25 bp | Minimal, near-surgical edit. |
| `in_budget_at_10pct` | ≤ 50 bp | Moderate edit. |
| `in_budget_at_20pct` | ≤ 100 bp | Substantial edit (model is rewriting). |

So `in_budget_at_5pct = 0.85` means 85% of the model's edits change ≤ 5% of the sequence, i.e. the model is making minimal, focused changes.
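The aggregate itself is a one-liner over the per-row values; a sketch assuming each row carries its `edit_distance` and reference length (field names here are illustrative):

```python
def in_budget_at_pct(rows: list, pct: float) -> float:
    # Fraction of rows whose edit_distance is <= pct% of the reference length,
    # deliberately ignoring each row's own edit_budget.
    hits = [row["edit_distance"] <= (pct / 100) * row["ref_len"] for row in rows]
    return sum(hits) / len(hits)
```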

Why three thresholds? Different downstream applications care about different "small": a SNP-style retarget is OK with 5%, a CRE-style rewrite might allow 20%. Reporting all three lets reviewers pick the threshold that matches their bias. Paper precedent: Lin et al., NeurIPS 2024.

`within_budget` (no `_at_pct` suffix) is distinct: it uses the row's own assigned `edit_budget` (typically 10 bp absolute). `within_budget` and `in_budget_at_5pct` can disagree:

- `within_budget=False`, `in_budget_at_5pct=True`: the edit was 12 bp; the row's budget was 10 (fail), but 5% of 500 = 25 bp (pass).
- `within_budget=True`, `in_budget_at_5pct=False`: only happens when the row's budget exceeds 5% of `len(ref)`; rare for our prod data.

### `kmer6_diversity`

Fraction of unique 6-mers across the cohort's predicted sequences (pooled across rows). Catches the "model collapsed to one motif" failure mode.
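One plausible reading of this definition is unique 6-mers over total 6-mer windows, pooled across all predictions; the exact denominator used by the scorer may differ, so treat this as a sketch:

```python
def kmer_diversity(seqs: list, k: int = 6) -> float:
    # Unique k-mers observed / total k-mer windows across all sequences.
    seen = set()
    total = 0
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
            total += 1
    return len(seen) / total if total else 0.0
```

A model that collapsed to repeating one motif scores near 0; fully diverse output scores near 1.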

### `transfer_specificity` (cell_type_transfer rows only)

Fraction of `cell_type_transfer` rows where the prediction is both:

- more active in target than in source, and
- more active in target than the reference was in target.

Both required because activating the target alone could still fail the "transfer" intent (the original could already have been more active in target).
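Both conditions from the bullets above, in one hedged sketch (field names match the per-row continuous values; the function itself is illustrative):

```python
def transfer_specificity(rows: list):
    xfer = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    if not xfer:
        return None
    ok = [r["pred_activity_tgt"] > r["pred_activity_src"]      # shifted toward target
          and r["pred_activity_tgt"] > r["ref_activity_tgt"]   # beats the reference there
          for r in xfer]
    return sum(ok) / len(ok)
```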

### `mean_target_motif_pwm_present`, `pwm_n_observed`

Optional supplementary check using a real PWM scan against `--meme-file`. Falls back to `None` when the MEME database isn't on disk. Confirms the IUPAC regex match isn't a false positive on a chance A/C/G/T match.

## Per-cell breakdown

`per_cell_type` repeats the aggregates above with rows bucketed by `row.metadata.cell_type` (Ex / In / OPC / Ast / Oli / Mic / End). Lets us see whether the model is uniformly OK across cell types or biased toward the over-represented Ex cell type.
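The bucketing is straightforward; a sketch assuming each row dict exposes its `cell_type` and the binary flag of interest:

```python
from collections import defaultdict

def per_cell_mean(rows: list, flag: str = "objective_success") -> dict:
    # Group rows by cell type, then take the mean of the flag per bucket.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["cell_type"]].append(bool(row[flag]))
    return {cell: sum(vals) / len(vals) for cell, vals in buckets.items()}
```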

## RFT-specific multi-turn metadata

`scripts/rft_t3.py` (with `--rounds R > 1`) adds these fields to each output row's metadata:

| Field | Meaning |
| --- | --- |
| `rft_rounds_used` | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
| `rft_total_candidates` | Total candidates the model produced for this row across all rounds. |
| `rft_winner_round` | Round index (0-based) that produced the chosen candidate. |
| `rft_winner_margin` | The objective margin of the chosen candidate (per edit_type: `activity_delta_src` for boost, `activity_relative_shift` for transfer, 1.0 for retarget). |
| `rft_winner_edit_distance` | Hamming distance of the chosen candidate to `ref` (bp). |
| `rft_winner_edit_distance_pct` | The same, as a fraction of `len(ref)`. |
| `rft_source` | `"candidate"` if the oracle picked a winner; `"heuristic_fallback"` if no candidate satisfied all constraints and we kept the heuristic gold. |

Read these to track:

- keep-rate: fraction of rows with `rft_source == "candidate"`
- mean rounds-to-success: `mean(rft_rounds_used | rft_source == "candidate")`
- margin distribution: histogram of `rft_winner_margin`
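The three summaries can be read off the per-row metadata dicts like so (the function and its return keys are illustrative, not part of the scripts):

```python
def rft_summary(metadatas: list) -> dict:
    # metadatas: one dict per row, carrying the rft_* fields above.
    kept = [m for m in metadatas if m.get("rft_source") == "candidate"]
    return {
        "keep_rate": len(kept) / len(metadatas),
        "mean_rounds_to_success": (sum(m["rft_rounds_used"] for m in kept) / len(kept)
                                   if kept else None),
        "winner_margins": [m.get("rft_winner_margin") for m in kept],
    }
```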

## Where to look

| File | What's in it |
| --- | --- |
| `runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json` | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
| `runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json` | Same metrics for the zero-shot LLM baseline. |
| `runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl` | Post-RFT training JSONL. Inspect `metadata.rft_*` fields per row to see how many rounds each row needed and the winner margins. |