Upload results/zeroshot_results_table.md with huggingface_hub
# Raw zero-shot results — per-cell-type table (2026-04-27, 06:00 UTC)

All numbers are from H100 full-test benches. Lab smoke runs (n=64, Ex-only) are listed at the bottom for cross-reference.

## T1 — enhancer_generation (full 372,210; 7-cell)

### zs_raw — basic metrics

| Cell | n | parse | gc_err | length_ratio |
|---|---:|---:|---:|---:|
| Ex | 86,088 | 0.999 | **0.115** | **1.627** |
| Mic | 74,828 | 1.000 | 0.113 | 1.641 |
| Oli | 63,278 | 1.000 | 0.119 | 1.644 |
| In | 50,872 | 0.999 | 0.116 | 1.629 |
| Ast | 48,623 | 1.000 | 0.116 | 1.638 |
| OPC | 40,162 | 1.000 | 0.115 | 1.643 |
| End | 8,359 | 1.000 | 0.118 | 1.651 |
| **AGG** | **372,210** | **0.9996** | **0.116** | **1.637** |

### zs_raw — ORACLE metrics (DeepSTARR-7cell, FBD-style)

| Cell | n | FID ↓ | argmax_acc | specificity | on-tgt | off-tgt | div_edit |
|---|---:|---:|---:|---:|---:|---:|---:|
| **Ex** | 86,013 | 22.08 | **0.404** | **2.86** | 3.91 | 1.05 | 0.580 |
| **Oli** | 63,252 | 3.19 | 0.333 | 2.22 | 3.33 | 1.11 | 0.585 |
| Mic | 74,819 | **116.04 ⚠** | 0.172 | 2.28 | 3.38 | 1.10 | 0.580 |
| Ast | 48,623 | 4.01 | 0.146 | 1.85 | 2.98 | 1.14 | 0.581 |
| In | 50,831 | 2.43 | 0.000 | 0.75 | 1.95 | 1.20 | 0.587 |
| OPC | 40,162 | **0.97 ✅** | 0.000 | 0.47 | 1.70 | 1.23 | 0.579 |
| End | 8,358 | 2.17 | 0.000 | **−1.00 ⚠** | 0.34 | 1.34 | 0.583 |
| **AGG** | **372,058** | **15.46** | **0.204** | **1.87** | 3.00 | 1.13 | 0.580 |

Reading: zero-shot Qwen3.5-2B nails Ex / Oli (the dominant cell types in train) but collapses to a generic enhancer that off-targets everywhere (specificity ≈ 0 or negative for End / OPC / In). Mic's FID of 116 means the output collapsed to a small set of templates.

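For readers unfamiliar with the specificity column: in every row it equals on-tgt minus off-tgt, so a minimal sketch of the assumed definition (not the bench code itself; cell ordering and activity values below are illustrative) is:

```python
# Assumed definition: specificity = on-target activity minus the mean
# off-target activity, from the DeepSTARR-7cell oracle's per-cell scores.
def specificity(per_cell_activity: list[float], target_idx: int) -> float:
    """On-target predicted activity minus mean of the other cells'."""
    on = per_cell_activity[target_idx]
    off = [a for i, a in enumerate(per_cell_activity) if i != target_idx]
    return on - sum(off) / len(off)

# End row of the table: on-tgt 0.34, off-tgt mean 1.34 -> specificity -1.00,
# i.e. the output is MORE active off-target than in the target cell.
acts = [0.34, 1.2, 1.5, 1.1, 1.6, 1.3, 1.34]  # hypothetical per-cell scores
print(round(specificity(acts, 0), 2))  # -1.0
```

Under this reading, a specificity near zero means the generated enhancer is equally active everywhere, which is exactly the generic-enhancer collapse described above.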
### zs_enriched — basic metrics

| Cell | n | parse | gc_err | length_ratio |
|---|---:|---:|---:|---:|
| Ex | 86,088 | 0.999 | 0.128 | 1.670 |
| Mic | 74,828 | 1.000 | 0.123 | 1.686 |
| Oli | 63,278 | 1.000 | 0.124 | 1.682 |
| In | 50,872 | 1.000 | 0.128 | 1.647 |
| Ast | 48,623 | 1.000 | 0.125 | 1.658 |
| OPC | 40,162 | 1.000 | 0.122 | 1.657 |
| End | 8,359 | 1.000 | **0.137 ⚠** | 1.700 |
| **AGG** | **372,210** | **0.9997** | **0.126** | **1.670** |

zs_enriched has **higher gc_err and longer length_ratio than zs_raw** in every cell type (GC 0.126 vs 0.116; length 1.67 vs 1.64). The tool-enriched prompt confuses the small zero-shot model rather than helping it.

ORACLE metrics for zs_enriched: pending (reaper still scoring).

## T2 — pair_prediction (full 744,420; 7-cell)

### zs_raw

| Cell | n | acc | F1 | precision | recall |
|---|---:|---:|---:|---:|---:|
| Ex | 172,176 | 0.500 | 0.0001 | 0.667 | 5e-05 |
| Mic | 149,656 | 0.500 | 0.000 | 0.500 | 1e-05 |
| Oli | 126,556 | 0.500 | 0.0001 | 0.800 | 6e-05 |
| In | 101,744 | 0.500 | 0.0001 | 0.750 | 6e-05 |
| Ast | 97,246 | 0.500 | 0.000 | 0.000 | 0.000 |
| OPC | 80,324 | 0.500 | 0.000 | 1.000 | 2e-05 |
| End | 16,718 | 0.500 | 0.000 | 0.000 | 0.000 |
| **AGG** | **744,420** | **0.500** | **0.0001** | **0.65** | **3.5e-05** |

### zs_enriched

| Cell | n | acc | F1 | precision | recall |
|---|---:|---:|---:|---:|---:|
| **Ast** | 97,246 | 0.500 | **0.0041** | 0.510 | 0.0021 |
| Ex | 172,176 | 0.500 | 0.0030 | 0.635 | 0.0015 |
| In | 101,744 | 0.500 | 0.0022 | 0.538 | 0.0011 |
| End | 16,718 | 0.500 | 0.0021 | 0.562 | 0.0011 |
| Oli | 126,556 | 0.500 | 0.0011 | 0.680 | 0.0005 |
| Mic | 149,656 | 0.500 | 0.0009 | 0.552 | 0.0004 |
| OPC | 80,324 | 0.500 | 0.0004 | 0.643 | 0.0002 |
| **AGG** | **744,420** | **0.500** | **0.002** | **0.575** | **0.001** |

Reading: zero-shot Qwen3.5-2B is **degenerate on T2**: it almost always predicts `not_paired`, so recall ≈ 0.001 even with tool_enriched prompts. The tool_enriched prompt gives a 20× lift (F1 0.0001 → 0.002), but there is still essentially no signal. The bottleneck (per `lab_message_2026_04_27_v2.md` §2): the T2 enriched pipeline only scans the **promoter** for TFBS; the enhancer side gets only GC%. The lab is regenerating with a both-sides scan on galaxy.

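The metric pattern above (acc pinned at 0.5, high precision, recall ≈ 0) is the signature of a near-constant `not_paired` predictor on a balanced pair set. A minimal sketch with synthetic labels (not the bench data) shows how the numbers arise:

```python
# Synthetic illustration: a model that almost always says "not_paired"
# (label 0) on a balanced set lands at acc=0.5 with recall and F1 ~ 0,
# while precision on its rare positive calls can still look decent.
def prf(y_true, y_pred, pos=1):
    tp = sum(t == pos and p == pos for t, p in zip(y_true, y_pred))
    fp = sum(t != pos and p == pos for t, p in zip(y_true, y_pred))
    fn = sum(t == pos and p != pos for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# Balanced set: 10,000 paired + 10,000 not_paired; the model emits
# "paired" only twice, once correctly and once not.
y_true = [1] * 10_000 + [0] * 10_000
y_pred = [0] * 20_000
y_pred[0] = 1       # one true positive
y_pred[10_000] = 1  # one false positive
print(prf(y_true, y_pred))  # acc=0.5, precision=0.5, recall=1e-4, F1~2e-4
```

This is why accuracy is uninformative here and the 20× F1 lift from tool enrichment, while real, still leaves essentially no usable signal.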
## T3 — enhancer_editing

The T3 zs_raw bench is **still in flight**: the per-task progress bar at 372k/372k reflects the eval-set total, but the final flush is still in progress. The bench process is at 02:26 elapsed. T3 zs_enriched is queued.

When metrics land they'll be:

* `metrics.json` (basic, sequence-overlap vs heuristic gold — INFORMATIONAL, see `t3_evaluation_design.md` §1)
* `genqual.json` (T1-style FBD/spec/argmax — INFORMATIONAL for T3)
* **`genqual_t3_oracle.json`** (HEADLINE: within_budget, in_budget_at_5pct, mean_objective_success, transfer_specificity)

The reaper auto-scores all three within 30 s of `predictions.jsonl` landing.

## Cross-reference: lab smoke results (older, smaller-N)

| Experiment | n | Outcome |
|---|---:|---|
| T1 zs_raw lab smoke (Ex only) | 64 | parse=1.0, gc_err=0.093, len_ratio=1.83 — sane |
| T1 zs_enriched lab smoke (Ex) | 64 | parse=1.0, gc_err=0.096, len_ratio=1.62 |
| T1 lora_raw lab smoke | 64 | **COLLAPSED**: output `CTGCTGCTG…` × 1790 chars, len_ratio=3.64 |
| T1 lora_enriched lab smoke | 64 | **COLLAPSED**: same `CTGCTG…` mode, len_ratio=3.90 |
| T2 asym pair NTv3+NT-v2, aux=none | 128 (Ex) | acc=0.773, F1=0.808, P=0.701, R=0.953 |
| T2 asym pair, aux=supcon_pair | 128 (Ex) | acc=0.719, F1=0.710, P=0.733, R=0.688 |
| T2 asym pair, aux=tier_aware_supcon | 128 (Ex) | acc=0.711, F1=0.776, P=0.634, **R=1.000** |

The lab T1 LoRA collapse is unusable — it needs a re-train with the sanitiser fix (`bda9ee0`). The lab T2 asym-pair n=128 smokes prove the architecture works (F1 ≈ 0.81 with no aux loss); the full 744k re-bench is gated on the T2 enhancer-scan regen.

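The `CTGCTG…` collapse mode is easy to flag automatically. A hypothetical helper (not part of the bench code) that catches tandem-repeat collapse by counting distinct k-mers relative to sequence length:

```python
# Hypothetical collapse detector: a sequence stuck in a short tandem
# repeat (e.g. CTGCTGCTG...) uses only a handful of distinct k-mers,
# while a normal enhancer-length sequence uses close to one per position.
def is_repeat_collapsed(seq: str, k: int = 6, max_distinct_frac: float = 0.1) -> bool:
    """True if the fraction of distinct k-mers over k-mer positions is tiny."""
    n = len(seq) - k + 1
    if n <= 0:
        return False
    distinct = len({seq[i:i + k] for i in range(n)})
    return distinct / n < max_distinct_frac

print(is_repeat_collapsed("CTG" * 300))         # True: only 3 distinct 6-mers
print(is_repeat_collapsed("ACGTTGCAGGTCCATA"))  # False: all 6-mers distinct
```

A check like this could gate the LoRA re-train smoke so a collapsed run fails fast instead of burning a full bench.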
## Headline takeaways for the paper draft

1. **Zero-shot Qwen3.5-2B is degenerate on T2** (F1 ≈ 0); cell-type-aware fusion-SFT is the real T2 paper claim. Expected lift from the n=128 lab smokes: F1 → 0.7–0.8.

2. **Zero-shot T1 shows cell-type collapse** (Ex/Oli good, End/OPC/In bad). Specificity of −1.0 at End means the model produces output that is MORE active in non-target cells than in End. This is exactly what cell-context conditioning + fusion-SFT should fix.

3. **Tool-enrichment hurts zero-shot T1 but helps zero-shot T2** by 20×. The pattern: small models can't filter rich tool_context into a useful signal for generation, but they CAN use it for binary classification. After fine-tuning we expect the gap to invert.

4. **T1 FID dispersion is huge across cells** (OPC=0.97 best, Mic=116 worst). For the fusion-SFT ablations, per-cell FID is the oracle-grounded number to report alongside the aggregate.

5. **The ~1.64 over-length is consistent across cells** — the model wants to output ~64% more bases than the gold reference. Unified+ntp / unified+mdlm / diffusion modes should fix this (DNA has a hard length budget via `dna_target_length`).

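For reference, the two basic metrics read as standard ratio/error definitions. A minimal sketch under that assumption (these are my assumed definitions, not the bench implementation):

```python
# Assumed metric definitions: length_ratio = predicted length over gold
# length; gc_err = absolute error in GC fraction vs the gold sequence.
def gc_frac(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)

def basic_metrics(pred: str, gold: str) -> dict:
    return {
        "length_ratio": len(pred) / len(gold),
        "gc_err": abs(gc_frac(pred) - gc_frac(gold)),
    }

# Illustrative pair: a 164-base prediction against a 100-base gold
# reference reproduces the ~1.64 over-length seen across cells.
m = basic_metrics("ACGT" * 41, "ACGT" * 25)
print(m)  # {'length_ratio': 1.64, 'gc_err': 0.0}
```

Under these definitions the 1.64 length_ratio is a pure over-generation problem, which is why a hard length budget (`dna_target_length`) should remove it outright.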
## What lands next (auto)

* T1 zs_enriched genqual.json (reaper, ~25 min on CPU; faster after vLLM frees the GPU)
* T3 zs_raw metrics.json + predictions.jsonl when the bench writer flushes
* T3 zs_raw genqual.json + genqual_t3_oracle.json (reaper, ~30 min after preds land)
* Then T3 zs_enriched starts
* When the bench grid exits → post_bench_pipeline.sh fires → Stages 1–7