explcre committed 900229b (verified · 1 parent: c03c32a)

Upload results/zeroshot_results_table.md with huggingface_hub

# Raw zero-shot results — per-cell-type table (2026-04-27, 06:00 UTC)

All numbers are from H100 full-test benches. Lab smoke runs (n=64, Ex-only) are listed at the bottom for cross-reference.

## T1 — enhancer_generation (full 372,210; 7-cell)

### zs_raw — basic metrics

| Cell | n | parse | gc_err | length_ratio |
|---|---:|---:|---:|---:|
| Ex | 86,088 | 0.999 | **0.115** | **1.627** |
| Mic | 74,828 | 1.000 | 0.113 | 1.641 |
| Oli | 63,278 | 1.000 | 0.119 | 1.644 |
| In | 50,872 | 0.999 | 0.116 | 1.629 |
| Ast | 48,623 | 1.000 | 0.116 | 1.638 |
| OPC | 40,162 | 1.000 | 0.115 | 1.643 |
| End | 8,359 | 1.000 | 0.118 | 1.651 |
| **AGG** | **372,210** | **0.9996** | **0.116** | **1.637** |

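A sketch of the two basic metrics above, under the assumption (not stated in this doc) that `gc_err` is the absolute GC-fraction gap to the gold sequence and `length_ratio` is predicted length over gold length:

```python
# Illustrative definitions only — the bench's actual metric code is not shown here.

def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(b in "GC" for b in seq.upper()) / len(seq)

def basic_metrics(pred: str, gold: str) -> tuple[float, float]:
    """Assumed forms: gc_err = |GC(pred) - GC(gold)|, length_ratio = len(pred)/len(gold)."""
    gc_err = abs(gc_fraction(pred) - gc_fraction(gold))
    length_ratio = len(pred) / len(gold)
    return gc_err, length_ratio

# Toy sequences: a 10-base prediction against a 6-base gold gives
# length_ratio = 10/6 ≈ 1.67, the over-length regime seen in the table.
gc_err, length_ratio = basic_metrics("GGCCGGCCAT", "ATATGC")
```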
### zs_raw — ORACLE metrics (DeepSTARR-7cell, FBD-style)

| Cell | n | FID ↓ | argmax_acc | specificity | on-tgt | off-tgt | div_edit |
|---|---:|---:|---:|---:|---:|---:|---:|
| **Ex** | 86,013 | 22.08 | **0.404** | **2.86** | 3.91 | 1.05 | 0.580 |
| **Oli** | 63,252 | 3.19 | 0.333 | 2.22 | 3.33 | 1.11 | 0.585 |
| Mic | 74,819 | **116.04 ⚠** | 0.172 | 2.28 | 3.38 | 1.10 | 0.580 |
| Ast | 48,623 | 4.01 | 0.146 | 1.85 | 2.98 | 1.14 | 0.581 |
| In | 50,831 | 2.43 | 0.000 | 0.75 | 1.95 | 1.20 | 0.587 |
| OPC | 40,162 | **0.97 ✅** | 0.000 | 0.47 | 1.70 | 1.23 | 0.579 |
| End | 8,358 | 2.17 | 0.000 | **−1.00 ⚠** | 0.34 | 1.34 | 0.583 |
| **AGG** | **372,058** | **15.46** | **0.204** | **1.87** | 3.00 | 1.13 | 0.580 |

Reading: zero-shot Qwen3.5-2B nails Ex / Oli (the dominant cell types in train) but collapses to a generic enhancer that off-targets everywhere else (specificity ≈ 0 or negative for End / OPC / In). Mic's FID of 116 means the output collapsed to a small set of templates.

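The `specificity` column is consistent with a simple on-target-minus-off-target difference over the oracle's predicted activities; a minimal sketch (function name illustrative, not the bench's API):

```python
# specificity = mean on-target oracle activity minus mean off-target activity.
# This is inferred from the table's arithmetic, not from the bench source.

def specificity(on_target: float, off_target_mean: float) -> float:
    """Predicted activity in the prompted cell type minus the mean
    predicted activity across the other cell types."""
    return on_target - off_target_mean

# Reproducing two rows of the zs_raw oracle table:
assert round(specificity(3.91, 1.05), 2) == 2.86   # Ex
assert round(specificity(0.34, 1.34), 2) == -1.0   # End: negative means the
# generated sequences are MORE active off-target than in End itself.
```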
### zs_enriched — basic metrics

| Cell | n | parse | gc_err | length_ratio |
|---|---:|---:|---:|---:|
| Ex | 86,088 | 0.999 | 0.128 | 1.670 |
| Mic | 74,828 | 1.000 | 0.123 | 1.686 |
| Oli | 63,278 | 1.000 | 0.124 | 1.682 |
| In | 50,872 | 1.000 | 0.128 | 1.647 |
| Ast | 48,623 | 1.000 | 0.125 | 1.658 |
| OPC | 40,162 | 1.000 | 0.122 | 1.657 |
| End | 8,359 | 1.000 | **0.137 ⚠** | 1.700 |
| **AGG** | **372,210** | **0.9997** | **0.126** | **1.670** |

zs_enriched has **higher gc_err and longer length_ratio than zs_raw** for every cell type (aggregate gc_err 0.126 vs 0.116; length_ratio 1.67 vs 1.64). The tool-enriched prompt confuses the small zero-shot model rather than helping it.

ORACLE metrics for zs_enriched: pending (reaper still scoring).

## T2 — pair_prediction (full 744,420; 7-cell)

### zs_raw

| Cell | n | acc | F1 | precision | recall |
|---|---:|---:|---:|---:|---:|
| Ex | 172,176 | 0.500 | 0.0001 | 0.667 | 5e-05 |
| Mic | 149,656 | 0.500 | 0.000 | 0.500 | 1e-05 |
| Oli | 126,556 | 0.500 | 0.0001 | 0.800 | 6e-05 |
| In | 101,744 | 0.500 | 0.0001 | 0.750 | 6e-05 |
| Ast | 97,246 | 0.500 | 0.000 | 0.000 | 0.000 |
| OPC | 80,324 | 0.500 | 0.000 | 1.000 | 2e-05 |
| End | 16,718 | 0.500 | 0.000 | 0.000 | 0.000 |
| **AGG** | **744,420** | **0.500** | **0.0001** | **0.65** | **3.5e-05** |

### zs_enriched

| Cell | n | acc | F1 | precision | recall |
|---|---:|---:|---:|---:|---:|
| **Ast** | 97,246 | 0.500 | **0.0041** | 0.510 | 0.0021 |
| Ex | 172,176 | 0.500 | 0.0030 | 0.635 | 0.0015 |
| In | 101,744 | 0.500 | 0.0022 | 0.538 | 0.0011 |
| End | 16,718 | 0.500 | 0.0021 | 0.562 | 0.0011 |
| Oli | 126,556 | 0.500 | 0.0011 | 0.680 | 0.0005 |
| Mic | 149,656 | 0.500 | 0.0009 | 0.552 | 0.0004 |
| OPC | 80,324 | 0.500 | 0.0004 | 0.643 | 0.0002 |
| **AGG** | **744,420** | **0.500** | **0.002** | **0.575** | **0.001** |

Reading: zero-shot Qwen3.5-2B is **degenerate on T2**: it almost always predicts `not_paired`, so recall ≈ 0.001 even with tool-enriched prompts. The tool-enriched prompt gives a ~20× lift in F1 (0.0001 → 0.002) but still essentially no signal. The bottleneck (per `lab_message_2026_04_27_v2.md` §2): T2 enriched only scans the **promoter** for TFBS; the enhancer side gets only GC%. The lab is regenerating with a both-sides scan on galaxy.

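The shape of these numbers is exactly what an almost-always-negative classifier produces on a balanced paired/not-paired set: accuracy pinned at 0.500, precision computable from the handful of positive calls, recall (and hence F1) near zero. A sketch with hypothetical counts (these are not the actual bench confusion-matrix values):

```python
# Degenerate-classifier arithmetic on a balanced 200k-example set.
# Counts below are illustrative of the zs_raw regime, not real bench output.

def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 100,000 positives, 100,000 negatives; the model flags only 4 pairs, 2 correctly.
tp, fp, fn = 2, 2, 99_998
p, r, f1 = prf(tp, fp, fn)
acc = (tp + (100_000 - fp)) / 200_000  # true positives + true negatives
```

Predicting `not_paired` for nearly everything leaves accuracy at exactly 0.5 while F1 sits in the 1e-5 range, matching the zs_raw rows.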
## T3 — enhancer_editing

T3 zs_raw bench is **still in flight**: the per-task progress bar reads 372k/372k (the eval-set total), but the final flush of results is still in progress. The bench process is at 02:26 elapsed. T3 zs_enriched is queued.

When metrics land they'll be:

* `metrics.json` (basic, sequence-overlap vs heuristic gold — INFORMATIONAL, see `t3_evaluation_design.md` §1)
* `genqual.json` (T1-style FBD/spec/argmax — INFORMATIONAL for T3)
* **`genqual_t3_oracle.json`** (HEADLINE: within_budget, in_budget_at_5pct, mean_objective_success, transfer_specificity)

The reaper auto-scores all three within 30 s of `predictions.jsonl` landing.

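A minimal sketch of a reaper-style watcher, assuming the real reaper is a polling loop that fires scorers once the predictions file lands (its actual implementation is not shown in this doc; names here are illustrative):

```python
# Hypothetical watch-then-score loop; not the lab's reaper code.
import time
from pathlib import Path

def wait_and_score(pred_path: Path, scorers, poll_s: float = 5.0,
                   timeout_s: float = 300.0):
    """Poll until pred_path exists, then run each scorer on it."""
    deadline = time.monotonic() + timeout_s
    while not pred_path.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"{pred_path} never landed")
        time.sleep(poll_s)
    return [score(pred_path) for score in scorers]
```

In the real setup each scorer would write its own output (`metrics.json`, `genqual.json`, `genqual_t3_oracle.json`) next to the predictions file.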
## Cross-reference: lab smoke results (older, smaller-N)

| Experiment | n | Outcome |
|---|---:|---|
| T1 zs_raw lab smoke (Ex only) | 64 | parse=1.0, gc_err=0.093, len_ratio=1.83 — sane |
| T1 zs_enriched lab smoke (Ex) | 64 | parse=1.0, gc_err=0.096, len_ratio=1.62 |
| T1 lora_raw lab smoke | 64 | **COLLAPSED**: output `CTGCTGCTG…` × 1790 chars, len_ratio=3.64 |
| T1 lora_enriched lab smoke | 64 | **COLLAPSED**: same `CTGCTG…` mode, len_ratio=3.90 |
| T2 asym pair NTv3+NT-v2, aux=none | 128 (Ex) | acc=0.773, F1=0.808, P=0.701, R=0.953 |
| T2 asym pair, aux=supcon_pair | 128 (Ex) | acc=0.719, F1=0.710, P=0.733, R=0.688 |
| T2 asym pair, aux=tier_aware_supcon | 128 (Ex) | acc=0.711, F1=0.776, P=0.634, **R=1.000** |

The lab T1 LoRA collapse is unusable — it needs a re-train with the sanitiser fix (`bda9ee0`). The lab T2 asym-pair n=128 smokes prove the architecture works (F1≈0.81 with no aux loss); the full 744k re-bench is gated on the T2 enhancer-scan regen.

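A collapse like the `CTGCTGCTG…` LoRA failure above can be flagged automatically by checking whether a single short k-mer dominates the output; a sketch (thresholds illustrative, not the lab's actual sanitiser):

```python
# Repeat-collapse detector: does one k-mer account for most k-mer positions?
from collections import Counter

def dominant_kmer_fraction(seq: str, k: int = 3) -> float:
    """Fraction of all k-mer windows taken by the single most common k-mer."""
    if len(seq) < k:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return max(counts.values()) / (len(seq) - k + 1)

collapsed = "CTG" * 100            # the observed failure mode
healthy = "ACGTAGCTTAGCGATCGTAC"   # illustrative non-repetitive sequence
assert dominant_kmer_fraction(collapsed) > 0.3
assert dominant_kmer_fraction(healthy) < 0.3
```

For a pure `CTG` repeat, one k-mer fills a third of all 3-mer windows, far above anything a healthy enhancer sequence produces.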
## Headline takeaways for the paper draft

1. **Zero-shot Qwen3.5-2B is degenerate on T2** (F1 ≈ 0); cell-type-aware fusion-SFT is the real T2 paper claim. Expected lift, based on the n=128 lab smokes: F1 → 0.7–0.8.

2. **Zero-shot T1 shows cell-type collapse** (Ex/Oli good, End/OPC/In bad). Specificity of −1.0 at End means the model's output is MORE active in non-target cells than in End itself. This is exactly what cell-context conditioning + fusion-SFT should fix.

3. **Tool-enrichment hurts zero-shot T1 but helps zero-shot T2** (a ~20× F1 lift). Pattern: small models can't filter a rich tool_context into a useful signal for generation, but they CAN use it for binary classification. After fine-tuning we expect the gap to invert.

4. **T1 FID dispersion is huge across cells** (OPC=0.97 best, Mic=116 worst). Per-cell FID is the oracle-grounded number we should report alongside the aggregate, and the ablation to reproduce in fusion-SFT.

5. **The ~1.64 length_ratio over-length is consistent across cells**: the model wants to output ~64% more bases than the gold reference. Unified+ntp / unified+mdlm / diffusion modes should fix this (DNA has a hard length budget via `dna_target_length`).

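For takeaway 4: FID/FBD-style scores are Fréchet distances between Gaussian fits of oracle embeddings for generated vs. reference sequences (the doc names DeepSTARR-7cell as the oracle; the computation below is a generic sketch, not the bench's code). For multivariate Gaussians, d² = ‖μ₁−μ₂‖² + Tr(Σ₁+Σ₂−2(Σ₁Σ₂)^½); in one dimension this reduces to:

```python
# Univariate Fréchet distance between N(mu1, s1^2) and N(mu2, s2^2).
# Illustrative only — the bench computes this over oracle embedding
# distributions, which this sketch does not model.

def frechet_1d(mu1: float, s1: float, mu2: float, s2: float) -> float:
    return (mu1 - mu2) ** 2 + (s1 - s2) ** 2

# Identical distributions score 0 (the direction OPC's 0.97 approaches);
# a tight cluster shifted away from the reference scores high, which is
# the Mic=116 template-collapse signature.
assert frechet_1d(0.0, 1.0, 0.0, 1.0) == 0.0
assert frechet_1d(0.0, 1.0, 10.0, 0.1) > 100.0
```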
## What lands next (auto)

* T1 zs_enriched genqual.json (reaper, ~25 min on CPU; faster after vLLM frees the GPU)
* T3 zs_raw metrics.json + predictions.jsonl when the bench writer flushes
* T3 zs_raw genqual.json + genqual_t3_oracle.json (reaper, ~30 min after preds land)
* Then T3 zs_enriched starts
* When the bench grid exits → post_bench_pipeline.sh fires → Stages 1–7