Upload results/zeroshot_results_table.md with huggingface_hub
# Raw zero-shot results — per-cell-type table (2026-04-27, 06:00 UTC)

All numbers are from H100 full-test benches. Lab smoke runs (n=64, Ex-only) are listed at the bottom for cross-reference.

## T1 — enhancer_generation (full 372,210; 7-cell)

### zs_raw — basic metrics

| Cell | n | parse | gc_err | length_ratio |
|---|---:|---:|---:|---:|
| Ex | 86,088 | 0.999 | **0.115** | **1.627** |
| Mic | 74,828 | 1.000 | 0.113 | 1.641 |
| Oli | 63,278 | 1.000 | 0.119 | 1.644 |
| In | 50,872 | 0.999 | 0.116 | 1.629 |
| Ast | 48,623 | 1.000 | 0.116 | 1.638 |
| OPC | 40,162 | 1.000 | 0.115 | 1.643 |
| End | 8,359 | 1.000 | 0.118 | 1.651 |
| **AGG** | **372,210** | **0.9996** | **0.116** | **1.637** |

### zs_raw — ORACLE metrics (DeepSTARR-7cell, FBD-style)

| Cell | n | FID ↓ | argmax_acc | specificity | on-tgt | off-tgt | div_edit |
|---|---:|---:|---:|---:|---:|---:|---:|
| **Ex** | 86,013 | 22.08 | **0.404** | **2.86** | 3.91 | 1.05 | 0.580 |
| **Oli** | 63,252 | 3.19 | 0.333 | 2.22 | 3.33 | 1.11 | 0.585 |
| Mic | 74,819 | **116.04 ⚠** | 0.172 | 2.28 | 3.38 | 1.10 | 0.580 |
| Ast | 48,623 | 4.01 | 0.146 | 1.85 | 2.98 | 1.14 | 0.581 |
| In | 50,831 | 2.43 | 0.000 | 0.75 | 1.95 | 1.20 | 0.587 |
| OPC | 40,162 | **0.97 ✅** | 0.000 | 0.47 | 1.70 | 1.23 | 0.579 |
| End | 8,358 | 2.17 | 0.000 | **−1.00 ⚠** | 0.34 | 1.34 | 0.583 |
| **AGG** | **372,058** | **15.46** | **0.204** | **1.87** | 3.00 | 1.13 | 0.580 |

Reading: zero-shot Qwen3.5-2B nails Ex / Oli (the dominant cell types in train) but collapses to a generic enhancer that off-targets everywhere (specificity ≈ 0 or negative for End / OPC / In). Mic's FID of 116 means the output collapsed to a small set of templates.

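For readers unfamiliar with the specificity column: in every row it equals on-tgt minus off-tgt, so a minimal sketch of the assumed definition (not the bench code itself; cell ordering and activity values below are illustrative) is:

```python
# Assumed definition: specificity = on-target activity minus the mean
# off-target activity, from the DeepSTARR-7cell oracle's per-cell scores.
def specificity(per_cell_activity: list[float], target_idx: int) -> float:
    """On-target predicted activity minus mean of the other cells'."""
    on = per_cell_activity[target_idx]
    off = [a for i, a in enumerate(per_cell_activity) if i != target_idx]
    return on - sum(off) / len(off)

# End row of the table: on-tgt 0.34, off-tgt mean 1.34 -> specificity -1.00,
# i.e. the output is MORE active off-target than in the target cell.
acts = [0.34, 1.2, 1.5, 1.1, 1.6, 1.3, 1.34]  # hypothetical per-cell scores
print(round(specificity(acts, 0), 2))  # -1.0
```

Under this reading, a specificity near zero means the generated enhancer is equally active everywhere, which is exactly the generic-enhancer collapse described above.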
### zs_enriched — basic metrics

| Cell | n | parse | gc_err | length_ratio |
|---|---:|---:|---:|---:|
| Ex | 86,088 | 0.999 | 0.128 | 1.670 |
| Mic | 74,828 | 1.000 | 0.123 | 1.686 |
| Oli | 63,278 | 1.000 | 0.124 | 1.682 |
| In | 50,872 | 1.000 | 0.128 | 1.647 |
| Ast | 48,623 | 1.000 | 0.125 | 1.658 |
| OPC | 40,162 | 1.000 | 0.122 | 1.657 |
| End | 8,359 | 1.000 | **0.137 ⚠** | 1.700 |
| **AGG** | **372,210** | **0.9997** | **0.126** | **1.670** |

zs_enriched has **higher gc_err and longer length_ratio than zs_raw** in every cell type (GC 0.126 vs 0.116; length 1.67 vs 1.64). The tool-enriched prompt confuses the small zero-shot model rather than helping it.

ORACLE metrics for zs_enriched: pending (reaper still scoring).

## T2 — pair_prediction (full 744,420; 7-cell)

### zs_raw

| Cell | n | acc | F1 | precision | recall |
|---|---:|---:|---:|---:|---:|
| Ex | 172,176 | 0.500 | 0.0001 | 0.667 | 5e-05 |
| Mic | 149,656 | 0.500 | 0.000 | 0.500 | 1e-05 |
| Oli | 126,556 | 0.500 | 0.0001 | 0.800 | 6e-05 |
| In | 101,744 | 0.500 | 0.0001 | 0.750 | 6e-05 |
| Ast | 97,246 | 0.500 | 0.000 | 0.000 | 0.000 |
| OPC | 80,324 | 0.500 | 0.000 | 1.000 | 2e-05 |
| End | 16,718 | 0.500 | 0.000 | 0.000 | 0.000 |
| **AGG** | **744,420** | **0.500** | **0.0001** | **0.65** | **3.5e-05** |

### zs_enriched

| Cell | n | acc | F1 | precision | recall |
|---|---:|---:|---:|---:|---:|
| **Ast** | 97,246 | 0.500 | **0.0041** | 0.510 | 0.0021 |
| Ex | 172,176 | 0.500 | 0.0030 | 0.635 | 0.0015 |
| In | 101,744 | 0.500 | 0.0022 | 0.538 | 0.0011 |
| End | 16,718 | 0.500 | 0.0021 | 0.562 | 0.0011 |
| Oli | 126,556 | 0.500 | 0.0011 | 0.680 | 0.0005 |
| Mic | 149,656 | 0.500 | 0.0009 | 0.552 | 0.0004 |
| OPC | 80,324 | 0.500 | 0.0004 | 0.643 | 0.0002 |
| **AGG** | **744,420** | **0.500** | **0.002** | **0.575** | **0.001** |

Reading: zero-shot Qwen3.5-2B is **degenerate on T2**: it almost always predicts `not_paired`, so recall ≈ 0.001 even with tool_enriched prompts. The tool_enriched prompt gives a 20× lift (F1 0.0001 → 0.002), but there is still essentially no signal. The bottleneck (per `lab_message_2026_04_27_v2.md` §2): the T2 enriched pipeline only scans the **promoter** for TFBS; the enhancer side gets only GC%. The lab is regenerating with a both-sides scan on galaxy.

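The metric pattern above (acc pinned at 0.5, high precision, recall ≈ 0) is the signature of a near-constant `not_paired` predictor on a balanced pair set. A minimal sketch with synthetic labels (not the bench data) shows how the numbers arise:

```python
# Synthetic illustration: a model that almost always says "not_paired"
# (label 0) on a balanced set lands at acc=0.5 with recall and F1 ~ 0,
# while precision on its rare positive calls can still look decent.
def prf(y_true, y_pred, pos=1):
    tp = sum(t == pos and p == pos for t, p in zip(y_true, y_pred))
    fp = sum(t != pos and p == pos for t, p in zip(y_true, y_pred))
    fn = sum(t == pos and p != pos for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# Balanced set: 10,000 paired + 10,000 not_paired; the model emits
# "paired" only twice, once correctly and once not.
y_true = [1] * 10_000 + [0] * 10_000
y_pred = [0] * 20_000
y_pred[0] = 1       # one true positive
y_pred[10_000] = 1  # one false positive
print(prf(y_true, y_pred))  # acc=0.5, precision=0.5, recall=1e-4, F1~2e-4
```

This is why accuracy is uninformative here and the 20× F1 lift from tool enrichment, while real, still leaves essentially no usable signal.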
## T3 — enhancer_editing

The T3 zs_raw bench is **still in flight**: the per-task progress bar at 372k/372k reflects the eval-set total, but the final flush is still in progress. The bench process is at 02:26 elapsed. T3 zs_enriched is queued.

When metrics land they'll be:

* `metrics.json` (basic, sequence-overlap vs heuristic gold — INFORMATIONAL, see `t3_evaluation_design.md` §1)
* `genqual.json` (T1-style FBD/spec/argmax — INFORMATIONAL for T3)
* **`genqual_t3_oracle.json`** (HEADLINE: within_budget, in_budget_at_5pct, mean_objective_success, transfer_specificity)

The reaper auto-scores all three within 30 s of `predictions.jsonl` landing.

## Cross-reference: lab smoke results (older, smaller-N)

| Experiment | n | Outcome |
|---|---:|---|
| T1 zs_raw lab smoke (Ex only) | 64 | parse=1.0, gc_err=0.093, len_ratio=1.83 — sane |
| T1 zs_enriched lab smoke (Ex) | 64 | parse=1.0, gc_err=0.096, len_ratio=1.62 |
| T1 lora_raw lab smoke | 64 | **COLLAPSED**: output `CTGCTGCTG…` × 1790 chars, len_ratio=3.64 |
| T1 lora_enriched lab smoke | 64 | **COLLAPSED**: same `CTGCTG…` mode, len_ratio=3.90 |
| T2 asym pair NTv3+NT-v2, aux=none | 128 (Ex) | acc=0.773, F1=0.808, P=0.701, R=0.953 |
| T2 asym pair, aux=supcon_pair | 128 (Ex) | acc=0.719, F1=0.710, P=0.733, R=0.688 |
| T2 asym pair, aux=tier_aware_supcon | 128 (Ex) | acc=0.711, F1=0.776, P=0.634, **R=1.000** |

The lab T1 LoRA collapse is unusable — it needs a re-train with the sanitiser fix (`bda9ee0`). The lab T2 asym-pair n=128 smokes prove the architecture works (F1 ≈ 0.81 with no aux loss); the full 744k re-bench is gated on the T2 enhancer-scan regen.

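The `CTGCTG…` collapse mode is easy to flag automatically. A hypothetical helper (not part of the bench code) that catches tandem-repeat collapse by counting distinct k-mers relative to sequence length:

```python
# Hypothetical collapse detector: a sequence stuck in a short tandem
# repeat (e.g. CTGCTGCTG...) uses only a handful of distinct k-mers,
# while a normal enhancer-length sequence uses close to one per position.
def is_repeat_collapsed(seq: str, k: int = 6, max_distinct_frac: float = 0.1) -> bool:
    """True if the fraction of distinct k-mers over k-mer positions is tiny."""
    n = len(seq) - k + 1
    if n <= 0:
        return False
    distinct = len({seq[i:i + k] for i in range(n)})
    return distinct / n < max_distinct_frac

print(is_repeat_collapsed("CTG" * 300))         # True: only 3 distinct 6-mers
print(is_repeat_collapsed("ACGTTGCAGGTCCATA"))  # False: all 6-mers distinct
```

A check like this could gate the LoRA re-train smoke so a collapsed run fails fast instead of burning a full bench.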
## Headline takeaways for the paper draft

1. **Zero-shot Qwen3.5-2B is degenerate on T2** (F1 ≈ 0); cell-type-aware fusion-SFT is the real T2 paper claim. Expected lift from the n=128 lab smokes: F1 → 0.7–0.8.

2. **Zero-shot T1 shows cell-type collapse** (Ex/Oli good, End/OPC/In bad). Specificity of −1.0 at End means the model produces output that is MORE active in non-target cells than in End. This is exactly what cell-context conditioning + fusion-SFT should fix.

3. **Tool-enrichment hurts zero-shot T1 but helps zero-shot T2** by 20×. The pattern: small models can't filter rich tool_context into a useful signal for generation, but they CAN use it for binary classification. After fine-tuning we expect the gap to invert.

4. **T1 FID dispersion is huge across cells** (OPC=0.97 best, Mic=116 worst). For the fusion-SFT ablations, per-cell FID is the oracle-grounded number to report alongside the aggregate.

5. **The ~1.64 over-length is consistent across cells** — the model wants to output ~64% more bases than the gold reference. Unified+ntp / unified+mdlm / diffusion modes should fix this (DNA has a hard length budget via `dna_target_length`).

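For reference, the two basic metrics read as standard ratio/error definitions. A minimal sketch under that assumption (these are my assumed definitions, not the bench implementation):

```python
# Assumed metric definitions: length_ratio = predicted length over gold
# length; gc_err = absolute error in GC fraction vs the gold sequence.
def gc_frac(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)

def basic_metrics(pred: str, gold: str) -> dict:
    return {
        "length_ratio": len(pred) / len(gold),
        "gc_err": abs(gc_frac(pred) - gc_frac(gold)),
    }

# Illustrative pair: a 164-base prediction against a 100-base gold
# reference reproduces the ~1.64 over-length seen across cells.
m = basic_metrics("ACGT" * 41, "ACGT" * 25)
print(m)  # {'length_ratio': 1.64, 'gc_err': 0.0}
```

Under these definitions the 1.64 length_ratio is a pure over-generation problem, which is why a hard length budget (`dna_target_length`) should remove it outright.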
## What lands next (auto)

* T1 zs_enriched genqual.json (reaper, ~25 min on CPU; faster after vLLM frees the GPU)
* T3 zs_raw metrics.json + predictions.jsonl when the bench writer flushes
* T3 zs_raw genqual.json + genqual_t3_oracle.json (reaper, ~30 min after preds land)
* Then T3 zs_enriched starts
* When the bench grid exits → post_bench_pipeline.sh fires → Stages 1–7