Experiment log

One directory per experiment, newest last. Each has a README (design, config, findings) plus any publishable artifacts (metrics, summaries — never raw BLN600 text, which is CC-BY-NC).

| Date | Experiment | One-line result |
| --- | --- | --- |
| 2026-06-10 | v1 text benchmark: DiffusionGemma vs Gemma-4-E4B, 75 BLN600 passages | Diffusion wins on CER (0.036 vs 0.042) and is ~8.5× faster; OCR-seeded canvas collapses to copy-through |
| 2026-06-11 | image-input vibe check: does the source page image help correction? | Grounding is weak and can manufacture false confidence at low resolution; image-only OCR is weak; parked |
| 2026-06-11 | canvas-rescue sweep: can t_max / entropy_bound / canvas noise make the OCR-seeded canvas edit instead of copy? (pre-registered) | Negative. Knobs break the copy-through (steps 3→33) but editing never becomes correcting: best cell CER 0.063 ≈ doing nothing (0.064), far behind random-canvas 0.030, and slower. Needs training-time support |
| 2026-06-11 | MoE baseline: gemma-4-26B-A4B-it, the parameter-matched AR twin (per João Gante) | Quality headline flips: MoE wins CER 0.027 vs 0.035, but DiffusionGemma is ~10× faster at equal capacity. v1 numbers reproduce |
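For readers unfamiliar with the CER figures in the table, here is a minimal sketch of character error rate as Levenshtein edit distance divided by reference length. The log does not state how the experiments normalize text or average across passages, so the function names and the choice to raise on an empty reference are assumptions for illustration, not the project's `metrics.py`.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    # Keep the shorter string on the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a gold reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)
```

Under this definition a CER of 0.036 means roughly 36 character edits per 1,000 reference characters.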

Next steps (logged, not yet committed)

Scaled-up v2 eval — data identified, extension decision pending. A survey of post-OCR eval sources found two easy-to-grab additions to full BLN600 (n=600):

  • Overproof datasets 2+3 (overproof.projectcomputing.com/datasets) — 208 hand-corrected newspaper articles (Sydney Morning Herald 1842–1954 via Trove; Chronicling America 1871–1921). Plain-HTTP download, line-aligned OCR‖gold. Different collections and OCR pipelines than BLN600 → generalization test. No formal license (sources are public domain): eval fine, redistribution needs a permission ask. Dataset 3's source pages carry ABBYY ALTO per-word/char confidences — the bridge to a confidence-guided correction experiment.
  • NCSE transcribed articles (DOI 10.5522/04/25805008.v1) — 91 pairs, 40.7k words, 19th-c periodicals, much noisier OCR than BLN600, CC0, purpose-made human gold with published CLOCR-C LLM baselines. Single small zip.
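A line-aligned OCR‖gold file like the Overproof one could be loaded along these lines. This is a sketch only: the `||` delimiter default, the rule of skipping lines without a delimiter (headers, blanks), and the function name are assumptions to check against the downloaded files.

```python
from typing import List, Tuple


def parse_aligned_lines(text: str, delimiter: str = "||") -> List[Tuple[str, str]]:
    """Split each line into an (ocr, gold) pair at the first delimiter.

    Lines that do not contain the delimiter are skipped (assumed to be
    article headers or blank separators).
    """
    pairs = []
    for line in text.splitlines():
        if delimiter not in line:
            continue
        ocr, gold = line.split(delimiter, 1)
        pairs.append((ocr.strip(), gold.strip()))
    return pairs
```

Keeping the delimiter as a parameter means the same loader can be pointed at another line-aligned source if its separator differs.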

Also on the list: bootstrap confidence intervals in metrics.py (works retroactively on any outputs file) and multiple sampler seeds for the diffusion arm.

Rejected as eval gold after the survey: PleIAs/Post-OCR-Correction, ChroniclingAmericaQA, and Scrambled Text (all model-generated "gold", so training material only); RETAS (Gutenberg-aligned, not page-faithful); and NOD (synthetic noise). Held in reserve: ICDAR 2017 EN (Google Drive distribution, bespoke license, needs dedup vs ICDAR 2019).
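The bootstrap confidence intervals mentioned above could look roughly like the following percentile bootstrap over passages. It is a sketch, not the planned `metrics.py` code: it assumes each passage has already been reduced to an `(edit_count, reference_length)` pair and that CER is micro-averaged (total edits over total reference characters). The function name and defaults are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_cer_ci(
    pairs: List[Tuple[int, int]],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point_estimate, lower, upper) for micro-averaged CER.

    pairs holds one (edit_count, reference_length) tuple per passage.
    Passages are resampled with replacement; the interval is the
    percentile interval of the resampled statistic.
    """
    if not pairs or any(ref_len <= 0 for _, ref_len in pairs):
        raise ValueError("need at least one passage, each with ref_len > 0")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(pairs)
    stats = []
    for _ in range(n_resamples):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        edits = sum(e for e, _ in sample)
        chars = sum(r for _, r in sample)
        stats.append(edits / chars)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = sum(e for e, _ in pairs) / sum(r for _, r in pairs)
    return point, lower, upper
```

Because it needs only per-passage edit counts and reference lengths, a routine like this can run on any existing outputs file without re-running a model.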
