Buckets:
| Name | Size | Uploaded | Xet hash |
|---|---|---|---|
| 2026-06-10_v1-bln600-text | 3 items | ||
| 2026-06-11_canvas-rescue | 3 items | ||
| 2026-06-11_image-vibe | 1 item | ||
| 2026-06-11_moe-baseline | 3 items | ||
| README.md | 3.01 kB | | cbe6ddc8 |
Experiment log
One directory per experiment, newest last. Each has a README (design, config, findings) plus any publishable artifacts (metrics, summaries — never raw BLN600 text, which is CC-BY-NC).
| Date | Experiment | One-line result |
|---|---|---|
| 2026-06-10 | v1 text benchmark — DiffusionGemma vs Gemma-4-E4B, 75 BLN600 passages | Diffusion wins on CER (0.036 vs 0.042) and is ~8.5× faster; OCR-seeded canvas collapses to copy-through |
| 2026-06-11 | image-input vibe check — does the source page image help correction? | Grounding is weak and can manufacture false confidence at low resolution; image-only OCR is weak; parked |
| 2026-06-11 | canvas-rescue sweep — can t_max / entropy_bound / canvas noise make the OCR-seeded canvas edit instead of copy? (pre-registered) | Negative. Knobs break the copy-through (steps 3→33) but editing never becomes correcting: best cell CER 0.063 ≈ doing nothing (0.064), far behind random-canvas 0.030 — and slower. Needs training-time support |
| 2026-06-11 | MoE baseline — gemma-4-26B-A4B-it, the parameter-matched AR twin (per João Gante) | Quality headline flips: MoE wins CER 0.027 vs 0.035, but DiffusionGemma is ~10× faster at equal capacity. v1 numbers reproduce |
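The CER figures in the table are character error rates. As a rough illustration of what that metric measures, here is a minimal sketch: Levenshtein edit distance between a model output and the gold text, divided by the gold length. The function names (`levenshtein`, `cer`, `corpus_cer`) are hypothetical and not taken from this repo's metrics.py. The sketch also assumes micro-averaging over passages and no text normalisation; the log does not say which conventions the experiments use.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the DP row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # drop a character from a
                curr[j - 1] + 1,           # add a character from b
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of one passage against its gold reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def corpus_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Micro-averaged CER: total edits over total reference characters.

    Assumption: micro-averaging. A macro average (mean of per-passage
    CERs) would weight short passages more heavily.
    """
    edits = 0
    chars = 0
    for hypothesis, reference in pairs:
        edits += levenshtein(hypothesis, reference)
        chars += len(reference)
    if chars == 0:
        raise ValueError("no reference characters")
    return edits / chars
```

Under this definition, the "doing nothing" baseline is simply the CER of the raw OCR text against the gold, so a correction system only helps if its CER falls below that value.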
Next steps (logged, not yet committed)
Scaled-up v2 eval — data identified, extension decision pending. A survey of post-OCR eval sources found two easy-to-grab additions to full BLN600 (n=600):
- Overproof datasets 2+3 (overproof.projectcomputing.com/datasets) — 208 hand-corrected newspaper articles (Sydney Morning Herald 1842–1954 via Trove; Chronicling America 1871–1921). Plain-HTTP download, line-aligned OCR‖gold. Different collections and OCR pipelines than BLN600 → generalization test. No formal license (sources are public domain): eval fine, redistribution needs a permission ask. Dataset 3's source pages carry ABBYY ALTO per-word/char confidences — the bridge to a confidence-guided correction experiment.
- NCSE transcribed articles (DOI 10.5522/04/25805008.v1) — 91 pairs, 40.7k words, 19th-c periodicals, much noisier OCR than BLN600, CC0, purpose-made human gold with published CLOCR-C LLM baselines. Single small zip.
Also on the list: bootstrap confidence intervals in metrics.py (works retroactively on any outputs file) and multiple sampler seeds for the diffusion arm.
Rejected as eval gold after the survey: PleIAs/Post-OCR-Correction, ChroniclingAmericaQA, and Scrambled Text (all model-generated "gold", so training material only); RETAS (Gutenberg-aligned, not page-faithful); and NOD (synthetic noise). ICDAR 2017 EN is held in reserve (Google Drive distribution, bespoke license, needs dedup vs ICDAR 2019).
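The bootstrap confidence intervals listed under next steps could look roughly like the percentile bootstrap below. This is a sketch under stated assumptions, not the planned metrics.py implementation: `bootstrap_ci` and its parameters are hypothetical names, the input is assumed to be one score per passage (for example a per-passage CER), and the statistic is the plain mean of those scores.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the mean of per-passage scores.

    Returns (point_estimate, lower, upper) for a (1 - alpha) interval.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(values)
    # Resample passages with replacement and record the mean each time.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(values) / n, means[lower_idx], means[upper_idx]
```

Because it needs only per-passage scores, a function like this can run on any existing outputs file. To compare two systems on the same passages, one option is to pass the per-passage differences in score and check whether the resulting interval excludes zero.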
- Total size: 214 kB
- Files: 16
- Last updated: Jun 11
- Pre-warmed CDN: US, EU