davanstrien/diffusiongemma-ocr-bench / experiments

214 kB

16 files

Updated about 10 hours ago

Name	Size	Uploaded	Xet hash
2026-06-10_v1-bln600-text		about 19 hours ago	3 items
2026-06-11_canvas-rescue		about 18 hours ago	3 items
2026-06-11_image-vibe		about 19 hours ago	1 items
2026-06-11_moe-baseline		about 10 hours ago	3 items
README.md	3.01 kB xet	about 10 hours ago	cbe6ddc8

README.md

Experiment log

One directory per experiment, newest last. Each has a README (design, config, findings) plus any publishable artifacts (metrics, summaries — never raw BLN600 text, which is CC-BY-NC).

Date	Experiment	One-line result
2026-06-10	v1 text benchmark — DiffusionGemma vs Gemma-4-E4B, 75 BLN600 passages	Diffusion wins on CER (0.036 vs 0.042) and is ~8.5× faster; OCR-seeded canvas collapses to copy-through
2026-06-11	image-input vibe check — does the source page image help correction?	Grounding is weak and can manufacture false confidence at low resolution; image-only OCR is weak; parked
2026-06-11	canvas-rescue sweep — can t_max / entropy_bound / canvas noise make the OCR-seeded canvas edit instead of copy? (pre-registered)	Negative. Knobs break the copy-through (steps 3→33) but editing never becomes correcting: best cell CER 0.063 ≈ doing nothing (0.064), far behind random-canvas 0.030 — and slower. Needs training-time support
2026-06-11	MoE baseline — gemma-4-26B-A4B-it, the parameter-matched AR twin (per João Gante)	Quality headline flips: MoE wins CER 0.027 vs 0.035, but DiffusionGemma is ~10× faster at equal capacity. v1 numbers reproduce
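The CER figures in the table can be read as character-level edit distance divided by reference length. The following is a minimal sketch, assuming unit-cost Levenshtein distance and micro-averaging over passages. The repository's metrics.py may normalize text or average differently, and the function names here are illustrative rather than taken from it.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between two strings."""
    # Keep the shorter string on the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of one passage against its gold transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def corpus_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Micro-averaged CER: total edits over total reference characters."""
    pairs = list(pairs)
    total_edits = sum(levenshtein(hyp, ref) for hyp, ref in pairs)
    total_chars = sum(len(ref) for _, ref in pairs)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    return total_edits / total_chars
```

Under this definition, the "doing nothing" baseline in the canvas-rescue row is simply `corpus_cer` applied to the uncorrected OCR text against the gold.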

Next steps (logged, not yet committed)

Scaled-up v2 eval — data identified, extension decision pending. A survey of post-OCR eval sources found two easy-to-grab additions to full BLN600 (n=600):

Overproof datasets 2+3 (overproof.projectcomputing.com/datasets) — 208 hand-corrected newspaper articles (Sydney Morning Herald 1842–1954 via Trove; Chronicling America 1871–1921). Plain-HTTP download, line-aligned OCR‖gold. Different collections and OCR pipelines than BLN600 → generalization test. No formal license (sources are public domain): eval fine, redistribution needs a permission ask. Dataset 3's source pages carry ABBYY ALTO per-word/char confidences — the bridge to a confidence-guided correction experiment.
NCSE transcribed articles (DOI 10.5522/04/25805008.v1) — 91 pairs, 40.7k words, 19th-c periodicals, much noisier OCR than BLN600, CC0, purpose-made human gold with published CLOCR-C LLM baselines. Single small zip.
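Loading a line-aligned OCR‖gold file, as described for the Overproof data, could look like the sketch below. The separator string is deliberately a required argument: the `"||@@||"` value in the usage note is an assumption that should be checked against the downloaded files, and `parse_aligned_lines` is a hypothetical helper, not part of this repository.

```python
from typing import Iterable, Iterator, Tuple


def parse_aligned_lines(
    lines: Iterable[str], separator: str
) -> Iterator[Tuple[str, str]]:
    """Yield (ocr, gold) pairs from lines of the form '<ocr><separator><gold>'.

    Lines that do not contain the separator (headers, blank lines) are skipped.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if separator not in line:
            continue
        # Split only on the first separator so gold text is kept intact.
        ocr, gold = line.split(separator, 1)
        yield ocr.strip(), gold.strip()
```

For example, `list(parse_aligned_lines(open(path, encoding="utf-8"), "||@@||"))` would return the pairs for one file, assuming that separator is the one actually used.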

Also on the list: bootstrap confidence intervals in metrics.py (works retroactively on any outputs file) and multiple sampler seeds for the diffusion arm.

Rejected as eval gold after the survey: PleIAs/Post-OCR-Correction, ChroniclingAmericaQA, and Scrambled Text (all model-generated "gold", so training material only); RETAS (Gutenberg-aligned, not page-faithful); and NOD (synthetic noise). ICDAR 2017 EN is not rejected but held in reserve: it is distributed via Google Drive under a bespoke license and needs deduplication against ICDAR 2019.
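The bootstrap confidence intervals mentioned above could be sketched as follows. This is a percentile bootstrap over per-passage scores, one plausible approach and not necessarily what metrics.py will implement. The function name, the default of 2000 resamples, and the assumption that an outputs file reduces to one CER value per passage are all illustrative.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-passage scores."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    # Resample passages with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]
```

Because it only needs per-passage scores, a function like this can be run on existing outputs without re-running any model. With 75 passages the interval is likely to be wide relative to gaps such as 0.036 vs 0.042, which is one argument for the scaled-up evaluation.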

Last updated: Jun 11
