Buckets:
| Name | Size | Uploaded | Xet hash |
|---|---|---|---|
| experiments | 11 items | | |
| README.md | 7.69 kB | | af206d44 |
| benchmark.py | 20.8 kB | | d371aeac |
| canvas_sweep.py | 15.6 kB | | 3033142b |
| metrics.py | 10.1 kB | | 3f3d64a3 |
| vibe_image.py | 7.76 kB | | a88e66f4 |
DiffusionGemma vs Gemma-4 on post-OCR correction — experiment scripts
Scripts behind the Post-OCR Gazette demo Space: a first-pass benchmark of google/diffusiongemma-26B-A4B-it (experimental block-diffusion LLM, 26B MoE / 4B active) against autoregressive Gemma-4 baselines (gemma-4-E4B-it, and gemma-4-26B-A4B-it as the parameter-matched MoE arm) on post-OCR correction of 19th-century English newspaper text.
All experiments ran on Hugging Face Jobs (one A100-80GB, bf16, batch 1) — no local GPU involved.
Experiment log
Every experiment is logged under experiments/ — one directory per
run with a README (design, config, findings) and publishable artifacts (summaries,
text-free per-passage metrics). Start with the
experiment index.
Files
- benchmark.py: generation only. Self-contained UV script (PEP 723 inline deps): downloads BLN600, samples and align-trims passages to DiffusionGemma's 256-token canvas, runs each model sequentially with timing, writes raw_outputs.jsonl. Includes the OCR-seeded-canvas condition (--canvas-init) via the undocumented decoder_input_ids hook in DiffusionGemmaForBlockDiffusion.generate().
- metrics.py: all metrics, computed offline from the JSONL (CER/WER via jiwer, over-correction rate and fix rate via character alignment). uv run metrics.py test.
- vibe_image.py: small image-input vibe check (text vs image+text vs image-only conditions); see notes in the Space.
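"PEP 723 inline deps" means each script carries its requirements in a comment block at the top of the file, which tools such as uv read before running it. The sketch below extracts that block using the regular expression from the PEP's reference implementation. The sample header and its dependency list are illustrative assumptions, not the actual contents of benchmark.py.

```python
import re
from typing import Optional

# Regular expression taken from the PEP 723 reference implementation.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_script_metadata(script_text: str) -> Optional[str]:
    """Return the TOML text of the `script` metadata block, or None if absent."""
    matches = [m for m in BLOCK_RE.finditer(script_text) if m.group("type") == "script"]
    if len(matches) > 1:
        raise ValueError("multiple script blocks found")
    if not matches:
        return None
    # Strip the leading "# " (or a bare "#") from every comment line.
    return "".join(
        line[2:] if line.startswith("# ") else line[1:]
        for line in matches[0].group("content").splitlines(keepends=True)
    )


# Hypothetical header: the real dependency list in benchmark.py may differ.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = ["jiwer"]
# ///
print("hello")
'''

metadata = read_script_metadata(EXAMPLE_SCRIPT)
print(metadata)
```

Because the requirements travel with the file, a single script URL is enough for a remote runner to build the environment, which is why the scripts here need no separate requirements file.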
Reproduce
# smoke (3 passages, verbose)
hf jobs uv run --flavor a100-large --timeout 45m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
benchmark.py -- --mode smoke
# full benchmark (all three model arms)
hf jobs uv run --flavor a100-large --timeout 4h -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
benchmark.py -- --mode full --n 75 --models all --out-repo <your-private-dataset-repo>
# metrics (local, CPU)
uv run metrics.py summarize --infile raw_outputs.jsonl --outdir results/
Results (v2, 2026-06-11, n=75 BLN600 passages, all arms one job)
| Condition | CER (input: 0.066) | Relative CER reduction | Over-correction rate | Fix rate | Median time (s) | Tokens/s |
|---|---|---|---|---|---|---|
| DiffusionGemma (default) | 0.035 | 49.5% | 1.5% | 86.0% | 1.69 | 119.9 |
| Gemma-4-E4B (greedy) | 0.042 | 45.9% | 0.4% | 61.5% | 15.33 | 12.9 |
| Gemma-4-26B-A4B MoE (greedy) | 0.027 | 62.4% | 0.9% | 87.5% | 16.31 | 12.0 |
The parameter-matched MoE wins on quality (CER 0.027 vs 0.035 for DiffusionGemma); DiffusionGemma is ~10× faster at equal capacity (median 1.69 s vs 16.31 s, 119.9 vs 12.0 tok/s) and reproduces its v1 numbers to within noise. The v1 OCR-seeded-canvas condition (CER 0.081, copy-through collapse) and its attempted rescue are in the experiment log.
Limitations: n=75, single prompt, one run per arm, no significance testing; 256-token block caps passage length; no greedy mode exists for the diffusion sampler; model was one day old at benchmark time.
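For readers new to the metric, here is a dependency-free sketch of character error rate (CER) as Levenshtein distance divided by reference length, plus relative CER reduction against the uncorrected OCR input. metrics.py itself uses jiwer, and the table's relative reductions may be aggregated differently (for example per passage), so this illustrates the definitions rather than reproducing the reported figures. The sample strings are made up.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against the ground truth `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_cer_reduction(cer_input: float, cer_output: float) -> float:
    """Share of the input error removed by correction (negative if it got worse)."""
    return (cer_input - cer_output) / cer_input


# Made-up example: three OCR errors, two of which the corrector fixes.
truth = "the quick brown fox"
ocr = "tbe qnick brown f0x"
corrected = "the quick brown f0x"

cer_in = cer(truth, ocr)
cer_out = cer(truth, corrected)
print(f"input CER {cer_in:.3f}, output CER {cer_out:.3f}, "
      f"relative reduction {relative_cer_reduction(cer_in, cer_out):.1%}")
```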
Data & raw outputs
The eval data is not duplicated in this bucket — grab it from the source (or let the scripts do it: both corpora are fetched and parsed automatically at run time, so a Jobs run needs no data setup):
- BLN600 (CC-BY-NC-4.0): figshare DOI 10.15131/shef.data.25439023. One ~68 MB zip (password BLN600) with aligned OCR Text/, Ground Truth/ and article-cropped Images/ folders.
- ICDAR2019 post-OCR (CC-BY-4.0): zenodo record 3515403. Used via benchmark.py --dataset icdar; source of the Space's demo passages.
Raw generation outputs (which embed BLN600 text) live in a private, git-versioned dataset repo rather than being mirrored here. The bucket carries the text-free per-passage metrics, which support most reanalysis (bootstrap CIs, significance tests, per-passage plots) — and generation is fully seeded, so the scripts regenerate raw outputs bit-for-bit from the source data.
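As one example of the reanalysis the per-passage metrics allow, the sketch below computes a percentile-bootstrap confidence interval for the mean paired CER difference between two arms. The per-passage values are hypothetical placeholders, and the artifact files' actual column names and layout are not assumed here; load the real numbers however suits your tooling.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    # Resample passages with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical per-passage CERs for two arms on the same passages.
cer_arm_a = [0.031, 0.040, 0.022, 0.055, 0.018, 0.047, 0.029, 0.036]
cer_arm_b = [0.025, 0.033, 0.024, 0.041, 0.015, 0.039, 0.027, 0.030]

# Pair by passage so passage difficulty cancels out of the comparison.
deltas = [a - b for a, b in zip(cer_arm_a, cer_arm_b)]
low, high = bootstrap_mean_ci(deltas)
print(f"mean delta = {statistics.fmean(deltas):.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

An interval that excludes zero suggests the two arms differ on this passage sample. With a single run per arm, it says nothing about run-to-run variance.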
Picking up this work (humans or agents)
Everything needed to reproduce or extend these experiments is in this bucket — no other repo required. The workflow, end to end:
- Run generation on HF Jobs, directly from this bucket, no download needed. Every script is self-contained (PEP 723 inline deps) and hf jobs uv run accepts a URL; bucket files resolve at https://huggingface.co/buckets/davanstrien/diffusiongemma-ocr-bench/resolve/<file>. (Needs an HF account with Jobs billing; hf auth login.) Always smoke first. Measured costs on a100-large (~$2.50/h):

      BUCKET=https://huggingface.co/buckets/davanstrien/diffusiongemma-ocr-bench/resolve
      # smoke: ~8 min wall-clock ≈ $0.35 (model download runs at multi-GB/s, ~2 min for 52 GB)
      hf jobs uv run --flavor a100-large --timeout 45m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
        $BUCKET/benchmark.py -- --mode smoke
      # full three-arm benchmark: ~60-80 min ≈ $3
      hf jobs uv run --flavor a100-large --timeout 3h -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
        $BUCKET/benchmark.py -- --mode full --n 75 --models all --out-repo <your-private-dataset-repo>
      # canvas sweep screen stage (27-cell factorial + anchor, n=20): ~30-40 min ≈ $1.50
      hf jobs uv run --flavor a100-large --timeout 90m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
        $BUCKET/canvas_sweep.py -- --stage screen --out-repo <your-private-dataset-repo>

- To modify a script first, copy it locally, edit, and run the local file the same way:

      hf buckets cp hf://buckets/davanstrien/diffusiongemma-ocr-bench/canvas_sweep.py .

  Raw outputs contain BLN600 text (CC-BY-NC), so --out-repo must be private. Monitor with hf jobs logs <job-id>; list with hf jobs ps.
- Compute metrics (CPU, seconds; locally, or as a URL-run cpu-basic job):

      hf download <your-private-dataset-repo> raw_outputs.jsonl --repo-type dataset --local-dir .
      uv run metrics.py summarize --infile raw_outputs.jsonl --outdir results/
      uv run metrics.py sweep --infile raw_outputs_canvas_sweep_screen.jsonl --outdir results/

  Metrics are deliberately decoupled from generation: metric changes never require re-running GPU jobs, and the metric files contain no copyrighted text, so they are safe to publish.
- Log your experiment. Convention: one directory per experiment under experiments/, named YYYY-MM-DD_slug/, containing a README (question, design, findings including negative ones; the design is ideally written before results) plus the text-free metric artifacts. Add a row to the experiment index. Upload with:

      hf buckets cp -r my-experiment-dir hf://buckets/<your-bucket>/experiments/YYYY-MM-DD_slug
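The naming convention for experiment directories can be sketched as a small helper that builds the YYYY-MM-DD_slug name and a README skeleton. The function names, the slug rules (lowercase, hyphen-separated) and the markdown headings are assumptions made for illustration; only the date-underscore-slug pattern and the list of README topics come from the convention described above.

```python
import re
from datetime import date


def experiment_dir_name(title: str, day: date) -> str:
    """Build a directory name of the form YYYY-MM-DD_slug from a free-text title."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics collapse to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return f"{day.isoformat()}_{slug}"


def readme_skeleton(title: str) -> str:
    """Return a README stub; fill in Question and Design before looking at results."""
    sections = ["Question", "Design", "Config", "Findings"]
    body = "\n\n".join(f"## {name}\n\nTODO" for name in sections)
    return f"# {title}\n\n{body}\n"


name = experiment_dir_name("Canvas sweep: screen stage", date(2026, 6, 11))
print(name)
print(readme_skeleton("Canvas sweep: screen stage"))
```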
Useful context for new experiments: DiffusionGemma's sampler knobs live in its
generation_config.json (t_max/t_min temperature schedule, EntropyBoundSampler
entropy_bound, confidence_threshold, max_denoising_steps 48, fixed 256-token
canvas). The undocumented decoder_input_ids kwarg of
DiffusionGemmaForBlockDiffusion.generate() replaces the random initial canvas —
see canvas_sweep.py for a worked example, and the v1 experiment README for the
engineering gotchas (output channel markers, .sequences includes the prompt,
streamer subclassing for step counts).
By davanstrien.
- Total size: 214 kB
- Files: 16
- Last updated: Jun 11