214 kB
16 files
Updated about 2 hours ago

DiffusionGemma vs Gemma-4 on post-OCR correction — experiment scripts

Scripts behind the Post-OCR Gazette demo Space: a first-pass benchmark of google/diffusiongemma-26B-A4B-it (experimental block-diffusion LLM, 26B MoE / 4B active) against autoregressive Gemma-4 baselines (gemma-4-E4B-it, and gemma-4-26B-A4B-it as the parameter-matched MoE arm) on post-OCR correction of 19th-century English newspaper text.

All experiments ran on Hugging Face Jobs (one A100-80GB, bf16, batch 1) — no local GPU involved.

Experiment log

Every experiment is logged under experiments/ — one directory per run with a README (design, config, findings) and publishable artifacts (summaries, text-free per-passage metrics). Start with the experiment index.

Files

  • benchmark.py — generation only. Self-contained UV script (PEP 723 inline deps): downloads BLN600, samples and align-trims passages to DiffusionGemma's 256-token canvas, runs each model sequentially with timing, writes raw_outputs.jsonl. Includes the OCR-seeded-canvas condition (--canvas-init) via the undocumented decoder_input_ids hook in DiffusionGemmaForBlockDiffusion.generate().
  • metrics.py — all metrics, computed offline from the JSONL (CER/WER via jiwer; over-correction rate and fix rate via character alignment). Invoke its test subcommand with uv run metrics.py test.
  • vibe_image.py — small image-input vibe check (text vs image+text vs image-only conditions); see notes in the Space.
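As a rough illustration of the CER/WER metrics named above, the sketch below uses a hand-rolled Levenshtein distance so that it runs on the standard library alone. metrics.py itself uses jiwer. The function names here are illustrative, not taken from the script, and the over-correction and fix-rate definitions are not reproduced because the README does not spell them out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate on whitespace tokens; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up example strings, not taken from BLN600.
reference = "the quick brown fox"
ocr_text = "tbe qnick brown fox"
print(f"CER={cer(reference, ocr_text):.3f}  WER={wer(reference, ocr_text):.3f}")
```

Both rates are normalised by the reference length, so a hypothesis with many insertions can score above 1.0.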

Reproduce

# smoke (3 passages, verbose)
hf jobs uv run --flavor a100-large --timeout 45m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
  benchmark.py -- --mode smoke

# full benchmark (all three model arms)
hf jobs uv run --flavor a100-large --timeout 4h -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
  benchmark.py -- --mode full --n 75 --models all --out-repo <your-private-dataset-repo>

# metrics (local, CPU)
uv run metrics.py summarize --infile raw_outputs.jsonl --outdir results/

Results (v2, 2026-06-11, n=75 BLN600 passages, all arms in one job)

| Condition | CER (input 0.066) | Rel. CER reduction | Over-correction | Fix rate | Median time (s) | tok/s |
|---|---|---|---|---|---|---|
| DiffusionGemma (default) | 0.035 | 49.5% | 1.5% | 86.0% | 1.69 | 119.9 |
| Gemma-4-E4B (greedy) | 0.042 | 45.9% | 0.4% | 61.5% | 15.33 | 12.9 |
| Gemma-4-26B-A4B MoE (greedy) | 0.027 | 62.4% | 0.9% | 87.5% | 16.31 | 12.0 |

The parameter-matched MoE wins on quality; DiffusionGemma is ~10× faster at equal capacity (and reproduces its v1 numbers to within noise). The v1 OCR-seeded-canvas condition (CER 0.081, copy-through collapse) and its attempted rescue are in the experiment log.
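The "~10×" figure can be checked against the results table. The short calculation below uses only the DiffusionGemma and Gemma-4-26B-A4B rows; the variable names are illustrative.

```python
# Values copied from the v2 results table (n=75).
diffusion = {"median_s": 1.69, "tok_per_s": 119.9}
moe = {"median_s": 16.31, "tok_per_s": 12.0}

# Ratio of median per-passage times (higher means DiffusionGemma is faster).
latency_speedup = moe["median_s"] / diffusion["median_s"]
# Ratio of throughputs in tokens per second.
throughput_speedup = diffusion["tok_per_s"] / moe["tok_per_s"]

print(f"latency speedup:    {latency_speedup:.1f}x")
print(f"throughput speedup: {throughput_speedup:.1f}x")
```

Both ratios come out just under 10, which is consistent with the rounded claim.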

Limitations: n=75, single prompt, one run per arm, no significance testing; 256-token block caps passage length; no greedy mode exists for the diffusion sampler; model was one day old at benchmark time.

Data & raw outputs

The eval data is not duplicated in this bucket. Fetch it from the source, or let the scripts do it: both corpora are fetched and parsed automatically at run time, so a Jobs run needs no data setup.

Raw generation outputs (which embed BLN600 text) live in a private, git-versioned dataset repo rather than being mirrored here. The bucket carries the text-free per-passage metrics, which support most reanalysis (bootstrap CIs, significance tests, per-passage plots) — and generation is fully seeded, so the scripts regenerate raw outputs bit-for-bit from the source data.
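The bootstrap reanalysis mentioned above could look roughly like the sketch below: a paired percentile bootstrap over per-passage CER differences between two arms. The schema of the published metric files is not described here, so the example uses made-up in-memory lists. The function name and its defaults are assumptions for illustration only.

```python
import random
import statistics


def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a - b) over paired per-passage scores."""
    if len(a) != len(b):
        raise ValueError("paired inputs must have the same length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample passages with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Made-up per-passage CER values for two arms, for illustration only.
cer_arm_a = [0.031, 0.040, 0.028, 0.052, 0.035, 0.044, 0.030, 0.039]
cer_arm_b = [0.025, 0.033, 0.027, 0.041, 0.030, 0.036, 0.029, 0.031]

mean_diff, lo, hi = paired_bootstrap_ci(cer_arm_a, cer_arm_b)
print(f"mean CER difference {mean_diff:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

Resampling whole passages keeps the pairing between arms intact, which matters here because all arms were evaluated on the same 75 passages.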

Picking up this work (humans or agents)

Everything needed to reproduce or extend these experiments is in this bucket — no other repo required. The workflow, end to end:

  1. Run generation on HF Jobs, directly from this bucket, with no download needed. Every script is self-contained (PEP 723 inline deps) and hf jobs uv run accepts a URL; bucket files resolve at https://huggingface.co/buckets/davanstrien/diffusiongemma-ocr-bench/resolve/<file>. This needs an HF account with Jobs billing and a prior hf auth login. Always run the smoke test first. Measured costs on a100-large (~$2.50/h):
    BUCKET=https://huggingface.co/buckets/davanstrien/diffusiongemma-ocr-bench/resolve
    
    # smoke: ~8 min wall-clock ≈ $0.35 (model download runs at multi-GB/s, ~2 min for 52 GB)
    hf jobs uv run --flavor a100-large --timeout 45m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
      $BUCKET/benchmark.py -- --mode smoke
    
    # full three-arm benchmark: ~60-80 min ≈ $3
    hf jobs uv run --flavor a100-large --timeout 3h -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
      $BUCKET/benchmark.py -- --mode full --n 75 --models all --out-repo <your-private-dataset-repo>
    
    # canvas sweep screen stage (27-cell factorial + anchor, n=20): ~30-40 min ≈ $1.50
    hf jobs uv run --flavor a100-large --timeout 90m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
      $BUCKET/canvas_sweep.py -- --stage screen --out-repo <your-private-dataset-repo>
    
  2. To modify a script first, copy it locally, edit, and run the local file the same way:
    hf buckets cp hf://buckets/davanstrien/diffusiongemma-ocr-bench/canvas_sweep.py .
    
    Raw outputs contain BLN600 text (CC-BY-NC), so --out-repo must be private. Monitor a job with hf jobs logs <job-id>; list jobs with hf jobs ps.
  3. Compute metrics (CPU, seconds — locally, or as a URL-run cpu-basic job):
    hf download <your-private-dataset-repo> raw_outputs.jsonl --repo-type dataset --local-dir .
    uv run metrics.py summarize --infile raw_outputs.jsonl --outdir results/
    uv run metrics.py sweep --infile raw_outputs_canvas_sweep_screen.jsonl --outdir results/
    
    Metrics are deliberately decoupled from generation: metric changes never require re-running GPU jobs, and the metric files contain no copyrighted text — safe to publish.
  4. Log your experiment. Convention: one directory per experiment under experiments/, named YYYY-MM-DD_slug/, containing a README (question, design — ideally written before results — findings, including negative ones) plus the text-free metric artifacts. Add a row to the experiment index. Upload with:
    hf buckets cp -r my-experiment-dir hf://buckets/<your-bucket>/experiments/YYYY-MM-DD_slug
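A small helper can keep directory names consistent with the YYYY-MM-DD_slug convention from step 4. This is only a sketch: the function name is hypothetical, and the exact slug style (lowercase, hyphen-separated) is an assumption, since the convention above does not specify it.

```python
import re
from datetime import date


def experiment_dir_name(title, day=None):
    """Build a YYYY-MM-DD_slug directory name from a free-text title."""
    day = day or date.today()
    # Assumed slug style: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return f"{day.isoformat()}_{slug}"


print(experiment_dir_name("Canvas sweep: screen stage", date(2026, 6, 11)))
# prints 2026-06-11_canvas-sweep-screen-stage
```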
    

Useful context for new experiments: DiffusionGemma's sampler knobs live in its generation_config.json (t_max/t_min temperature schedule, EntropyBoundSampler entropy_bound, confidence_threshold, max_denoising_steps 48, fixed 256-token canvas). The undocumented decoder_input_ids kwarg of DiffusionGemmaForBlockDiffusion.generate() replaces the random initial canvas — see canvas_sweep.py for a worked example, and the v1 experiment README for the engineering gotchas (output channel markers, .sequences includes the prompt, streamer subclassing for step counts).
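One of the gotchas listed above, that .sequences includes the prompt, is usually handled by slicing off the prompt tokens before decoding. The sketch below shows the idea on plain Python lists. It assumes the prompt occupies the leading positions of the returned sequence, which is not verified here for DiffusionGemma; the helper name and the token ids are placeholders.

```python
def strip_prompt(sequence, prompt_len):
    """Return only the newly generated token ids from a prompt+completion sequence."""
    if prompt_len > len(sequence):
        raise ValueError("prompt_len exceeds sequence length")
    return sequence[prompt_len:]


# Placeholder token ids; real ids would come from the tokenizer and generate().
prompt_ids = [2, 105, 2364, 107]
full_sequence = prompt_ids + [9259, 1902, 236761, 106]

completion_ids = strip_prompt(full_sequence, len(prompt_ids))
print(completion_ids)  # prints [9259, 1902, 236761, 106]
```

With tensors the same slice would apply along the sequence dimension. The output channel markers mentioned in the v1 README would still need separate handling.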

By davanstrien.
