Arjunvir Singh committed
Commit db06ffa · 0 parents

Initial commit: zeroshotGPU MVP with full eval surface


Profiler, router, parser registry, schema, merger with conflict detection,
verifier (coverage/reading-order/table/figure/formula/chunk-readiness plus
GT-comparison: layout F1, table structure, formula CER, retrieval recall),
iterative repair loop with optional GPU escalation, agentic chunker,
benchmark suite (per-doc + per-parser + ablation + cross-dataset),
Gradio Spaces UI with abuse guards + per-artifact downloads, structured
JSON logging, preflight runner, regression-fixture format with perf
floors, .env loading, pre-commit/pre-push hooks, CONTRIBUTING.md +
docs/space_smoke.md, scripts/run_space_smoke.py runner.

Test count: 240/240 passing.

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full set.
Files changed (50)
  1. .env.example +16 -0
  2. .gitignore +15 -0
  3. .pre-commit-config.yaml +44 -0
  4. CHANGELOG.md +274 -0
  5. CONTRIBUTING.md +235 -0
  6. Makefile +49 -0
  7. README.md +287 -0
  8. app.py +251 -0
  9. configs/default.yaml +159 -0
  10. configs/docling.yaml +29 -0
  11. configs/gpu.yaml +43 -0
  12. configs/parsers.yaml +33 -0
  13. configs/routing.yaml +8 -0
  14. docs/space_smoke.md +269 -0
  15. examples/parse_folder.py +27 -0
  16. examples/parse_pdf.py +25 -0
  17. examples/run_benchmark.py +33 -0
  18. pyproject.toml +41 -0
  19. requirements.txt +33 -0
  20. scripts/__init__.py +0 -0
  21. scripts/run_space_smoke.py +455 -0
  22. tests/__init__.py +1 -0
  23. tests/regression/README.md +97 -0
  24. tests/regression/__init__.py +0 -0
  25. tests/regression/fixtures/markdown_basic.expected.json +31 -0
  26. tests/regression/fixtures/markdown_basic.input.md +14 -0
  27. tests/regression/test_regression.py +255 -0
  28. tests/test_ablation_runner.py +133 -0
  29. tests/test_app.py +141 -0
  30. tests/test_artifacts.py +82 -0
  31. tests/test_benchmark.py +55 -0
  32. tests/test_chunking.py +286 -0
  33. tests/test_cli_help.py +91 -0
  34. tests/test_conflict_detection.py +89 -0
  35. tests/test_cross_dataset.py +123 -0
  36. tests/test_datasets.py +152 -0
  37. tests/test_deployment.py +43 -0
  38. tests/test_docling_parser.py +39 -0
  39. tests/test_embedding_retriever.py +190 -0
  40. tests/test_env_loading.py +110 -0
  41. tests/test_external_parser_adapters.py +69 -0
  42. tests/test_gpu_runner.py +185 -0
  43. tests/test_gpu_runtime.py +47 -0
  44. tests/test_gpu_tasks.py +99 -0
  45. tests/test_layout_f1.py +190 -0
  46. tests/test_logging.py +125 -0
  47. tests/test_markdown_normalizer.py +63 -0
  48. tests/test_marker_parser.py +73 -0
  49. tests/test_merge.py +134 -0
  50. tests/test_parser_disagreement.py +177 -0
.env.example ADDED
@@ -0,0 +1,16 @@
+ # Copy to .env and fill in. .env is gitignored; .env.example is committed.
+ # Loaded automatically by zsgdp.config.load_env_file() when CLI / app starts.
+
+ # Hugging Face Hub access token. Required for gated models like jina-v3
+ # (the embedding retriever) and any private model id used in gpu.models.
+ # Read transparently by transformers / sentence-transformers when set.
+ HF_TOKEN=
+
+ # Logging — see zsgdp/logging_config.py.
+ # ZSGDP_LOG_LEVEL=INFO
+ # ZSGDP_LOG_JSON=1
+
+ # Pipeline overrides.
+ # ZSGDP_CONFIG_PATH=configs/docling.yaml
+ # ZSGDP_MAX_UPLOAD_BYTES=52428800
+ # ZSGDP_MAX_PAGE_COUNT=200
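
A minimal sketch of the non-overriding loader this file assumes (illustrative; the real `zsgdp.config.load_env_file()` ships in this commit but is outside the 50-file view):

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    # Pre-set environment variables always win, so Space-side secrets
    # are never overridden by a stray local file.
    env = Path(path)
    if not env.exists():
        return
    for line in env.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        os.environ.setdefault(key.strip(), value.strip())
```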
.gitignore ADDED
@@ -0,0 +1,15 @@
+ __pycache__/
+ *.py[cod]
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+ .venv/
+ venv/
+ out/
+ parsed/
+ benchmarks/results/
+
+ # Secrets — never commit. Loaded by zsgdp.config.load_env_file() at runtime.
+ .env
+ .env.*
+ !.env.example
.pre-commit-config.yaml ADDED
@@ -0,0 +1,44 @@
+ # Pre-commit and pre-push hooks for zeroshotGPU.
+ #
+ # Install once with:
+ #   python -m pip install pre-commit
+ #   pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push
+ #
+ # pre-commit runs only fast static checks on every commit so the developer
+ # loop stays tight. The slow `preflight` runs at pre-push time so it gates
+ # what reaches the remote without slowing down individual commits.
+
+ default_language_version:
+   python: python3.11
+
+ repos:
+   - repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v5.0.0
+     hooks:
+       - id: trailing-whitespace
+         stages: [pre-commit]
+       - id: end-of-file-fixer
+         stages: [pre-commit]
+       - id: check-yaml
+         stages: [pre-commit]
+         # The simple YAML in configs/*.yaml uses a tiny subset; check-yaml
+         # is fine. `app_file` etc. in README.md aren't real YAML headers
+         # — they're HF Spaces front-matter and excluded from this hook.
+         exclude: ^README\.md$
+       - id: check-json
+         stages: [pre-commit]
+       - id: check-added-large-files
+         stages: [pre-commit]
+         args: ["--maxkb=2048"]
+       - id: check-merge-conflict
+         stages: [pre-commit]
+
+   - repo: local
+     hooks:
+       - id: zsgdp-preflight
+         name: zsgdp preflight (unit + regression + space-check + parsers)
+         entry: python -m zsgdp.cli preflight --root .
+         language: system
+         pass_filenames: false
+         stages: [pre-push]
+         always_run: true
CHANGELOG.md ADDED
@@ -0,0 +1,274 @@
+ # Changelog
+
+ All notable changes to zeroshotGPU. Format follows
+ [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); versions follow
+ [Semantic Versioning](https://semver.org/spec/v2.0.0.html), but the project is
+ pre-1.0 so minor bumps may include breaking changes.
+
+ ## [Unreleased]
+
+ ### Documentation — README restructured
+
+ - Reorganised into Install → Quick start → Opt-ins → Outputs →
+   Architecture map → Production benchmark numbers → Deployment →
+   Contributing.
+ - New "Production benchmark numbers" placeholder table with §29
+   success criteria recalled inline; columns are
+   `Metric / Dataset / Value / Date / Run` so the operator pastes real
+   numbers in after running `make space-smoke` and `make benchmark`
+   on the Space.
+ - Optional-extras table (`embedding`, `gpu_repair`, `spaces`)
+   documents what each extra adds and the config flag that requires it.
+ - Architecture quick-map turned into a table; one row per top-level
+   module with its responsibility.
+ - Deployment section is now a numbered checklist that ends with
+   "update the production-benchmark table."
+
+ ### Added — Space smoke validation runner
+
+ - `scripts/run_space_smoke.py` automates the five smokes documented in
+   `docs/space_smoke.md`. One command runs whichever smokes have their
+   deps installed; missing deps surface as `skip` results with explicit
+   `pip install` hints, not crashes.
+ - Five smokes: `lexical` (model-free benchmark), `ablation` (per-parser
+   runner), `embedding` (sentence-transformers + jina-v3 lazy-load
+   probe), `gpu_repair` (dry-run plan + repair-loop iteration check —
+   *does not* download multi-GB Qwen weights, defers live invocation
+   to `run-gpu-tasks --execute`), `marker` (binary detection +
+   registry availability).
+ - `--strict` mode treats skipped smokes as failures; `--output PATH`
+   emits a structured JSON report with per-smoke `detail`, elapsed
+   seconds, status (`pass`/`fail`/`skip`/`error`), and aggregate
+   summary counts.
+ - 14 new tests covering registry membership, report aggregation,
+   text formatting per status, strict-mode skip-as-failure, end-to-end
+   smoke execution for the three model-free smokes, and skip-path
+   structure for the model-dependent ones.
+
+ ### Added — per-artifact downloads in the Space UI
+
+ - New "Artifacts" tab in `app.py` exposes each top-level artifact
+   (`parsed_document.json`, `document.md`, `chunks.jsonl`,
+   `quality_report.json`, etc. — 16 candidate files) as an individual
+   download via `gr.Files`. The bundled zip stays as it was for
+   archival, and nested asset dirs (`assets/pages/*.png`,
+   `assets/tables/*.png`) are intentionally excluded from the
+   per-artifact list — they can be large and the zip already covers
+   them.
+ - The artifact list is built from `_INDIVIDUAL_ARTIFACT_NAMES` in
+   declaration order so the UI listing is stable across runs. Missing
+   files are silently skipped (different parses emit different subsets;
+   e.g. `conflict_report.json` only when multiple parsers ran).
+ - All return paths in `parse_uploaded_document` now go through a
+   single `_empty_outputs(...)` helper so the tuple width can't drift
+   between success and the four error paths. A new drift-guard test
+   asserts `len(outputs) == 11` for every error path (see the sketch
+   after this section).
+ - Summary JSON now includes `individual_artifact_count`.
+
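+ A minimal sketch of that drift guard (test name and the way the error
+ path is driven are illustrative; the 11-wide tuple contract comes from
+ the `app.py` committed in this change):
+
+ ```python
+ import unittest
+
+ import app  # the Space entrypoint from this commit; needs gradio installed
+
+
+ class TestOutputWidth(unittest.TestCase):
+     def test_empty_outputs_matches_ui_width(self):
+         # _empty_outputs centralises every error return; the Gradio click
+         # handler wires exactly 11 output components, so the tuple must
+         # stay 11-wide.
+         outputs = app._empty_outputs("boom", None, rejected=False, runtime={})
+         self.assertEqual(len(outputs), 11)
+ ```
+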
+ ### Added — CLI help with examples
+
+ - Each non-trivial CLI subcommand (`parse`, `parse-folder`, `space-check`,
+   `run-gpu-tasks`, `benchmark`, `benchmark-ablate`, `preflight`,
+   `combine-benchmarks`, `export-chunks`, `validate-artifacts`, plus the
+   top-level help) now ships with an `Examples:` block in its `--help`
+   output. Multi-line shell snippets render via
+   `argparse.RawDescriptionHelpFormatter` + a textwrap-dedent helper so
+   the source-side indentation doesn't leak into the rendered output.
+ - `zsgdp run-gpu-tasks --help` now explicitly contrasts the dry-run
+   default against `--execute`, matching the safety contract of
+   `repair.execute_gpu_escalations` in config.
+ - 9 new tests guarding: the epilog dedent helper, blank-line preservation
+   in epilogs, top-level help listing examples, and per-subcommand
+   examples covering their distinguishing flags (e.g. `benchmark` shows
+   all three dataset modes; `combine-benchmarks` shows label pairing).
+
+ ### Added — contributor onboarding
+
+ - `CONTRIBUTING.md` documenting setup, hooks, test layout, fixture
+   format, parser/metric/schema-bump procedure, logging conventions,
+   PR checklist, and an architecture quick-map.
+ - `.pre-commit-config.yaml` with two stages:
+   - **pre-commit**: trailing whitespace, end-of-file fixer, JSON/YAML
+     syntax, large-file guard (2 MB cap), merge-conflict markers.
+   - **pre-push**: runs `python -m zsgdp.cli preflight` so failing
+     preflight blocks the push. External hook repo is pinned to a
+     specific tag (no `master`/`HEAD` references).
+ - `tests/test_repo_hygiene.py` (6 tests) — guards that `.env` is in
+   `.gitignore`, `.env.example` is committed and contains no
+   real-shape secrets, the pre-commit config has the preflight hook on
+   the pre-push stage with a pinned external repo, `CONTRIBUTING.md`
+   references the preflight workflow and Space smoke checklist, and
+   `CHANGELOG.md` has an `[Unreleased]` section.
+
+ ### Added — performance baselines
+
+ - Regression fixture format gains an optional `performance` block:
+   `repeats`, `max_elapsed_seconds`, `min_pages_per_second`,
+   `always_enforce`. The runner parses each fixture N times and compares
+   the median against the floor — the cold-import outlier on the first
+   run is stripped automatically (see the sketch after this section).
+ - Enforcement is opt-in via `ZSGDP_REGRESSION_PERF=1`; per-fixture
+   override via `always_enforce: true`. Floors are intended as
+   catastrophic-regression guards, not tight perf bars.
+ - Seed fixture `markdown_basic` ships with a 2.0s / 0.5pps floor
+   (~80x slack against measured ~6ms median) so it exercises the path
+   without flaking on slow CI.
+ - 5 new unit tests for the perf evaluator: max-elapsed and
+   min-pps trip correctly, median strips cold outliers, env-var gating
+   honours `always_enforce`.
+
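+ A minimal sketch of the floor check (helper name and exact stripping
+ rule are illustrative; the keys and the median-vs-floor comparison come
+ from the fixture format above):
+
+ ```python
+ import statistics
+
+
+ def perf_floor_violations(timings_s: list[float], pages: int, spec: dict) -> list[str]:
+     """Compare the median timing against the fixture's perf floors.
+
+     timings_s holds one elapsed time per repeat; the first repeat is
+     dropped as the cold-import outlier when there are enough samples.
+     """
+     warm = timings_s[1:] if len(timings_s) > 1 else timings_s
+     median = statistics.median(warm)
+     problems = []
+     if "max_elapsed_seconds" in spec and median > spec["max_elapsed_seconds"]:
+         problems.append(f"median {median:.3f}s exceeds {spec['max_elapsed_seconds']}s")
+     if "min_pages_per_second" in spec and pages / median < spec["min_pages_per_second"]:
+         problems.append(f"{pages / median:.2f} pps below {spec['min_pages_per_second']} pps")
+     return problems
+ ```
+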
+ ### Added — preflight + secrets
+
+ - Preflight runner: `zsgdp preflight` CLI subcommand and `make preflight`
+   target. Chains `unittest discover`, regression fixtures, `space-check`,
+   and `parsers` registry sanity. `--benchmark` adds an end-to-end smoke
+   against the regression fixtures. Each step's output is suppressed on
+   success and surfaced on failure; a one-line summary is always printed.
+ - `Makefile` with targets `test`, `regression`, `space-check`, `parsers`,
+   `preflight`, `preflight-full`, `benchmark`, `clean`.
+ - `.env` loading via `zsgdp.config.load_env_file()`. Read at CLI start
+   and `app.py` import; pre-set environment variables always win. Never
+   overrides Space-side secrets. `.env.example` shipped as the template.
+ - `.env`/`.env.*` added to `.gitignore` (`.env.example` whitelisted).
+ - `zsgdp.config.hf_token()` resolves `HF_TOKEN`,
+   `HUGGING_FACE_HUB_TOKEN`, and `HUGGINGFACE_TOKEN` in priority order
+   (see the sketch after this section).
+
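+ A minimal sketch of that resolution order (the three variable names and
+ their priority come from the entry above; the implementation itself is
+ illustrative):
+
+ ```python
+ import os
+
+
+ def hf_token() -> str | None:
+     # First match wins: HF_TOKEN outranks the two legacy spellings.
+     for key in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN", "HUGGINGFACE_TOKEN"):
+         value = os.environ.get(key)
+         if value:
+             return value
+     return None
+ ```
+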
+ ### Added — structured logging
+
+ - `zsgdp.logging_config` with idempotent `configure_logging()`. Default
+   level WARNING; opt in via `ZSGDP_LOG_LEVEL`. Optional one-line JSON
+   records via `ZSGDP_LOG_JSON=1`; structured `extra={...}` fields are
+   promoted to top-level keys for HF Spaces logs / Loki / Datadog.
+ - Wired into the pipeline (`parse_start`, `parser_candidate`,
+   `parser_failed`, `parse_end`), repair controller (`repair_iteration`),
+   GPU worker (`gpu_task_executed`, `gpu_task_blocked`), CLI, and
+   `app.py`. The app auto-enables JSON mode when `SPACE_ID` is set.
+
+ ### Added — deployment-readiness pass
+
+ - Pinned upper bounds on all `requirements.txt` and `pyproject.toml`
+   dependencies. Added explicit `embedding` and `gpu_repair` extras so the
+   optional sentence-transformers / transformers stacks can be installed
+   without dragging the whole `spaces` extra in.
+ - Abuse / cost guards in the Gradio Space entrypoint (`app.py`):
+   `MAX_UPLOAD_BYTES` (50 MB default) and `MAX_PAGE_COUNT` (200 default),
+   both overridable via `ZSGDP_MAX_*` env vars. Oversized uploads are
+   rejected with a clear UI error before parsing starts.
+ - `SCHEMA_VERSION` constant and `ParsedDocument.schema_version` field.
+   Surfaced into the artifact manifest as
+   `parsed_document_schema_version` alongside the existing manifest
+   `schema_version`. The validation report echoes both so consumers can gate.
+ - Regression fixture format under `tests/regression/`: a
+   `*.expected.json` tolerance spec paired with an input document. The
+   runner auto-discovers fixtures and asserts on tolerances (counts, score,
+   markdown contains/excludes, repair/disagreement rate ranges). One seed
+   fixture shipped (`markdown_basic`).
+
+ ### Added — eval surface
+
+ - Per-parser GT-comparison metrics within a single merged run
+   (`zsgdp/benchmarks/per_parser_metrics.py`). Reads pre-merge candidate
+   snapshots from `parsed.provenance.candidates` and computes layout F1 /
+   table structure / formula CER per parser against the same GT.
+   Surfaced as `per_parser_metrics.csv` and the per-doc field
+   `per_parser_metrics`.
+ - Per-parser cross-doc leaderboard rollup
+   (`per_parser_gt_leaderboard.csv`) with truth-aware filtering: a metric
+   contributes to a parser's mean only when that parser was actually
+   evaluated against truths for that metric on that document.
+ - Cross-dataset comparison (`zsgdp/benchmarks/cross_dataset.py`) with a
+   `combine-benchmarks` CLI subcommand. Combines multiple
+   `results.json` summaries into `dataset_summary.csv` and a
+   parser-vs-dataset matrix. Missing metrics surface as `None` rather
+   than 0.0 so callers can distinguish absent from true-zero.
+ - Embedding-based retriever (`zsgdp/benchmarks/embedding_retriever.py`)
+   satisfying the `Retriever` protocol. Defaults to lexical (model-free,
+   CI-safe); opt in via `benchmarks.retriever.backend=embedding` in
+   config. Lazy-loads `sentence-transformers` on first use; falls back
+   cleanly when unavailable.
+ - Layout F1 against ground-truth bbox annotations
+   (`zsgdp/verify/layout_f1.py`). Class-aware and class-agnostic scores
+   side-by-side, per-category breakdown. DocLayNet COCO and OmniDocBench
+   JSON adapters in `zsgdp/benchmarks/ground_truth.py`.
+ - Table structure similarity (`zsgdp/verify/table_structure.py`):
+   shape similarity × multiset cell-content F1, greedy bipartite
+   matching.
+ - Formula extraction CER (`zsgdp/verify/formula_extraction.py`):
+   Levenshtein-based, normalized for whitespace and `$`/`$$` delimiters
+   (see the sketch after this section).
+ - Retrieval-readiness metrics (`zsgdp/verify/retrieval.py`): recall@k,
+   citation accuracy@k, mean reciprocal rank. Synthetic QA generator
+   (`zsgdp/benchmarks/retrieval.py`) using distinctive sentences.
+ - Parser-disagreement rate
+   (`zsgdp/verify/parser_disagreement.py`): conflict count over parser
+   pair count, from the merger's existing conflict report.
+ - Repair success / regression rates
+   (`zsgdp/verify/repair_success.py`): pre/post issue identity diff;
+   iteration history, score delta, action counts.
+ - Parser contribution counts: which parser's elements survived the
+   merge, surfaced as per-doc and aggregate fractions.
+ - Parser ablation runner (`zsgdp/benchmarks/ablation_runner.py`) with a
+   `benchmark-ablate` CLI subcommand. Runs the benchmark once per parser
+   in isolation plus a merged arm, and emits a comparison CSV.
+ - Three dataset loaders (`zsgdp/benchmarks/datasets.py`):
+   `custom_folder`, `omnidocbench`, `doclaynet`. `DatasetDocument`
+   dataclass; registry pattern for downstream extension.
+
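+ A minimal sketch of the CER computation (normalisation beyond whitespace
+ and `$`/`$$` stripping is illustrative; Levenshtein distance over the
+ reference length is the standard CER definition):
+
+ ```python
+ def _levenshtein(a: str, b: str) -> int:
+     # Classic two-row dynamic programme for edit distance.
+     prev = list(range(len(b) + 1))
+     for i, ca in enumerate(a, 1):
+         cur = [i]
+         for j, cb in enumerate(b, 1):
+             cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
+         prev = cur
+     return prev[-1]
+
+
+ def formula_cer(predicted: str, reference: str) -> float:
+     # Normalise: drop $ / $$ delimiters and all whitespace before comparing.
+     norm = lambda s: "".join(s.replace("$$", "").replace("$", "").split())
+     p, r = norm(predicted), norm(reference)
+     if not r:
+         # Empty reference: perfect if the prediction is also empty,
+         # else capped at 1.0 (a sketch-level choice).
+         return 0.0 if not p else 1.0
+     return _levenshtein(p, r) / len(r)
+ ```
+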
+ ### Added — pipeline
+
+ - Iterative repair loop in `pipeline.py`: bounded by
+   `repair.max_iterations`, terminates on quality-accepted OR
+   no-changes-this-pass. Per-iteration history lands under
+   `provenance.repair_iterations`.
+ - GPU repair escalation wired into `repair/controller.py`. Plans
+   same-schema GPU tasks for invalid tables, OCR/text coverage issues,
+   reading-order failures, and figure issues, then dispatches via
+   `GPUWorker`. The default is safe (`repair.gpu_escalation=true`,
+   `repair.execute_gpu_escalations=false`); flip the second to invoke
+   the configured backend.
+ - Per-parser candidate snapshots persisted in
+   `parsed.provenance.candidates` so per-parser GT metrics can be
+   computed without re-parsing.
+ - Real Marker and Unstructured normalizers
+   (`zsgdp/normalize/normalize_marker.py` and
+   `normalize_unstructured.py`) wired through `parsers/external.py`.
+
+ ### Changed
+
+ - `requirements.txt` no longer pins `torch`; the HF Spaces image
+   preinstalls a CUDA-matched build and pinning here would override it.
+ - `--gpu-workers` flag help text clarified — the value is recorded for
+   downstream task-execution accounting, but document parsing uses
+   `--workers`.
+ - The `--dataset` benchmark flag now selects the loader name
+   (default `custom_folder`); `custom`/`folder`/`default` are accepted as
+   aliases. The previous behaviour was a freeform reporting label only.
+ - The embedding-retriever toy hashing test now uses
+   `hashlib.md5`-based stable hashing instead of `builtins.hash()`,
+   fixing per-process flakiness.
+
+ ### Documentation
+
+ - `tests/regression/README.md` documents the fixture format.
+ - `configs/default.yaml` and `configs/docling.yaml` are annotated to
+   explain the new `repair.execute_gpu_escalations` flag and the deliberate
+   Docling+PyMuPDF dual-enable for the disagreement metric.
+
+ ### Test count
+
+ - 181 tests pass (was 4 at the start of the eval-surface work).
+
+ ## [0.1.0] — initial MVP
+
+ - Profiler, page router, parser registry (text, pymupdf, docling, plus
+   shell-out adapters for marker / mineru / olmocr / paddleocr /
+   unstructured).
+ - Canonical schema (`Element`, `TableObject`, `FigureObject`, `Chunk`,
+   `ParsedDocument`, `QualityReport`).
+ - Merger with conflict detection, quality verifier (coverage, reading
+   order, table validity, chunk readiness), deterministic repair
+   controller.
+ - Agentic chunker with fixed-token / recursive-structure / parent-child
+   / page-level / table / figure strategies; semantic / late /
+   vision-guided / proposition stubs.
+ - Artifact manifest with SHA-256 checksums, `validate-artifacts` CLI.
+ - Gradio Spaces entrypoint, `space-check` deployment-readiness CLI.
CONTRIBUTING.md ADDED
@@ -0,0 +1,235 @@
+ # Contributing to zeroshotGPU
+
+ Thanks for working on this. Three things to know up front:
+
+ 1. **Run `make preflight` before pushing.** It's the same suite that runs
+    at pre-push if you have the hooks installed (see below). A green
+    preflight is the local signal that the branch is ready for the
+    [Space smoke checklist](docs/space_smoke.md).
+ 2. **Keep it dependency-light by default.** New runtime dependencies need
+    a corresponding entry in `pyproject.toml` extras and an explicit
+    gate (config flag, lazy import, or feature-detection fallback). The
+    `embedding` extra is the model: opt-in, lazy-imported on first use,
+    raises a clean `RuntimeError` when missing.
+ 3. **Don't change schema shapes silently.** Bump
+    `zsgdp.schema.SCHEMA_VERSION` whenever the on-disk shape of
+    `parsed_document.json`, `chunks.jsonl`, etc. changes. See
+    [Schema versioning](#schema-versioning) below.
+
+ ---
+
+ ## Setup
+
+ ```bash
+ git clone <repo>
+ cd "Document Parser"
+ python3.11 -m venv .venv && source .venv/bin/activate
+ python -m pip install -e ".[pdf,yaml,docling,dev]"
+ ```
+
+ Optional extras:
+
+ - `.[embedding]` — sentence-transformers + transformers for the embedding
+   retriever. Only needed when you set `benchmarks.retriever.backend=embedding`.
+ - `.[gpu_repair]` — transformers for live GPU repair. Only needed when you
+   set `repair.execute_gpu_escalations=true`.
+ - `.[spaces]` — mirrors the root `requirements.txt` so an editable install
+   matches a Space deploy.
+
+ Set up `.env` for local secrets:
+
+ ```bash
+ cp .env.example .env
+ # Fill in HF_TOKEN if you need gated models.
+ ```
+
+ `.env` is gitignored. The CLI and `app.py` load it automatically; pre-set
+ environment variables always win, so a Space's secrets never get
+ overridden by a stray local file.
+
+ ---
+
+ ## Pre-commit / pre-push hooks
+
+ ```bash
+ python -m pip install pre-commit
+ pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push
+ ```
+
+ Two stages:
+
+ - **pre-commit** — fast static checks: trailing whitespace, end-of-file
+   newline, JSON/YAML syntax, large-file guard, merge-conflict markers.
+   Runs on every `git commit`.
+ - **pre-push** — runs `python -m zsgdp.cli preflight`. Same as
+   `make preflight`. Failing this blocks the push.
+
+ Skip on a specific commit with `git commit --no-verify` if you genuinely
+ need to (e.g. WIP). Skip the pre-push gate with `git push --no-verify`,
+ but only if you have a separately verified preflight run.
+
+ ---
+
+ ## Running tests
+
+ ```bash
+ make test            # full unittest discover
+ make regression      # snapshot fixture suite
+ make preflight       # everything except the benchmark smoke
+ make preflight-full  # adds an end-to-end benchmark smoke
+ make benchmark       # parses tests/regression/fixtures/ via the CLI
+ ```
+
+ Or directly:
+
+ ```bash
+ python -m unittest discover
+ python -m unittest tests.regression.test_regression
+ python -m zsgdp.cli preflight --root . --benchmark
+ ```
+
+ Performance regressions are gated behind `ZSGDP_REGRESSION_PERF=1`:
+
+ ```bash
+ ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression
+ ```
+
+ See [tests/regression/README.md](tests/regression/README.md) for the
+ fixture format, including the `performance` block.
+
+ ---
+
+ ## Adding a regression fixture
+
+ 1. Drop the input under `tests/regression/fixtures/<name>.input.<ext>`.
+ 2. Parse it once locally and inspect the output:
+    ```bash
+    python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity
+    ```
+ 3. Hand-write `tests/regression/fixtures/<name>.expected.json` with the
+    tolerances you want to lock down (a sketch follows this list). Prefer
+    ranges over exact counts where reasonable variance exists.
+ 4. Optional: add a `performance` block with `max_elapsed_seconds` set to
+    ~50–100x your local median (a catastrophic-regression guard, not a
+    tight bar).
+ 5. Run `make regression` to confirm the fixture is picked up.
+
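+ A minimal sketch of an `*.expected.json` tolerance spec (the top-level
+ key names are illustrative; the tolerance families, counts, score,
+ markdown contains/excludes, and the `performance` keys come from the
+ changelog and the regression README):
+
+ ```json
+ {
+   "element_count": {"min": 10, "max": 14},
+   "quality_score": {"min": 0.88},
+   "markdown_contains": ["# Heading"],
+   "markdown_excludes": ["\ufffd"],
+   "performance": {
+     "repeats": 5,
+     "max_elapsed_seconds": 2.0,
+     "min_pages_per_second": 0.5,
+     "always_enforce": false
+   }
+ }
+ ```
+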
+ ---
+
+ ## Adding a parser adapter
+
+ 1. Subclass `BaseParser` in `zsgdp/parsers/<name>_parser.py` (or extend
+    `external.py` for shell-out adapters).
+ 2. Set `name` and `supported_file_types`; implement `available()` and
+    `parse(path, profile, config, *, pages=None)` (see the skeleton after
+    this list).
+ 3. Register in `zsgdp/parsers/registry.py`.
+ 4. If the parser produces Markdown, write a normalizer under
+    `zsgdp/normalize/normalize_<name>.py` that returns a `ParseCandidate`
+    via `normalize_markdown_candidate(...)`.
+ 5. Add a config block to `configs/default.yaml` with `enabled: false`
+    plus any CLI flags the adapter needs.
+ 6. Add the dependency to `pyproject.toml` as an optional extra. Don't
+    pin it in the top-level `requirements.txt` unless it's free to
+    install on every Space build.
+
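+ A minimal skeleton under the contract above (the hook names and `parse`
+ signature come from this list; the `BaseParser` import path, return
+ shape, and lazy-import pattern are assumptions):
+
+ ```python
+ # zsgdp/parsers/acme_parser.py — hypothetical adapter for an "acme" library.
+ from zsgdp.parsers.base import BaseParser  # assumed module path
+
+
+ class AcmeParser(BaseParser):
+     name = "acme"
+     supported_file_types = ("pdf",)
+
+     def available(self) -> bool:
+         # Feature-detect instead of importing at module load, so the
+         # registry can list the adapter without the dependency installed.
+         try:
+             import acme  # noqa: F401  (optional extra, hypothetical)
+         except ImportError:
+             return False
+         return True
+
+     def parse(self, path, profile, config, *, pages=None):
+         import acme
+
+         # Parse only the routed pages when given; conversion into the
+         # canonical schema belongs in zsgdp/normalize/normalize_acme.py.
+         return acme.parse(str(path), pages=pages)
+ ```
+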
+ ---
+
+ ## Adding a metric
+
+ Pure metrics live under `zsgdp/verify/`:
+
+ 1. Define inputs as plain dicts/lists (not `ParsedDocument`-keyed) so
+    the same metric works on per-parser candidate snapshots, not just
+    the merged document.
+ 2. Pin definitions in the module docstring — exact denominator,
+    handling of empty inputs, what each return key means.
+ 3. Surface it in `zsgdp/benchmarks/parser_quality.py`:
+    - Add per-document fields to the `doc_record`.
+    - Add aggregated means to the top-level `summary` dict.
+    - Add a per-document CSV writer if it has detail worth its own file.
+ 4. Add tests for: perfect input, no-match input, partial overlap, the
+    vacuous empty/empty case, and a benchmark-integration test that
+    asserts the metric appears in `summary["documents"][0]`. A sketch of
+    the module shape follows this list.
+
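+ A minimal sketch of the module shape those rules describe (the metric
+ and its field names are illustrative):
+
+ ```python
+ # zsgdp/verify/caption_coverage.py — hypothetical metric module.
+ """Fraction of figures that carry a caption.
+
+ Denominator: number of figures. Empty input: coverage 1.0 (vacuously
+ covered) with figure_count 0. Keys: caption_coverage, figure_count.
+ """
+
+
+ def caption_coverage(figures: list[dict]) -> dict:
+     # Plain dicts in, plain dict out, so candidate snapshots work too.
+     if not figures:
+         return {"caption_coverage": 1.0, "figure_count": 0}
+     captioned = sum(1 for f in figures if f.get("caption"))
+     return {"caption_coverage": captioned / len(figures), "figure_count": len(figures)}
+ ```
+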
+ ---
+
+ ## Schema versioning
+
+ `zsgdp.schema.SCHEMA_VERSION` lives in
+ [zsgdp/schema/document.py](zsgdp/schema/document.py). It's surfaced into
+ `artifact_manifest.json` as `parsed_document_schema_version` so a
+ consumer reading old output can gate on it.
+
+ Bump rules:
+
+ - **Additive change** (new optional field with a default) — bump the
+   minor version (1.0 → 1.1).
+ - **Breaking change** (renamed/removed field, semantics changed) — bump
+   the major version (1.0 → 2.0). Update the regression fixtures in the
+   same PR; downstream consumers will need a migration.
+ - **No change** — leave it alone.
+
+ When you bump, add an entry to `CHANGELOG.md` under
+ "### Schema" with the version and what changed.
+
+ ---
+
+ ## Logging
+
+ Use `from zsgdp.logging_config import get_logger`, then
+ `logger = get_logger(__name__)`. Call `.info`/`.warning`/`.error` with
+ structured `extra={...}` fields rather than f-string-formatted messages
+ where possible — the JSON formatter promotes `extra` keys to top-level
+ fields so the HF Spaces logs page is greppable. A usage sketch follows.
+
+ Default log level is WARNING (CLI summaries are unaffected). Opt in with
+ `ZSGDP_LOG_LEVEL=INFO` and `ZSGDP_LOG_JSON=1` for Space-style output.
+
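+ A minimal usage sketch (the import and the `extra={...}` convention come
+ from this section; the event name and fields are illustrative):
+
+ ```python
+ from zsgdp.logging_config import get_logger
+
+ logger = get_logger(__name__)
+
+ # Event name as the message, details as structured fields. In JSON mode
+ # the formatter promotes doc_id and parser to top-level keys.
+ logger.info("parser_candidate", extra={"doc_id": "doc-001", "parser": "docling"})
+ ```
+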
+ ---
+
+ ## Pull request checklist
+
+ Before opening a PR:
+
+ - [ ] `make preflight` passes locally.
+ - [ ] If you added a metric or an adapter, or changed the schema, you
+       updated `CHANGELOG.md`.
+ - [ ] If you changed parser behavior, you ran `make regression` and any
+       fixture drift is intentional (and the snapshot was regenerated
+       explicitly).
+ - [ ] If your change touches GPU/model code paths, you flagged it for
+       Space-side smoke testing in the PR description (the
+       [smoke checklist](docs/space_smoke.md) covers what to run).
+ - [ ] You did **not** commit `.env` or any secret. The `.gitignore`
+       should catch this; if you suspect a leak, treat the token as
+       compromised and rotate it.
+
+ ---
+
+ ## Architecture quick map
+
+ - `zsgdp/profiling/` — page-level features and labels.
+ - `zsgdp/routing/` — deterministic page → expert mapping.
+ - `zsgdp/parsers/` — adapters; one canonical schema regardless of source.
+ - `zsgdp/normalize/` — convert each parser's output into the schema.
+ - `zsgdp/merge/` — align candidates, dedupe, detect conflicts.
+ - `zsgdp/verify/` — coverage, reading order, table/figure/formula/chunk
+   quality, GT-comparison metrics (layout F1, table structure, formula
+   CER, retrieval recall), parser disagreement and repair success rates.
+ - `zsgdp/repair/` — deterministic header/table fixes plus GPU
+   escalation that dispatches to `gpu/worker.py`.
+ - `zsgdp/chunking/` — agentic planner + structure-aware / parent-child /
+   table / figure / page chunk builders, with semantic / late /
+   vision-guided / proposition deterministic stubs.
+ - `zsgdp/gpu/` — task planning, batching, dry-run worker, transformers
+   and vLLM clients.
+ - `zsgdp/benchmarks/` — dataset loaders, metric runners, ablation,
+   cross-dataset comparison, retrieval (lexical + embedding).
+ - `zsgdp/cli.py` — single entry point exposing all of the above.
+ - `app.py` — Gradio Space front-end.
+
+ The full spec lives in
+ [zero_shot_gpu_document_parser_project_spec.md](zero_shot_gpu_document_parser_project_spec.md).
+ The 2000-line read isn't required to contribute, but §10 (schema)
+ and §17 (chunking ladder) are worth skimming if you're touching those
+ modules.
Makefile ADDED
@@ -0,0 +1,49 @@
+ PYTHON ?= python3.11
+
+ .PHONY: help test regression space-check parsers preflight preflight-full benchmark space-smoke space-smoke-strict clean
+
+ help:
+ 	@echo "Targets:"
+ 	@echo "  test               - run the full unittest discover suite"
+ 	@echo "  regression         - run the regression fixture snapshot suite"
+ 	@echo "  space-check        - run the HF Space readiness check"
+ 	@echo "  parsers            - print the parser registry"
+ 	@echo "  preflight          - run test + regression + space-check + parsers"
+ 	@echo "  preflight-full     - preflight + an end-to-end benchmark smoke"
+ 	@echo "  benchmark          - run zsgdp benchmark against tests/regression/fixtures"
+ 	@echo "  space-smoke        - run docs/space_smoke.md smokes (deps-permitting)"
+ 	@echo "  space-smoke-strict - same, but treat skipped smokes as failures"
+ 	@echo "  clean              - remove __pycache__ and benchmark output"
+
+ test:
+ 	$(PYTHON) -m unittest discover
+
+ regression:
+ 	$(PYTHON) -m unittest tests.regression.test_regression -v
+
+ space-check:
+ 	$(PYTHON) -m zsgdp.cli space-check --root .
+
+ parsers:
+ 	$(PYTHON) -m zsgdp.cli parsers
+
+ preflight:
+ 	$(PYTHON) -m zsgdp.cli preflight --root .
+
+ preflight-full:
+ 	$(PYTHON) -m zsgdp.cli preflight --root . --benchmark
+
+ benchmark:
+ 	$(PYTHON) -m zsgdp.cli benchmark \
+ 		--input tests/regression/fixtures \
+ 		--output out/preflight_benchmark
+
+ space-smoke:
+ 	$(PYTHON) -m scripts.run_space_smoke --output out/space_smoke_report.json
+
+ space-smoke-strict:
+ 	$(PYTHON) -m scripts.run_space_smoke --strict --output out/space_smoke_report.json
+
+ clean:
+ 	find . -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null || true
+ 	rm -rf out/preflight_benchmark
README.md ADDED
@@ -0,0 +1,287 @@
+ ---
+ title: zeroshotGPU
+ sdk: gradio
+ app_file: app.py
+ python_version: 3.11
+ suggested_hardware: l4x1
+ short_description: Agentic zero-shot document parser with parser metrics and chunk artifacts.
+ ---
+
+ # Zero-Shot GPU Document Parser
+
+ A self-hosted parsing control plane that profiles documents, routes pages to
+ parser experts, normalizes outputs, verifies quality with GT-comparison
+ metrics, repairs weak regions through a bounded verify/repair loop (with
+ optional GPU escalation), and emits auditable parsed-document artifacts plus
+ strategy-aware chunks. Implements the project described in
+ [`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md).
+
+ The codebase is intentionally dependency-light by default. Text and Markdown
+ work with the standard library; PyMuPDF, Docling, Marker, MinerU, olmOCR,
+ PaddleOCR, and Unstructured plug in via optional extras. Live GPU repair
+ (Qwen2.5-VL-3B) and the embedding retriever (jina-embeddings-v3) are gated
+ behind explicit config flags so a fresh clone never silently downloads
+ multi-gigabyte weights.
+
+ ---
+
+ ## Install
+
+ For the local MVP (text + PyMuPDF + Docling):
+
+ ```bash
+ python -m pip install -e ".[pdf,yaml,docling,dev]"
+ ```
+
+ Optional extras:
+
+ | Extra         | Adds                                             | Required for                                   |
+ |---------------|--------------------------------------------------|------------------------------------------------|
+ | `embedding`   | `sentence-transformers`, `transformers`          | `benchmarks.retriever.backend=embedding`       |
+ | `gpu_repair`  | `transformers`                                   | `repair.execute_gpu_escalations=true`          |
+ | `spaces`      | mirrors `requirements.txt` for HF Spaces parity  | running `app.py` locally as a Space simulant   |
+
+ External parser CLIs (Marker, MinerU, olmOCR, PaddleOCR) install separately;
+ configure each via `parsers.<name>.command`, `output_args`, and `extra_args`
+ in your YAML config.
+
+ Secrets:
+
+ ```bash
+ cp .env.example .env
+ # Set HF_TOKEN if you'll use gated models (jina-embeddings-v3, private repos).
+ ```
+
+ `.env` is gitignored. The CLI and `app.py` load it on startup; pre-set
+ environment variables (e.g. Space-side secrets) always win.
+
+ ---
+
+ ## Quick start
+
+ ### Parse one document or a folder
+
+ ```bash
+ python -m zsgdp.cli parse --input ./docs/sample.md --output ./out/sample
+ python -m zsgdp.cli parse-folder --input ./docs --output ./parsed --workers 4
+ python -m zsgdp.cli parse --input ./docs/report.pdf --output ./out/report --config configs/docling.yaml
+ ```
+
+ Each parse writes a full artifact bundle. `parsed_document.json` is the
+ canonical record; `chunks.jsonl` is the retrieval-ready output;
+ `quality_report.json` carries every metric the verifier computed.
+
+ ### Run a benchmark
+
+ ```bash
+ # Custom corpus, no GT — runs every metric that doesn't need labels:
+ python -m zsgdp.cli benchmark --input ./docs --output ./bench
+
+ # Labelled datasets — adds layout F1 / table structure / formula CER:
+ python -m zsgdp.cli benchmark --input ./omnidocbench --dataset omnidocbench --output ./bench/omni
+ python -m zsgdp.cli benchmark --input ./doclaynet --dataset doclaynet --output ./bench/doclay
+ ```
+
+ ### Compare parsers (ablation)
+
+ ```bash
+ python -m zsgdp.cli benchmark-ablate \
+     --input ./docs --output ./bench/ablation \
+     --parser docling --parser pymupdf --parser text
+ ```
+
+ Runs the benchmark once per parser plus a merged arm; emits
+ `ablation_comparison.csv`.
+
+ ### Compare across datasets
+
+ ```bash
+ python -m zsgdp.cli combine-benchmarks \
+     --input ./bench/omni --label omnidocbench \
+     --input ./bench/doclay --label doclaynet \
+     --output ./bench/cross
+ ```
+
+ Emits `dataset_summary.csv` and `parser_matrix.csv` (parser × dataset).
+
+ ### Before pushing to a Space — preflight
+
+ ```bash
+ make preflight       # unit + regression + space-check + parsers (~10s)
+ make preflight-full  # ...plus an end-to-end benchmark smoke
+ ```
+
+ A green preflight is the local signal that the branch is ready for the
+ Space. Pre-commit and pre-push hooks (see [CONTRIBUTING.md](CONTRIBUTING.md))
+ make this automatic on every `git push`.
+
+ ### On the Space — smoke validation
+
+ Once deployed, exercise the deferred GPU/model paths:
+
+ ```bash
+ make space-smoke     # runs whichever of the 5 smokes have their deps
+ python -m scripts.run_space_smoke --strict --output ./space_smoke.json
+ ```
+
+ See [docs/space_smoke.md](docs/space_smoke.md) for the manual fallback
+ procedure (real PDF uploads, full Marker parses) and per-smoke
+ acceptance criteria.
+
+ ---
+
+ ## Opt-ins
+
+ ### Embedding retriever
+
+ The default retriever is lexical TF-IDF (zero deps). To use a real embedder:
+
+ ```yaml
+ # configs/myrun.yaml
+ benchmarks:
+   retriever:
+     backend: embedding
+     model_id: jinaai/jina-embeddings-v3  # or any sentence-transformers model
+     task: retrieval.passage
+ ```
+
+ ```bash
+ python -m pip install -e ".[embedding]"
+ python -m zsgdp.cli benchmark --input ./docs --output ./bench --config configs/myrun.yaml
+ ```
+
+ The first call lazy-loads the model; subsequent calls reuse it in-process.
+ Set `HF_TOKEN` in `.env` for gated models.
+
+ ### Live GPU repair
+
+ The repair controller plans GPU tasks for verification failures (invalid
+ tables, OCR coverage gaps, reading-order issues, missing figure captions).
+ By default these are dry-run only. To execute:
+
+ ```yaml
+ # configs/myrun.yaml
+ repair:
+   gpu_escalation: true
+   execute_gpu_escalations: true  # invokes the configured backend
+ gpu:
+   backend: transformers  # or "vllm" for OpenAI-compat
+   models:
+     table:
+       model_id: Qwen/Qwen2.5-VL-3B-Instruct
+ ```
+
+ Each executed task writes its output back into the merged document with a
+ `gpu_repair_task_id` provenance field.
+
+ ---
+
+ ## Outputs
+
+ Every parse writes:
+
+ - `parsed_document.json` — canonical record (carries `schema_version`).
+ - `document.md` — human-readable Markdown reconstruction.
+ - `elements.jsonl` / `tables.jsonl` / `figures.jsonl` / `chunks.jsonl` — JSONL streams (a consumption sketch follows this list).
+ - `chunking_plan.json` — strategy ladder + per-strategy metadata.
+ - `parser_metrics.json` — per-parser candidate-level stats.
+ - `quality_report.json` — every verifier metric (text coverage, reading order, table validity, parser disagreement, repair resolution/regression rates, GT-comparison metrics when applicable).
+ - `routing_report.json` — page → parser routing decisions.
+ - `profile.json` — document profiler output.
+ - `gpu_runtime.json` — detected GPU/device state at parse time.
+ - `gpu_tasks.jsonl` (when model-backed work is planned) and `gpu_task_report.json` (preflight validation).
+ - `conflict_report.json` (when multiple parsers ran).
+ - `artifact_manifest.json` with SHA-256 checksums and the parsed-document schema version.
+ - `assets/pages/*.png`, `assets/tables/*.png`, `assets/figures/*.png` — rendered PDF pages and region crops.
+
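+ A minimal consumption sketch for `chunks.jsonl` (the JSONL framing comes
+ from this list; the per-chunk field names are assumptions, not the
+ committed schema):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ # One JSON object per line — stream it rather than loading one blob.
+ for line in Path("out/sample/chunks.jsonl").read_text().splitlines():
+     chunk = json.loads(line)
+     # "chunk_id" and "text" are illustrative field names.
+     print(chunk.get("chunk_id"), len(chunk.get("text", "")))
+ ```
+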
+ Benchmark runs additionally write:
+
+ - `results.json` — full structured summary including aggregate means.
+ - `leaderboard.csv` and `per_parser_gt_leaderboard.csv` — parser leaderboards (without and with GT comparison).
+ - `per_parser_metrics.csv` — per-document, per-parser GT-comparison breakdown.
+ - `layout_runs.csv`, `table_structure_runs.csv`, `formula_runs.csv`, `retrieval_runs.csv`, `repair_runs.csv` — per-document detail per metric family.
+ - `parser_runs.csv`, `chunk_runs.csv`, `structure_runs.csv`, `chunk_quality.csv`, `throughput_runs.csv`, `ablations.json` — additional detail.
+
+ `benchmark-ablate` adds `ablation_comparison.csv`. `combine-benchmarks`
+ adds `dataset_summary.csv`, `parser_matrix.csv`, and
+ `cross_dataset_comparison.json`.
+
+ ---
+
+ ## Architecture map
+
+ | Module | Responsibility |
+ |-------------------------|-------------------------------------------------------------------------|
+ | `zsgdp/profiling/` | Cheap per-page features (scanned-score, table density, columns, etc.) |
+ | `zsgdp/routing/` | Deterministic page → parser-expert decisions with budget |
+ | `zsgdp/parsers/` | Adapters; one canonical schema regardless of source |
+ | `zsgdp/normalize/` | Convert each parser's output into the schema |
+ | `zsgdp/merge/` | Align candidates, dedupe, detect conflicts |
+ | `zsgdp/verify/` | Coverage / reading order / table / figure / formula / chunk readiness, plus GT-comparison: layout F1, table structure, formula CER, retrieval recall, parser disagreement, repair success |
+ | `zsgdp/repair/` | Deterministic header/table fixes plus GPU escalation through `gpu/worker.py` |
+ | `zsgdp/chunking/` | Agentic planner + structure / parent-child / table / figure / page chunkers, with semantic / late / vision / proposition deterministic stubs |
+ | `zsgdp/gpu/` | Task planning, batching, dry-run worker, transformers + vLLM clients |
+ | `zsgdp/benchmarks/` | Dataset loaders, metric runners, ablation, cross-dataset, retrieval |
+ | `zsgdp/cli.py` | All entry points |
+ | `app.py` | Gradio Space UI |
+
+ The full spec is in [`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md). §10 (schema) and §17 (chunking ladder) are the most useful sections to skim before touching those modules.
+
+ ---
+
+ ## Production benchmark numbers
+
+ Once the Space deploy is live and `make space-smoke` is green, run the
+ benchmark against your representative corpus and paste the headline
+ metrics here. Spec §29 success criteria, for reference:
+
+ - **MVP:** the full agentic loop improves table QA by ≥20% over the best single parser; agentic chunking improves citation accuracy by ≥10% over the recursive baseline.
+ - **Production-style (HR / financial reports / etc.):** retrieval recall@5 ≥ 90%, citation accuracy ≥ 90%, table QA exactness ≥ 85%, manual review rate ≤ 10%, parser blocking-failure rate ≤ 5%.
+
+ | Metric                          | Dataset / Corpus | Value  | Date   | Run    |
+ |---------------------------------|------------------|--------|--------|--------|
+ | `mean_quality_score`            | _todo_           | _todo_ | _todo_ | _todo_ |
+ | `mean_layout_f1`                | _todo_           | _todo_ | _todo_ | _todo_ |
+ | `mean_table_structure_score`    | _todo_           | _todo_ | _todo_ | _todo_ |
+ | `mean_formula_cer`              | _todo_           | _todo_ | _todo_ | _todo_ |
+ | `mean_retrieval_recall_at_5`    | _todo_           | _todo_ | _todo_ | _todo_ |
+ | `mean_parser_disagreement_rate` | _todo_           | _todo_ | _todo_ | _todo_ |
+ | `mean_repair_resolution_rate`   | _todo_           | _todo_ | _todo_ | _todo_ |
+ | `mean_pages_per_second`         | _todo_           | _todo_ | _todo_ | _todo_ |
+
+ Source rows are individual `results.json` files under each Space-side
+ benchmark output; commit the directory or a redacted summary so the
+ numbers above are reproducible.
+
+ ---
+
+ ## Deployment
+
+ Target: Hugging Face Spaces, hardware `l4x1`, GPU/model target
+ `zeroshotGPU`.
+
+ Pre-deploy gate:
+
+ 1. `make preflight` (local).
+ 2. `make preflight-full` (local, with the end-to-end benchmark smoke).
+ 3. Duplicate the Space, then set `HF_TOKEN` and any other secrets under **Variables and secrets**.
+ 4. Push.
+ 5. `make space-smoke` from the Space's JupyterLab terminal.
+ 6. Inspect [docs/space_smoke.md](docs/space_smoke.md) Smoke 3 (live GPU repair) manually if the runner-level wiring smoke passed but you want full model-invocation validation.
+ 7. Run `python -m zsgdp.cli benchmark` against your representative corpus and update the table above.
+
+ The Space defaults to `configs/docling.yaml` (Docling + PyMuPDF
+ co-enabled so the parser-disagreement rate has signal). Override via
+ `ZSGDP_CONFIG_PATH` in Space variables for custom configs.
+
+ ---
+
+ ## Contributing
+
+ See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, hooks, test layout,
+ fixture format, parser/metric/schema-bump procedures, and the PR checklist.
+
+ For changes touching the on-disk schema, bump `zsgdp.schema.SCHEMA_VERSION`
+ and add an entry under `### Schema` in [CHANGELOG.md](CHANGELOG.md). The
+ artifact manifest surfaces the version under
+ `parsed_document_schema_version` so downstream consumers can gate on it
+ (a gating sketch follows).
+
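+ A minimal sketch of that consumer-side gate (the manifest key name comes
+ from this README; the major-version comparison rule is illustrative):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ EXPECTED_MAJOR = 1  # the major version this consumer was written against
+
+ manifest = json.loads(Path("out/sample/artifact_manifest.json").read_text())
+ version = str(manifest["parsed_document_schema_version"])
+ if int(version.split(".")[0]) != EXPECTED_MAJOR:
+     raise RuntimeError(f"unsupported parsed_document schema {version}")
+ ```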
app.py ADDED
@@ -0,0 +1,251 @@
+ """Hugging Face Spaces entrypoint for zeroshotGPU."""
+
+ from __future__ import annotations
+
+ import os
+ import shutil
+ import tempfile
+ from pathlib import Path
+ from typing import Any
+
+ try:
+     import gradio as gr
+ except ImportError as exc:  # pragma: no cover - only used when launching the Space UI.
+     raise RuntimeError("Gradio is required for the Spaces UI. Install with `python -m pip install -r requirements.txt`.") from exc
+
+ from zsgdp.artifacts import validate_artifact_manifest
+ from zsgdp.config import load_config, load_env_file
+ from zsgdp.gpu import collect_gpu_runtime_status
+ from zsgdp.logging_config import configure_logging, get_logger
+ from zsgdp.pipeline import parse_document
+ from zsgdp.profiling import profile_document
+
+ # Load .env first so any keys it sets (HF_TOKEN, ZSGDP_LOG_LEVEL, etc.) are
+ # visible before we read environment defaults below. Pre-set Space variables
+ # always win — load_env_file does not override existing env entries.
+ load_env_file()
+
+ # Default to JSON logs on the Space so the HF Spaces logs page is greppable.
+ # Override locally with `ZSGDP_LOG_JSON=0` for human-readable text output.
+ os.environ.setdefault("ZSGDP_LOG_LEVEL", "INFO")
+ os.environ.setdefault("ZSGDP_LOG_JSON", "1" if os.environ.get("SPACE_ID") else "0")
+ configure_logging()
+ _logger = get_logger(__name__)
+
+ ROOT = Path(__file__).resolve().parent
+ DOCLING_CONFIG = ROOT / "configs" / "docling.yaml"
+
+ # Abuse guards. Override at deployment time via env vars to relax for trusted
+ # Spaces or tighten further for public ones.
+ MAX_UPLOAD_BYTES = int(os.environ.get("ZSGDP_MAX_UPLOAD_BYTES", str(50 * 1024 * 1024)))  # 50 MB
+ MAX_PAGE_COUNT = int(os.environ.get("ZSGDP_MAX_PAGE_COUNT", "200"))
+
+
+ class UploadRejected(Exception):
+     """Raised when an upload exceeds an abuse-guard limit."""
+
+
+ def _validate_upload(path: Path) -> None:
+     """Reject oversized uploads or PDFs with too many pages before parsing.
+
+     Cheap to compute (file stat + profiler page count) and avoids spending
+     GPU/CPU minutes on inputs the Space wasn't sized for.
+     """
+
+     if not path.exists():
+         raise UploadRejected("Uploaded file is missing on disk.")
+     size = path.stat().st_size
+     if size > MAX_UPLOAD_BYTES:
+         raise UploadRejected(
+             f"Upload is {size / 1024 / 1024:.1f} MB; the Space limit is "
+             f"{MAX_UPLOAD_BYTES / 1024 / 1024:.0f} MB. Set ZSGDP_MAX_UPLOAD_BYTES to override."
+         )
+     try:
+         profile = profile_document(path)
+     except Exception:  # pragma: no cover - profiler is robust; this is belt-and-braces.
+         return
+     if profile.page_count > MAX_PAGE_COUNT:
+         raise UploadRejected(
+             f"Document has {profile.page_count} pages; the Space limit is "
+             f"{MAX_PAGE_COUNT}. Set ZSGDP_MAX_PAGE_COUNT to override."
+         )
+
+
+ # Top-level artifact files surfaced as individual downloads. Nested
+ # directories like assets/ stay bundled in the zip only — they can be
+ # large for multi-page PDFs and would clutter the per-artifact list.
+ _INDIVIDUAL_ARTIFACT_NAMES = (
+     "parsed_document.json",
+     "document.md",
+     "elements.jsonl",
+     "tables.jsonl",
+     "figures.jsonl",
+     "chunks.jsonl",
+     "chunking_plan.json",
+     "parser_metrics.json",
+     "quality_report.json",
+     "routing_report.json",
+     "profile.json",
+     "gpu_runtime.json",
+     "gpu_tasks.jsonl",
+     "gpu_task_report.json",
+     "artifact_manifest.json",
+     "conflict_report.json",
+ )
+
+
+ def _collect_artifact_files(output_dir: Path) -> list[str]:
+     """Return absolute paths for the top-level artifacts the Space surfaces.
+
+     Order matches _INDIVIDUAL_ARTIFACT_NAMES so the UI listing is stable.
+     Missing files are silently skipped (different parse runs emit different
+     subsets — e.g. conflict_report.json only when multiple parsers ran).
+     """
+
+     paths: list[str] = []
+     for name in _INDIVIDUAL_ARTIFACT_NAMES:
+         candidate = output_dir / name
+         if candidate.exists():
+             paths.append(str(candidate))
+     return paths
+
+
+ def _empty_outputs(reason: str, source: Path | None, *, rejected: bool, runtime: dict) -> tuple:
+     """Return-shape used for every error path. Centralised so the tuple width
+     can't drift between the success path and the four error paths."""
+
+     summary: dict[str, Any] = {"error": reason}
+     if source is not None:
+         summary["source"] = str(source)
+     if rejected:
+         summary["rejected"] = True
+     return ("", summary, {}, {}, {}, runtime, [], {}, {}, None, [])
+
+
+ def parse_uploaded_document(file_obj: Any, pipeline_mode: str):
+     if file_obj is None:
+         return _empty_outputs("Upload a document first.", None, rejected=False, runtime={})
+
+     source = Path(file_obj.name)
+     work_dir = Path(tempfile.mkdtemp(prefix="zeroshotgpu_"))
+     output_dir = work_dir / "parsed"
+     config_path = _config_path_for_mode(pipeline_mode)
+
+     try:
+         _validate_upload(source)
+     except UploadRejected as exc:
+         _logger.warning(
+             "space_upload_rejected",
+             extra={"source_path": str(source), "reason": str(exc)},
+         )
+         runtime = runtime_status_for_mode(pipeline_mode)
+         return _empty_outputs(str(exc), source, rejected=True, runtime=runtime)
+
+     try:
+         parsed = parse_document(source, output_dir, config_path=config_path)
+     except Exception as exc:  # pragma: no cover - surfaced in the Space UI.
+         runtime = runtime_status_for_mode(pipeline_mode)
+         return _empty_outputs(str(exc), source, rejected=False, runtime=runtime)
+
+     artifact_validation = validate_artifact_manifest(output_dir)
+     archive_path = shutil.make_archive(str(output_dir), "zip", output_dir)
+     individual_files = _collect_artifact_files(output_dir)
+     runtime = parsed.provenance.get("gpu_runtime", {})
+     summary = {
+         "doc_id": parsed.doc_id,
+         "file_type": parsed.file_type,
+         "elements": len(parsed.elements),
+         "tables": len(parsed.tables),
+         "figures": len(parsed.figures),
+         "chunks": len(parsed.chunks),
+         "quality_score": parsed.quality_report.score,
+         "blocking": parsed.quality_report.has_blocking_failures,
+         "deployment": parsed.provenance.get("config_deployment", {}),
+         "runtime_device": runtime.get("device"),
+         "running_on_huggingface_space": runtime.get("running_on_huggingface_space"),
+         "artifact_manifest_valid": artifact_validation.get("valid"),
+         "artifact_count": artifact_validation.get("artifact_count"),
+         "artifact_checked_count": artifact_validation.get("checked_count"),
+         "individual_artifact_count": len(individual_files),
+     }
+     return (
+         parsed.to_markdown(),
+         summary,
+         parsed.quality_report.to_dict(),
+         parsed.provenance.get("parser_metrics", {}),
+         parsed.provenance.get("chunking", {}),
+         runtime,
+         parsed.provenance.get("gpu_tasks", []),
+         parsed.provenance.get("gpu_task_report", {}),
+         artifact_validation,
+         archive_path,
+         individual_files,
+     )
+
+
+ def _config_path_for_mode(pipeline_mode: str) -> Path | None:
+     env_config = os.environ.get("ZSGDP_CONFIG_PATH")
+     if env_config:
+         return Path(env_config)
+     if pipeline_mode == "Docling + PyMuPDF" and DOCLING_CONFIG.exists():
+         return DOCLING_CONFIG
+     return None
+
+
+ def runtime_status_for_mode(pipeline_mode: str) -> dict:
+     return collect_gpu_runtime_status(load_config(_config_path_for_mode(pipeline_mode))).to_dict()
+
+
+ with gr.Blocks(title="zeroshotGPU") as demo:
+     gr.Markdown("# zeroshotGPU")
+     with gr.Row():
+         upload = gr.File(label="Document", file_types=[".pdf", ".md", ".txt", ".html"])
+         with gr.Column():
+             pipeline = gr.Dropdown(
+                 choices=["Docling + PyMuPDF", "Default lightweight"],
+                 value="Docling + PyMuPDF",
+                 label="Pipeline",
+             )
+             parse_button = gr.Button("Parse", variant="primary")
+             archive = gr.File(label="Artifacts (zip)")
+     with gr.Tabs():
+         with gr.Tab("Markdown"):
+             markdown = gr.Markdown(label="Canonical Markdown")
+         with gr.Tab("Run"):
+             summary = gr.JSON(label="Summary")
+             quality = gr.JSON(label="Quality Report")
+             parser_metrics = gr.JSON(label="Parser Metrics")
+             chunking = gr.JSON(label="Chunking Plan")
+             artifact_validation = gr.JSON(label="Artifact Manifest Validation")
+         with gr.Tab("Artifacts"):
+             gr.Markdown(
+                 "Each top-level artifact is downloadable individually. "
+                 "Nested assets (page renders, table/figure crops) stay bundled "
+                 "in the zip above."
+             )
+             individual_artifacts = gr.Files(label="Individual artifacts")
+         with gr.Tab("Runtime"):
+             runtime = gr.JSON(label="GPU Runtime", value=runtime_status_for_mode("Docling + PyMuPDF"))
+             gpu_tasks = gr.JSON(label="Planned GPU Tasks")
+             gpu_task_report = gr.JSON(label="GPU Task Preflight")
+     parse_button.click(
+         parse_uploaded_document,
+         inputs=[upload, pipeline],
+         outputs=[
+             markdown,
+             summary,
+             quality,
+             parser_metrics,
+             chunking,
+             runtime,
+             gpu_tasks,
+             gpu_task_report,
+             artifact_validation,
+             archive,
+             individual_artifacts,
+         ],
+     )
+
+
+ if __name__ == "__main__":
+     demo.launch()
configs/default.yaml ADDED
@@ -0,0 +1,159 @@
1
+ parsers:
2
+ text:
3
+ enabled: true
4
+ pymupdf:
5
+ enabled: true
6
+ docling:
7
+ enabled: false
8
+ do_ocr: false
9
+ do_table_structure: false
10
+ force_backend_text: true
11
+ marker:
12
+ enabled: false
13
+ command: null
14
+ timeout_seconds: 300
15
+ output_args: "--output_dir {output_dir} --output_format markdown"
16
+ extra_args: ""
17
+ mineru:
18
+ enabled: false
19
+ command: null
20
+ timeout_seconds: 600
21
+ output_args: "--output_dir {output_dir}"
22
+ extra_args: ""
23
+ olmocr:
24
+ enabled: false
25
+ command: null
26
+ timeout_seconds: 600
27
+ output_args: "--output_dir {output_dir}"
28
+ extra_args: ""
29
+ paddleocr:
30
+ enabled: false
31
+ command: null
32
+ timeout_seconds: 600
33
+ output_args: "--output_dir {output_dir}"
34
+ extra_args: ""
35
+ unstructured:
36
+ enabled: false
37
+
38
+ routing:
39
+ run_multiple_on_hard_pages: true
40
+ max_primary_parsers_per_page: 2
41
+ hard_page_threshold: 0.65
42
+ scanned_text_threshold: 0.40
43
+ table_density_threshold: 0.25
44
+ formula_density_threshold: 0.15
45
+ figure_density_threshold: 0.20
46
+
47
+ repair:
48
+ enabled: true
49
+ max_iterations: 3
50
+ # Plan and dry-run GPU escalations for verification failures.
51
+ gpu_escalation: true
52
+ # Actually invoke the configured GPU/VLM backend on flagged regions.
53
+ # Defaults to false to avoid surprise model downloads on local runs;
54
+ # set true on the Space once GPU models are warm.
55
+ execute_gpu_escalations: false
56
+ table_repair: true
57
+ reading_order_repair: true
58
+ figure_repair: true
59
+ ocr_repair: true
60
+
61
+ gpu:
62
+ backend: transformers
63
+ provider: huggingface_spaces
64
+ space_name: zeroshotGPU
65
+ batch_pages: true
66
+ validate_tasks: true
67
+ max_batch_size: 4
68
+ max_gpu_seconds_per_doc: 120
69
+ max_vlm_calls_per_doc: 30
70
+ models:
71
+ vlm:
72
+ model_id: Qwen/Qwen2.5-VL-3B-Instruct
73
+ task: image-text-to-text
74
+ device: auto
75
+ dtype: bfloat16
76
+ max_batch_size: 1
77
+ ocr:
78
+ model_id: Qwen/Qwen2.5-VL-3B-Instruct
79
+ task: document-ocr
80
+ device: auto
81
+ dtype: bfloat16
82
+ max_batch_size: 1
83
+ table:
84
+ model_id: Qwen/Qwen2.5-VL-3B-Instruct
85
+ task: table-repair
86
+ device: auto
87
+ dtype: bfloat16
88
+ max_batch_size: 1
89
+ embedding:
90
+ model_id: jinaai/jina-embeddings-v3
91
+ task: retrieval.passage
92
+ device: auto
93
+ dtype: bfloat16
94
+ max_batch_size: 16
95
+ task_model_roles:
96
+ vlm_route_repair: vlm
97
+ ocr_page: ocr
98
+ table_vlm_repair: table
99
+ figure_description: vlm
100
+
101
+ pdf:
102
+ render_pages: true
103
+ render_dpi: 150
104
+ crop_tables: true
105
+ crop_figures: true
106
+ asset_dir: assets
107
+
108
+ quality:
109
+ accept_threshold: 0.88
110
+ blocking_failures:
111
+ - empty_page
112
+ - invalid_table
113
+ - missing_text_coverage
114
+ - reading_order_failure
115
+
116
+ chunking:
117
+ enabled: true
118
+ planner: agentic
119
+ baseline_strategy: recursive_structure
120
+ target_tokens: 512
121
+ min_tokens: 120
122
+ overlap_ratio: 0.15
123
+ parent_child: true
124
+ parent_target_tokens: 1600
125
+ page_level_for_paginated_docs: true
126
+ table_chunks: true
127
+ figure_chunks: true
128
+ contextual_prefix: false
129
+ contextual_retrieval: false
130
+ semantic_similarity_threshold: 0.18
131
+ max_propositions_per_source: 8
132
+ max_proposition_chunks: 64
133
+ semantic_chunking: false
134
+ late_chunking: false
135
+ vision_guided: false
136
+ agentic_proposition_chunking: false
137
+ strategy_ladder:
138
+ - fixed_token_baseline
139
+ - recursive_structure
140
+ - metadata_enriched
141
+ - parent_child
142
+ - contextual_retrieval
143
+ - late_chunking
144
+ - semantic_chunking
145
+ - vision_guided
146
+ - agentic_proposition
147
+
148
+ benchmarks:
149
+ retriever:
150
+ # `lexical` (default, model-free TF-IDF) or `embedding` (sentence-transformers).
151
+ # The `embedding` backend pulls model_id and task from gpu.models.embedding
152
+ # unless overridden here. Requires `pip install sentence-transformers`.
153
+ backend: lexical
154
+ model_id: null
155
+ task: null
156
+
157
+ deployment:
158
+ target: huggingface_spaces
159
+ gpu_models_target: zeroshotGPU
configs/docling.yaml ADDED
@@ -0,0 +1,29 @@
1
+ parsers:
2
+ # Both docling and pymupdf are enabled deliberately so the parser
3
+ # disagreement-rate metric has a comparison surface on PDF inputs.
4
+ # Disable one if you only need a single-parser baseline.
5
+ docling:
6
+ enabled: true
7
+ do_ocr: false
8
+ do_table_structure: false
9
+ force_backend_text: true
10
+ generate_page_images: false
11
+ generate_picture_images: false
12
+ generate_table_images: false
13
+ do_picture_description: false
14
+ do_picture_classification: false
15
+ do_formula_enrichment: false
16
+ do_code_enrichment: false
17
+ marker:
18
+ enabled: false
19
+ pymupdf:
20
+ enabled: true
21
+
22
+ routing:
23
+ run_multiple_on_hard_pages: true
24
+ max_primary_parsers_per_page: 2
25
+
26
+ pdf:
27
+ render_pages: true
28
+ crop_tables: true
29
+ crop_figures: true
configs/gpu.yaml ADDED
@@ -0,0 +1,43 @@
1
+ gpu:
2
+ backend: transformers
3
+ provider: huggingface_spaces
4
+ space_name: zeroshotGPU
5
+ batch_pages: true
6
+ validate_tasks: true
7
+ max_batch_size: 4
8
+ max_gpu_seconds_per_doc: 120
9
+ max_vlm_calls_per_doc: 30
10
+ models:
11
+ vlm:
12
+ model_id: Qwen/Qwen2.5-VL-3B-Instruct
13
+ task: image-text-to-text
14
+ device: auto
15
+ dtype: bfloat16
16
+ max_batch_size: 1
17
+ ocr:
18
+ model_id: Qwen/Qwen2.5-VL-3B-Instruct
19
+ task: document-ocr
20
+ device: auto
21
+ dtype: bfloat16
22
+ max_batch_size: 1
23
+ table:
24
+ model_id: Qwen/Qwen2.5-VL-3B-Instruct
25
+ task: table-repair
26
+ device: auto
27
+ dtype: bfloat16
28
+ max_batch_size: 1
29
+ embedding:
30
+ model_id: jinaai/jina-embeddings-v3
31
+ task: retrieval.passage
32
+ device: auto
33
+ dtype: bfloat16
34
+ max_batch_size: 16
35
+ task_model_roles:
36
+ vlm_route_repair: vlm
37
+ ocr_page: ocr
38
+ table_vlm_repair: table
39
+ figure_description: vlm
40
+
41
+ deployment:
42
+ target: huggingface_spaces
43
+ gpu_models_target: zeroshotGPU
configs/parsers.yaml ADDED
@@ -0,0 +1,33 @@
1
+ parsers:
2
+ text:
3
+ enabled: true
4
+ pymupdf:
5
+ enabled: true
6
+ docling:
7
+ enabled: false
8
+ marker:
9
+ enabled: false
10
+ command: null
11
+ timeout_seconds: 300
12
+ output_args: "--output_dir {output_dir} --output_format markdown"
13
+ extra_args: ""
14
+ mineru:
15
+ enabled: false
16
+ command: null
17
+ timeout_seconds: 600
18
+ output_args: "--output_dir {output_dir}"
19
+ extra_args: ""
20
+ olmocr:
21
+ enabled: false
22
+ command: null
23
+ timeout_seconds: 600
24
+ output_args: "--output_dir {output_dir}"
25
+ extra_args: ""
26
+ paddleocr:
27
+ enabled: false
28
+ command: null
29
+ timeout_seconds: 600
30
+ output_args: "--output_dir {output_dir}"
31
+ extra_args: ""
32
+ unstructured:
33
+ enabled: false
configs/routing.yaml ADDED
@@ -0,0 +1,8 @@
1
+ routing:
2
+ run_multiple_on_hard_pages: true
3
+ max_primary_parsers_per_page: 2
4
+ hard_page_threshold: 0.65
5
+ scanned_text_threshold: 0.40
6
+ table_density_threshold: 0.25
7
+ formula_density_threshold: 0.15
8
+ figure_density_threshold: 0.20
docs/space_smoke.md ADDED
@@ -0,0 +1,269 @@
1
+ # Hugging Face Space smoke-test checklist
2
+
3
+ This is the deferred deployment-readiness work that can only be exercised on
4
+ real GPU hardware against real models / external CLIs. Run each smoke once
5
+ against a duplicated `zeroshotGPU` Space (or your own dev Space). Each entry
6
+ gives the exact env vars / config flips, the command to trigger, and the
7
+ structured log lines you should expect.
8
+
9
+ All log lines below assume the Space is run with `ZSGDP_LOG_LEVEL=INFO` and
10
+ `ZSGDP_LOG_JSON=1`. `app.py` sets these automatically when `SPACE_ID` is in
11
+ the environment, so on a normal Space you do not need to set them yourself.
12
+ The HF Spaces logs page will surface the JSON records on stderr.
13
+
14
+ ---
15
+
16
+ ## Pre-flight
17
+
18
+ 1. Duplicate the Space, give it `l4x1` hardware.
19
+ 2. Make sure these are set in **Space settings → Variables and secrets**:
20
+ - `ZSGDP_LOG_LEVEL=INFO`
21
+ - `ZSGDP_LOG_JSON=1`
22
+ - (Optional, only for parser smokes that hit a private repo) `HF_TOKEN`.
23
+ 3. In the Space's `requirements.txt`, uncomment the dependency block matching
24
+ the smoke you are running. Do **one smoke per Space deploy** — combining
25
+ them risks an OOM or slow cold-start on the L4.
26
+ 4. Push and wait for the Space to build. First-build cold-start with a model
27
+ download is ~5-10 minutes; subsequent restarts are seconds.
28
+
29
+ After deploy, watch the **Logs** tab for the `parse_start` event. If you do
30
+ not see structured JSON lines there, the logging config is not active —
31
+ double-check `ZSGDP_LOG_JSON=1` in the Space variables.
32
+
33
+ ## Automated runner
34
+
35
+ Each smoke below has an automated counterpart in
36
+ `scripts/run_space_smoke.py`. From a Space JupyterLab terminal (or any
37
+ shell with the project installed):
38
+
39
+ ```bash
40
+ # Run all smokes whose deps are installed; skip the rest with hints:
41
+ python -m scripts.run_space_smoke --output ./space_smoke_report.json
42
+
43
+ # Run only specific smokes:
44
+ python -m scripts.run_space_smoke --smoke lexical --smoke ablation
45
+
46
+ # CI-strict mode: treat skipped smokes as failures (use after you've
47
+ # uncommented the deps for the smoke you intend to run):
48
+ python -m scripts.run_space_smoke --smoke embedding --strict
49
+ ```
50
+
51
+ The runner reports `pass` / `fail` / `skip` / `error` per smoke, plus
52
+ elapsed seconds and a `detail` block with the metrics it gathered. The
53
+ manual procedure below is the fallback when you want to inspect the UI
54
+ directly or test something the runner doesn't cover (e.g. uploading a
55
+ specific real PDF rather than a synthetic fixture).
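+
+ For CI wiring, a minimal sketch of consuming the JSON report (field names
+ match `SmokeReport.to_dict` in `scripts/run_space_smoke.py`):
+
+ ```python
+ import json
+
+ with open("space_smoke_report.json", encoding="utf-8") as fh:
+     report = json.load(fh)
+ for smoke in report["smokes"]:
+     print(f"{smoke['name']}: {smoke['status']} ({smoke['elapsed_seconds']}s)")
+ failed = report["summary"]["failed"] + report["summary"]["errored"]
+ raise SystemExit(1 if failed else 0)
+ ```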
56
+
57
+ ---
58
+
59
+ ## Smoke 1 — Lexical retriever benchmark (model-free)
60
+
61
+ Confirms the Space's parsing + benchmark plumbing works end-to-end before
62
+ adding any model dependency.
63
+
64
+ **Setup:**
65
+ - Default `requirements.txt` (no uncommenting needed).
66
+ - Default config (no flips).
67
+
68
+ **Trigger:** upload a small markdown file via the Gradio UI.
69
+
70
+ **Expected log lines (in order):**
71
+ - `parse_start` with `doc_id`, `file_type`, `device` (likely `cuda`).
72
+ - One `parser_candidate` per parser that ran (typically `text`, possibly
73
+ `pymupdf` and `docling` if the file was a PDF).
74
+ - Possibly one or more `repair_iteration` records if quality < threshold.
75
+ - `parse_end` with `quality_score`, `repair_iterations`, `chunk_count`.
76
+
77
+ **Pass criteria:**
78
+ - All log lines appear with `doc_id` populated.
79
+ - `parse_end.quality_score >= 0.85` for a clean markdown doc.
80
+ - No `parser_failed` or `gpu_task_blocked` records.
81
+
82
+ ---
83
+
84
+ ## Smoke 2 — Embedding retriever (jina-embeddings-v3)
85
+
86
+ Confirms `sentence-transformers` lazy-load path and that jina-v3 specifically
87
+ runs on the L4 with `trust_remote_code=True`.
88
+
89
+ **Setup:**
90
+ - In `requirements.txt`, uncomment `transformers` and `sentence-transformers`
91
+ lines.
92
+ - Add `configs/space_embedding.yaml` to the repo with:
93
+
94
+ ```yaml
95
+ benchmarks:
96
+ retriever:
97
+ backend: embedding
98
+ model_id: jinaai/jina-embeddings-v3
99
+ task: retrieval.passage
100
+ ```
101
+
102
+ - In `app.py` set `os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml"`,
103
+ or pass via the env var configured in Space variables.
104
+
105
+ **Trigger:** the benchmark CLI is not reachable from the Gradio UI today, so
106
+ run `zsgdp benchmark --input ./fixtures --output ./out` from a Space
107
+ **JupyterLab** session against a small input dir. Uploading a markdown/PDF
108
+ via the UI only exercises the parse path, not the retrieval benchmark.
109
+
110
+ **Expected log lines:**
111
+ - First call: a 30–90s pause while jina-v3 weights download (no log lines
112
+ during this — torch logs go to its own logger). Then `parse_start`.
113
+ - After the first parse, subsequent calls are fast (model is in memory).
114
+
115
+ **Pass criteria:**
116
+ - Benchmark completes without an exception.
117
+ - `summary["mean_retrieval_recall_at_5"] >= 0.7` on a small distinct-text
118
+ corpus.
119
+ - No `gpu_task_blocked` records (those are repair-related, not retrieval).
120
+ - The parse_end record's `device` field reads `cuda`.
121
+
122
+ **Failure modes to watch:**
123
+ - `RuntimeError: EmbeddingRetriever requires sentence-transformers` →
124
+ package not in `requirements.txt`.
125
+ - CUDA OOM → switch to a smaller embedding model
126
+ (`sentence-transformers/all-MiniLM-L6-v2`) for the smoke and confirm the
127
+ wiring before retrying jina-v3.
128
+
129
+ ---
130
+
131
+ ## Smoke 3 — Live GPU repair on a malformed table
132
+
133
+ Confirms the repair loop's GPU escalation path actually invokes the
134
+ configured VLM and that the result is applied to the merged document.
135
+
136
+ **Setup:**
137
+ - In `requirements.txt`, uncomment `transformers` (sentence-transformers
138
+ not needed for this smoke).
139
+ - Add `configs/space_gpu_repair.yaml`:
140
+
141
+ ```yaml
142
+ parsers:
143
+ docling:
144
+ enabled: true
145
+ pymupdf:
146
+ enabled: true
147
+ repair:
148
+ enabled: true
149
+ gpu_escalation: true
150
+ execute_gpu_escalations: true # the bit that flips the live path on
151
+ gpu:
152
+ backend: transformers
153
+ models:
154
+ table:
155
+ model_id: Qwen/Qwen2.5-VL-3B-Instruct
156
+ task: table-repair
157
+ device: auto
158
+ dtype: bfloat16
159
+ ```
160
+
161
+ - Set `ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml` on the Space.
162
+
163
+ **Trigger:** upload a PDF that contains a table the parsers will likely
164
+ mangle. A two-column financial statement page works well; if you don't
165
+ have one handy, take a Wikipedia article PDF that has a comparison table.
166
+
167
+ **Expected log lines (in order):**
168
+ - `parse_start`.
169
+ - `parser_candidate` for docling and pymupdf (both should fire on a PDF).
170
+ - `repair_iteration` with `iteration=1`, `gpu_task_count >= 1`,
171
+ `gpu_dry_run=false`.
172
+ - One `gpu_task_executed` record per GPU task. `status` should be
173
+ `executed` and `elapsed_seconds` 1-10s for a 3B-param VLM on L4.
174
+ - A second `repair_iteration` with `iteration=2` only if iteration 1
175
+ changed something and quality is still below threshold; otherwise the
176
+ loop terminates.
177
+ - `parse_end` with `repair_iterations >= 1`.
178
+
179
+ **Pass criteria:**
180
+ - At least one `gpu_task_executed` with `status=executed`.
181
+ - The output `parsed_document.json` shows `parsed.tables[i].provenance.gpu_repair_task_id` set.
182
+ - No `gpu_task_blocked` records (would mean missing image_path or doc_id).
183
+
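+ To check the provenance criterion above programmatically — a sketch that
+ assumes `parsed_document.json` serializes tables with a `provenance` dict,
+ as the criterion implies:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ doc = json.loads(Path("out/parsed_document.json").read_text(encoding="utf-8"))
+ repaired = [
+     t for t in doc.get("tables", [])
+     if t.get("provenance", {}).get("gpu_repair_task_id")
+ ]
+ assert repaired, "no table carries gpu_repair_task_id; live repair did not apply"
+ print(f"tables with GPU repair provenance: {len(repaired)}")
+ ```
+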
184
+ **Failure modes to watch:**
185
+ - All `gpu_task_executed` records show `status=execution_failed` →
186
+ inspect `output.error` field; common causes are missing image_path
187
+ (the PDF doesn't render page crops because `pdf.crop_tables=true` isn't
188
+ set) or a CUDA OOM.
189
+ - No `repair_iteration` records → the verifier didn't flag any
190
+ blocking issues; pick a different input PDF.
191
+
192
+ ---
193
+
194
+ ## Smoke 4 — Per-parser ablation across docling + pymupdf
195
+
196
+ Confirms the ablation runner produces a comparison CSV and that each arm's
197
+ artifacts are isolated. No GPU dependency, runs on default Space hardware.
198
+
199
+ **Setup:** default config, no requirements.txt changes.
200
+
201
+ **Trigger:** Space JupyterLab terminal:
202
+
203
+ ```bash
204
+ zsgdp benchmark-ablate \
205
+ --input ./fixtures/pdfs \
206
+ --output ./out/ablation \
207
+ --parser docling --parser pymupdf
208
+ ```
209
+
210
+ **Expected log lines:** one parse cycle per arm (parse_start through
211
+ parse_end), three arms total (docling-only, pymupdf-only, merged).
212
+
213
+ **Pass criteria:**
214
+ - `out/ablation/ablation_comparison.csv` has 3 rows.
215
+ - Each arm's `mean_quality_score` is non-zero.
216
+ - The merged arm's `mean_quality_score` is `>= max(per-parser arms)`.
217
+
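+ A sketch of checking those criteria from the CSV (column names `arm` and
+ `mean_quality_score` are inferred from the criteria above — adjust if the
+ runner's header differs):
+
+ ```python
+ import csv
+
+ with open("out/ablation/ablation_comparison.csv", newline="", encoding="utf-8") as fh:
+     rows = list(csv.DictReader(fh))
+ assert len(rows) == 3, f"expected 3 arms, got {len(rows)}"
+ scores = {row["arm"]: float(row["mean_quality_score"]) for row in rows}
+ assert all(score > 0 for score in scores.values()), scores
+ assert scores["merged"] >= max(v for k, v in scores.items() if k != "merged"), scores
+ ```
+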
218
+ ---
219
+
220
+ ## Smoke 5 — External parser CLI (Marker)
221
+
222
+ The riskiest of the four external adapters, because Marker's argv schema
223
+ has changed several times. Run it in its own Space deploy; do not bundle it with other smokes.
224
+
225
+ **Setup:**
226
+ - Uncomment `marker-pdf` in `requirements.txt`.
227
+ - Add `configs/space_marker.yaml`:
228
+
229
+ ```yaml
230
+ parsers:
231
+ text:
232
+ enabled: false
233
+ pymupdf:
234
+ enabled: false
235
+ marker:
236
+ enabled: true
237
+ timeout_seconds: 300
238
+ output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
239
+ extra_args: []
240
+ ```
241
+
242
+ - Set `ZSGDP_CONFIG_PATH=configs/space_marker.yaml`.
243
+
244
+ **Trigger:** upload a small PDF (1–3 pages) via the Gradio UI.
245
+
246
+ **Expected log lines:**
247
+ - `parse_start`.
248
+ - `parser_candidate` for `marker` with non-zero `element_count`.
249
+ - `parse_end` with `candidate_parsers=["marker"]`.
250
+
251
+ **Pass criteria:**
252
+ - No `parser_failed` record for marker.
253
+ - Output Markdown has reasonable content (open the artifact zip and check).
254
+ - If `parser_failed` fires, look at `extra.error` — most common cause is
255
+ argv schema drift; tweak `output_args` in the config and retry.
256
+
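+ Before retrying, a quick probe of the installed CLI's argv surface
+ (assumes the binary exposes `--help`):
+
+ ```python
+ import shutil
+ import subprocess
+
+ cli = shutil.which("marker_single") or shutil.which("marker")
+ assert cli, "marker CLI not on PATH"
+ print(subprocess.run([cli, "--help"], capture_output=True, text=True).stdout[:2000])
+ ```
+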
257
+ ---
258
+
259
+ ## What "deployment ready" means after this checklist
260
+
261
+ If smokes 1–3 pass on a fresh duplicated Space, the project is genuinely
262
+ deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4
263
+ and 5 are nice-to-have — the per-parser ablation works locally too, and
264
+ external parsers stay flagged "experimental" until you actively need them.
265
+
266
+ Open the `parsed_document.json` from each smoke, copy the `quality_score`,
267
+ `mean_layout_f1` (where applicable), and any §29-relevant metric into
268
+ `README.md` under a new "Production benchmark numbers" section. That
269
+ publishes evidence that the success criteria are met against real data.
examples/parse_folder.py ADDED
@@ -0,0 +1,27 @@
1
+ """Parse a folder sequentially."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import argparse
6
+ from pathlib import Path
7
+
8
+ from zsgdp import parse_document
9
+
10
+
11
+ def main() -> int:
12
+ parser = argparse.ArgumentParser()
13
+ parser.add_argument("input")
14
+ parser.add_argument("output")
15
+ args = parser.parse_args()
16
+
17
+ input_dir = Path(args.input)
18
+ output_dir = Path(args.output)
19
+ output_dir.mkdir(parents=True, exist_ok=True)
20
+ for path in sorted(item for item in input_dir.iterdir() if item.is_file()):
21
+ parsed = parse_document(path, output_dir / path.stem)
22
+ print(f"{path.name}: score={parsed.quality_report.score:.2f}")
23
+ return 0
24
+
25
+
26
+ if __name__ == "__main__":
27
+ raise SystemExit(main())
examples/parse_pdf.py ADDED
@@ -0,0 +1,25 @@
1
+ """Parse one PDF with the MVP pipeline."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import argparse
6
+
7
+ from zsgdp import parse_document
8
+
9
+
10
+ def main() -> int:
11
+ parser = argparse.ArgumentParser()
12
+ parser.add_argument("input")
13
+ parser.add_argument("output")
14
+ args = parser.parse_args()
15
+ parsed = parse_document(args.input, args.output)
16
+ print(
17
+ f"score={parsed.quality_report.score:.2f} "
18
+ f"elements={len(parsed.elements)} tables={len(parsed.tables)} "
19
+ f"figures={len(parsed.figures)} chunks={len(parsed.chunks)}"
20
+ )
21
+ return 0
22
+
23
+
24
+ if __name__ == "__main__":
25
+ raise SystemExit(main())
examples/run_benchmark.py ADDED
@@ -0,0 +1,33 @@
1
+ """Minimal benchmark runner placeholder."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import argparse
6
+ from pathlib import Path
7
+ from time import perf_counter
8
+
9
+ from zsgdp import parse_document
10
+ from zsgdp.benchmarks.throughput import pages_per_second
11
+
12
+
13
+ def main() -> int:
14
+ parser = argparse.ArgumentParser()
15
+ parser.add_argument("input")
16
+ parser.add_argument("output")
17
+ args = parser.parse_args()
18
+
19
+ input_dir = Path(args.input)
20
+ output_dir = Path(args.output)
21
+ output_dir.mkdir(parents=True, exist_ok=True)
22
+ total_pages = 0
23
+ started = perf_counter()
24
+ for path in sorted(item for item in input_dir.iterdir() if item.is_file()):
25
+ parsed = parse_document(path, output_dir / path.stem)
26
+ total_pages += len(parsed.pages)
27
+ elapsed = perf_counter() - started
28
+ print(f"pages={total_pages} seconds={elapsed:.2f} pages_per_second={pages_per_second(total_pages, elapsed):.2f}")
29
+ return 0
30
+
31
+
32
+ if __name__ == "__main__":
33
+ raise SystemExit(main())
pyproject.toml ADDED
@@ -0,0 +1,41 @@
1
+ [build-system]
2
+ requires = ["setuptools>=68", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "zero-shot-gpu-doc-parser"
7
+ version = "0.1.0"
8
+ description = "Zero-shot GPU document parsing and agentic chunking control plane."
9
+ readme = "README.md"
10
+ requires-python = ">=3.11"
11
+ license = { text = "MIT" }
12
+ authors = [{ name = "Zero-Shot GPU Document Parser Contributors" }]
13
+ dependencies = []
14
+
15
+ [project.optional-dependencies]
16
+ pdf = ["pymupdf>=1.24.0,<1.28.0"]
17
+ yaml = ["pyyaml>=6.0.1,<7.0.0"]
18
+ docling = ["docling>=2.0.0,<3.0.0"]
19
+ # `spaces` mirrors requirements.txt at the root, which is what HF Spaces
20
+ # installs verbatim. Keep these two in sync; torch is intentionally absent
21
+ # because the l4x1 Space image preinstalls a CUDA-matched build.
22
+ spaces = [
23
+ "gradio>=4.44.0,<7.0.0",
24
+ "pymupdf>=1.24.0,<1.28.0",
25
+ "pyyaml>=6.0.1,<7.0.0",
26
+ "docling>=2.0.0,<3.0.0",
27
+ ]
28
+ embedding = ["sentence-transformers>=3.0.0,<4.0.0", "transformers>=4.45.0,<6.0.0"]
29
+ gpu_repair = ["transformers>=4.45.0,<6.0.0"]
30
+ dev = ["pytest>=8.0.0"]
31
+
32
+ [project.scripts]
33
+ zsgdp = "zsgdp.cli:main"
34
+
35
+ [tool.setuptools.packages.find]
36
+ where = ["."]
37
+ include = ["zsgdp*"]
38
+
39
+ [tool.pytest.ini_options]
40
+ testpaths = ["tests"]
41
+ pythonpath = ["."]
requirements.txt ADDED
@@ -0,0 +1,33 @@
1
+ # Hugging Face Spaces dependencies for zeroshotGPU.
2
+ #
3
+ # Versions are pinned to tested upper bounds within each major. Bump them
4
+ # only after `python -m unittest discover` and the benchmark suite both
5
+ # pass against the new release.
6
+ #
7
+ # Torch is intentionally NOT pinned here. The l4x1 Space image preinstalls a
8
+ # CUDA-matched torch build; pinning torch in this file overrides it and risks
9
+ # a runtime/driver mismatch. If you're running locally without the Space
10
+ # preinstall, install torch separately via the recommended channel for your
11
+ # platform (e.g. `pip install torch --index-url https://download.pytorch.org/whl/cu121`).
12
+
13
+ gradio>=4.44.0,<7.0.0
14
+ pymupdf>=1.24.0,<1.28.0
15
+ pyyaml>=6.0.1,<7.0.0
16
+ docling>=2.0.0,<3.0.0
17
+
18
+ # Optional GPU/embedding stack. Uncomment to enable the embedding retriever
19
+ # (benchmarks.retriever.backend=embedding) and live GPU repair escalations
20
+ # (repair.execute_gpu_escalations=true). Both are off by default.
21
+ #
22
+ # transformers>=4.45.0,<6.0.0
23
+ # sentence-transformers>=3.0.0,<4.0.0
24
+
25
+ # Optional external parser CLIs. Each adds a non-trivial install footprint;
26
+ # enable only the ones the Space hardware can support. Adapter shells out to
27
+ # the CLI binary (see zsgdp/parsers/external.py); these adapters have not
28
+ # been smoke-tested against a live install — verify the argv schema before
29
+ # enabling in production.
30
+ #
31
+ # marker-pdf>=1.0.0
32
+ # mineru
33
+ # unstructured>=0.15.0
scripts/__init__.py ADDED
File without changes
scripts/run_space_smoke.py ADDED
@@ -0,0 +1,455 @@
1
+ """Space-side smoke validation runner.
2
+
3
+ Automates the smokes documented in docs/space_smoke.md so a Space operator
4
+ can run one command and get a JSON report of which smokes passed, which
5
+ were skipped (missing deps), and which failed (with diagnostic context).
6
+
7
+ Usage:
8
+
9
+ # Run all smokes that have their deps installed:
10
+ python -m scripts.run_space_smoke --output ./space_smoke_report.json
11
+
12
+ # Run only a subset:
13
+ python -m scripts.run_space_smoke --smoke lexical --smoke ablation
14
+
15
+ # Force-fail on skipped smokes (CI-style strict mode):
16
+ python -m scripts.run_space_smoke --strict
17
+
18
+ The runner does NOT install missing dependencies — that's deliberately the
19
+ operator's job (each smoke's deps add Space build time and download cost).
20
+ A skipped smoke prints the exact `pip install` line you'd need.
21
+
22
+ Smokes mirror docs/space_smoke.md:
23
+
24
+ lexical - model-free benchmark on a synthetic markdown corpus
25
+ ablation - per-parser ablation runner (text vs pymupdf)
26
+ embedding - sentence-transformers / jina-embeddings-v3 retrieval
27
+ gpu_repair - dry-run GPU escalation wiring against a malformed table
28
+ marker - probe that the Marker CLI adapter is installed and available
29
+ """
30
+
31
+ from __future__ import annotations
32
+
33
+ import argparse
34
+ import importlib.util
35
+ import json
36
+ import shutil
37
+ import tempfile
40
+ import time
41
+ from dataclasses import dataclass, field
42
+ from pathlib import Path
43
+ from typing import Any, Callable
44
+
45
+
47
+
48
+ @dataclass(slots=True)
49
+ class SmokeResult:
50
+ name: str
51
+ status: str # "pass" | "fail" | "skip" | "error"
52
+ elapsed_seconds: float = 0.0
53
+ detail: dict[str, Any] = field(default_factory=dict)
54
+ skip_reason: str = ""
55
+ install_hint: str = ""
56
+
57
+
58
+ @dataclass(slots=True)
59
+ class SmokeReport:
60
+ smokes: list[SmokeResult] = field(default_factory=list)
61
+
62
+ @property
63
+ def passed(self) -> bool:
64
+ return all(item.status in {"pass", "skip"} for item in self.smokes)
65
+
66
+ def to_dict(self) -> dict[str, Any]:
67
+ return {
68
+ "smokes": [
69
+ {
70
+ "name": item.name,
71
+ "status": item.status,
72
+ "elapsed_seconds": round(item.elapsed_seconds, 3),
73
+ "detail": item.detail,
74
+ "skip_reason": item.skip_reason,
75
+ "install_hint": item.install_hint,
76
+ }
77
+ for item in self.smokes
78
+ ],
79
+ "summary": {
80
+ "total": len(self.smokes),
81
+ "passed": sum(1 for item in self.smokes if item.status == "pass"),
82
+ "failed": sum(1 for item in self.smokes if item.status == "fail"),
83
+ "errored": sum(1 for item in self.smokes if item.status == "error"),
84
+ "skipped": sum(1 for item in self.smokes if item.status == "skip"),
85
+ },
86
+ }
87
+
88
+
89
+ # --- Individual smokes -------------------------------------------------------
90
+
91
+
92
+ def _make_distinctive_corpus(root: Path) -> Path:
93
+ """Build a small corpus with three sentences distinct enough that the
94
+ synthetic-QA generator picks one query per chunk."""
95
+
96
+ src = root / "in"
97
+ src.mkdir()
98
+ (src / "doc.md").write_text(
99
+ "# Sample Doc\n\n"
100
+ "Apples grow on trees in the orchard during autumn harvest season.\n\n"
101
+ "Submarines navigate beneath the ocean using sonar pulses across waters.\n\n"
102
+ "Mountains rise above the clouds in the distant horizon line.\n",
103
+ encoding="utf-8",
104
+ )
105
+ return src
106
+
107
+
108
+ def smoke_lexical() -> SmokeResult:
109
+ started = time.perf_counter()
110
+ from zsgdp.benchmarks.parser_quality import run_parser_benchmark
111
+
112
+ with tempfile.TemporaryDirectory() as tmp:
113
+ tmp_path = Path(tmp)
114
+ src = _make_distinctive_corpus(tmp_path)
115
+ out = tmp_path / "out"
116
+ try:
117
+ summary = run_parser_benchmark(src, out, dataset_name="custom_folder")
118
+ except Exception as exc:
119
+ return SmokeResult(
120
+ name="lexical",
121
+ status="error",
122
+ elapsed_seconds=time.perf_counter() - started,
123
+ detail={"exception": str(exc)},
124
+ )
125
+
126
+ quality = float(summary.get("mean_quality_score", 0.0))
127
+ recall = float(summary.get("mean_retrieval_recall_at_1", 0.0))
128
+ passed = quality >= 0.85 and recall >= 0.7
129
+ return SmokeResult(
130
+ name="lexical",
131
+ status="pass" if passed else "fail",
132
+ elapsed_seconds=time.perf_counter() - started,
133
+ detail={
134
+ "mean_quality_score": quality,
135
+ "mean_retrieval_recall_at_1": recall,
136
+ "documents_evaluated": summary.get("document_count"),
137
+ },
138
+ )
139
+
140
+
141
+ def smoke_ablation() -> SmokeResult:
142
+ started = time.perf_counter()
143
+ from zsgdp.benchmarks.ablation_runner import run_parser_ablations
144
+
145
+ with tempfile.TemporaryDirectory() as tmp:
146
+ tmp_path = Path(tmp)
147
+ src = _make_distinctive_corpus(tmp_path)
148
+ out = tmp_path / "out"
149
+ try:
150
+ comparison = run_parser_ablations(
151
+ src,
152
+ out,
153
+ parsers=["text", "pymupdf"],
154
+ dataset_name="custom_folder",
155
+ )
156
+ except Exception as exc:
157
+ return SmokeResult(
158
+ name="ablation",
159
+ status="error",
160
+ elapsed_seconds=time.perf_counter() - started,
161
+ detail={"exception": str(exc)},
162
+ )
163
+
164
+ comparison_csv_exists = (out / "ablation_comparison.csv").exists()
165
+
166
+ arms = [row["arm"] for row in comparison["rows"]]
167
+ expected_arms = {"text", "pymupdf", "merged"}
168
+ passed = comparison["arm_count"] == 3 and set(arms) == expected_arms and comparison_csv_exists
169
+ return SmokeResult(
170
+ name="ablation",
171
+ status="pass" if passed else "fail",
172
+ elapsed_seconds=time.perf_counter() - started,
173
+ detail={
174
+ "arm_count": comparison["arm_count"],
175
+ "arms": arms,
176
+ "comparison_csv_emitted": comparison_csv_exists,
177
+ },
178
+ )
179
+
180
+
181
+ def smoke_embedding() -> SmokeResult:
182
+ started = time.perf_counter()
183
+ if importlib.util.find_spec("sentence_transformers") is None:
184
+ return SmokeResult(
185
+ name="embedding",
186
+ status="skip",
187
+ elapsed_seconds=time.perf_counter() - started,
188
+ skip_reason="sentence-transformers not installed",
189
+ install_hint="python -m pip install 'zero-shot-gpu-doc-parser[embedding]'",
190
+ )
191
+
192
+ from zsgdp.benchmarks.embedding_retriever import EmbeddingRetriever
193
+ from zsgdp.benchmarks.parser_quality import run_parser_benchmark
194
+
195
+ # Try to load the configured embedding model. If the load fails (no HF
196
+ # token, download error, OOM at import time), we report it as a skip
197
+ # with the exception text so the operator sees what to fix without the
198
+ # whole smoke run blowing up.
199
+ try:
200
+ retriever = EmbeddingRetriever()
201
+ retriever._ensure_embedder() # type: ignore[attr-defined] # private but intentional
202
+ except Exception as exc:
203
+ return SmokeResult(
204
+ name="embedding",
205
+ status="skip",
206
+ elapsed_seconds=time.perf_counter() - started,
207
+ skip_reason=f"embedding model failed to load: {exc}",
208
+ install_hint="Set HF_TOKEN if the model is gated, or downsize via "
209
+ "benchmarks.retriever.model_id (e.g. sentence-transformers/all-MiniLM-L6-v2).",
210
+ )
211
+
212
+ with tempfile.TemporaryDirectory() as tmp:
214
+ tmp_path = Path(tmp)
215
+ src = _make_distinctive_corpus(tmp_path)
216
+ out = tmp_path / "out"
217
+ config_path = tmp_path / "config.yaml"
218
+ # Inline config write — keeps the smoke self-contained.
219
+ config_path.write_text(
220
+ "benchmarks:\n retriever:\n backend: embedding\n",
221
+ encoding="utf-8",
222
+ )
223
+ try:
224
+ summary = run_parser_benchmark(src, out, config_path=config_path, dataset_name="custom_folder")
225
+ except Exception as exc:
226
+ return SmokeResult(
227
+ name="embedding",
228
+ status="error",
229
+ elapsed_seconds=time.perf_counter() - started,
230
+ detail={"exception": str(exc)},
231
+ )
232
+
233
+ recall_5 = float(summary.get("mean_retrieval_recall_at_5", 0.0))
234
+ passed = recall_5 >= 0.7
235
+ return SmokeResult(
236
+ name="embedding",
237
+ status="pass" if passed else "fail",
238
+ elapsed_seconds=time.perf_counter() - started,
239
+ detail={
240
+ "mean_retrieval_recall_at_5": recall_5,
241
+ "mean_retrieval_recall_at_1": float(summary.get("mean_retrieval_recall_at_1", 0.0)),
242
+ "documents_evaluated": summary.get("document_count"),
243
+ },
244
+ )
245
+
246
+
247
+ def smoke_gpu_repair() -> SmokeResult:
248
+ started = time.perf_counter()
249
+ if importlib.util.find_spec("transformers") is None:
250
+ return SmokeResult(
251
+ name="gpu_repair",
252
+ status="skip",
253
+ elapsed_seconds=time.perf_counter() - started,
254
+ skip_reason="transformers not installed",
255
+ install_hint="python -m pip install 'zero-shot-gpu-doc-parser[gpu_repair]'",
256
+ )
257
+
258
+ # Don't actually instantiate the transformers pipeline here — it would
259
+ # download multi-GB Qwen2.5-VL weights even on a dry probe. Instead, we
260
+ # smoke-test the wiring: a dry-run task plan, and report whether the
261
+ # underlying client class can be imported. Operators who want a real
262
+ # model invocation should use `run-gpu-tasks --execute` against a parsed
263
+ # output directory; the result lands in repair.gpu_escalation.results.
264
+ from zsgdp.gpu.transformers_client import TransformersClient
265
+ from zsgdp.pipeline import parse_document
266
+
267
+ with tempfile.TemporaryDirectory() as tmp:
268
+ tmp_path = Path(tmp)
269
+ src = tmp_path / "report.md"
270
+ # Malformed table (header has 2 columns; data row has 3) forces the
271
+ # repair loop to plan a table_vlm_repair task.
272
+ src.write_text(
273
+ "# Report\n\n| A | B |\n| --- | --- |\n| 1 | 2 | 3 |\n",
274
+ encoding="utf-8",
275
+ )
276
+ out = tmp_path / "out"
277
+ try:
278
+ parsed = parse_document(src, out)
279
+ except Exception as exc:
280
+ return SmokeResult(
281
+ name="gpu_repair",
282
+ status="error",
283
+ elapsed_seconds=time.perf_counter() - started,
284
+ detail={"exception": str(exc)},
285
+ )
286
+
287
+ repair = parsed.provenance.get("repair", {})
288
+ gpu_escalation = repair.get("gpu_escalation") or {}
289
+ task_count = int(gpu_escalation.get("task_count") or 0)
290
+ iterations = parsed.provenance.get("repair_iterations") or []
291
+ # We can confirm:
292
+ # * Dry-run plan ran (task_count >= 1 for the malformed table)
293
+ # * The repair loop iterated at least once
294
+ # * The TransformersClient class is importable for live execution
295
+ can_execute = TransformersClient is not None
296
+ passed = task_count >= 1 and len(iterations) >= 1 and can_execute
297
+ return SmokeResult(
298
+ name="gpu_repair",
299
+ status="pass" if passed else "fail",
300
+ elapsed_seconds=time.perf_counter() - started,
301
+ detail={
302
+ "dry_run_task_count": task_count,
303
+ "repair_iterations": len(iterations),
304
+ "transformers_client_importable": can_execute,
305
+ "note": "This smoke verifies wiring only. To verify model invocation "
306
+ "end-to-end, set repair.execute_gpu_escalations=true in config "
307
+ "and run zsgdp run-gpu-tasks --execute against a parsed dir.",
308
+ },
309
+ )
310
+
311
+
312
+ def smoke_marker() -> SmokeResult:
313
+ started = time.perf_counter()
314
+ if shutil.which("marker_single") is None and shutil.which("marker") is None:
315
+ return SmokeResult(
316
+ name="marker",
317
+ status="skip",
318
+ elapsed_seconds=time.perf_counter() - started,
319
+ skip_reason="neither `marker_single` nor `marker` found on PATH",
320
+ install_hint="python -m pip install marker-pdf",
321
+ )
322
+
323
+ # Marker is heavy enough that even a probe call can take 30+s on first
324
+ # invocation (model load). We confirm the registry adapter reports
325
+ # available, but don't run a full parse here — surface that as a manual
326
+ # follow-up via the smoke checklist.
327
+ from zsgdp.parsers.registry import get_parser
328
+
329
+ try:
330
+ adapter = get_parser("marker")
331
+ except KeyError as exc:
332
+ return SmokeResult(
333
+ name="marker",
334
+ status="error",
335
+ elapsed_seconds=time.perf_counter() - started,
336
+ detail={"exception": str(exc)},
337
+ )
338
+ available = bool(adapter.available())
339
+ return SmokeResult(
340
+ name="marker",
341
+ status="pass" if available else "fail",
342
+ elapsed_seconds=time.perf_counter() - started,
343
+ detail={
344
+ "adapter_reports_available": available,
345
+ "note": "End-to-end Marker parse is intentionally not run here "
346
+ "(cold-load is heavy). See docs/space_smoke.md Smoke 5 "
347
+ "for the manual upload-and-parse procedure.",
348
+ },
349
+ )
350
+
351
+
352
+ SMOKE_REGISTRY: dict[str, Callable[[], SmokeResult]] = {
353
+ "lexical": smoke_lexical,
354
+ "ablation": smoke_ablation,
355
+ "embedding": smoke_embedding,
356
+ "gpu_repair": smoke_gpu_repair,
357
+ "marker": smoke_marker,
358
+ }
359
+
360
+
361
+ # --- Driver ------------------------------------------------------------------
362
+
363
+
364
+ def run_smokes(names: list[str] | None = None) -> SmokeReport:
365
+ selected = names or list(SMOKE_REGISTRY)
366
+ report = SmokeReport()
367
+ for name in selected:
368
+ smoke = SMOKE_REGISTRY.get(name)
369
+ if smoke is None:
370
+ report.smokes.append(
371
+ SmokeResult(
372
+ name=name,
373
+ status="error",
374
+ detail={"exception": f"unknown smoke: {name}"},
375
+ )
376
+ )
377
+ continue
378
+ try:
379
+ result = smoke()
380
+ except Exception as exc:
381
+ result = SmokeResult(
382
+ name=name,
383
+ status="error",
384
+ detail={"exception": f"{type(exc).__name__}: {exc}"},
385
+ )
386
+ report.smokes.append(result)
387
+ return report
388
+
389
+
390
+ def format_text_summary(report: SmokeReport, *, strict: bool = False) -> str:
391
+ lines: list[str] = []
392
+ for item in report.smokes:
393
+ marker = {
394
+ "pass": "ok",
395
+ "fail": "FAIL",
396
+ "skip": "skip",
397
+ "error": "ERROR",
398
+ }.get(item.status, item.status.upper())
399
+ line = f" [{marker}] {item.name} ({item.elapsed_seconds:.2f}s)"
400
+ if item.status == "skip":
401
+ line += f" reason={item.skip_reason}"
402
+ elif item.status == "fail":
403
+ line += f" detail={json.dumps(item.detail, default=str)}"
404
+ elif item.status == "error":
405
+ line += f" detail={json.dumps(item.detail, default=str)}"
406
+ lines.append(line)
407
+
408
+ summary = report.to_dict()["summary"]
409
+ overall = "PASS" if (report.passed and (not strict or summary["skipped"] == 0)) else "FAIL"
410
+ lines.append(
411
+ f"smoke: {overall} passed={summary['passed']} failed={summary['failed']} "
412
+ f"errored={summary['errored']} skipped={summary['skipped']}"
413
+ )
414
+ return "\n".join(lines)
415
+
416
+
417
+ def main(argv: list[str] | None = None) -> int:
418
+ parser = argparse.ArgumentParser(
419
+ prog="run_space_smoke",
420
+ description="Run zsgdp Space-side smoke validations.",
421
+ )
422
+ parser.add_argument(
423
+ "--smoke",
424
+ action="append",
425
+ dest="smokes",
426
+ choices=list(SMOKE_REGISTRY),
427
+ help="Smoke to run. Repeat to run multiple. Default: all registered smokes.",
428
+ )
429
+ parser.add_argument("--output", help="Optional JSON report path.")
430
+ parser.add_argument(
431
+ "--strict",
432
+ action="store_true",
433
+ help="Treat skipped smokes as failures (useful in CI when all deps must be present).",
434
+ )
435
+ args = parser.parse_args(argv)
436
+
437
+ report = run_smokes(args.smokes)
438
+ print(format_text_summary(report, strict=args.strict))
439
+
440
+ if args.output:
441
+ Path(args.output).write_text(
442
+ json.dumps(report.to_dict(), indent=2, ensure_ascii=False) + "\n",
443
+ encoding="utf-8",
444
+ )
445
+
446
+ summary = report.to_dict()["summary"]
447
+ if summary["failed"] or summary["errored"]:
448
+ return 1
449
+ if args.strict and summary["skipped"]:
450
+ return 1
451
+ return 0
452
+
453
+
454
+ if __name__ == "__main__":
455
+ raise SystemExit(main())
tests/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Test package."""
tests/regression/README.md ADDED
@@ -0,0 +1,97 @@
1
+ # Regression fixtures
2
+
3
+ Each fixture is a `(<name>.input.<ext>, <name>.expected.json)` pair under
4
+ `fixtures/`. The runner in `test_regression.py` parses every input through
5
+ `parse_document` and compares the resulting `ParsedDocument` against the
6
+ snapshot in `<name>.expected.json` with explicit tolerances.
7
+
8
+ ## Fixture file shape
9
+
10
+ `<name>.expected.json` has these keys (all optional except `name`):
11
+
12
+ ```json
13
+ {
14
+ "name": "human-readable identifier",
15
+ "config": "configs/docling.yaml",
16
+ "selected_parsers": ["text"],
17
+ "tolerances": {
18
+ "quality_score_min": 0.85,
19
+ "element_count_range": [3, 6],
20
+ "table_count": 1,
21
+ "figure_count": 0,
22
+ "chunk_count_min": 1,
23
+ "blocking_failures": false,
24
+ "must_contain_markdown": ["# Report", "Apples grow"],
25
+ "must_not_contain_markdown": ["TODO", "FIXME"]
26
+ }
27
+ }
28
+ ```
29
+
30
+ Tolerance keys (all optional):
31
+
32
+ - `quality_score_min` (float): assert `parsed.quality_report.score >= value`.
33
+ - `quality_score_max` (float): assert `parsed.quality_report.score <= value`.
34
+ - `element_count` (int) or `element_count_range` ([min, max]).
35
+ - `table_count` (int) or `table_count_range`.
36
+ - `figure_count` (int) or `figure_count_range`.
37
+ - `chunk_count_min` (int): assert at least N chunks.
38
+ - `chunk_count_max` (int): assert at most N chunks.
39
+ - `blocking_failures` (bool): assert `quality_report.has_blocking_failures` matches.
40
+ - `must_contain_markdown` (list[str]): each string must appear in
41
+ `parsed.to_markdown()`.
42
+ - `must_not_contain_markdown` (list[str]): each string must NOT appear.
43
+ - `must_contain_quality_metrics` (list[str]): each metric key must appear in
44
+ `quality_report.metrics`.
45
+ - `parser_disagreement_rate_max` (float): assert disagreement <= value.
46
+ - `repair_resolution_rate_min` (float): assert resolution >= value.
47
+
48
+ Missing keys are not asserted (no false failures from over-specification).
49
+
50
+ ## Adding a fixture
51
+
52
+ 1. Drop the input document under `fixtures/`. PDFs, markdown, html, txt all
53
+ work via the standard pipeline.
54
+ 2. Run a one-off `parse_document` against it locally and inspect the output.
55
+ 3. Hand-write `<name>.expected.json` with the constraints you want to lock
56
+ down. Prefer ranges over exact counts where reasonable variance exists.
57
+ 4. Run `python3.11 -m unittest tests.regression.test_regression`. It auto-discovers.
58
+
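+ A one-off inspection sketch for step 2 (public API, as used in
+ `examples/parse_pdf.py`; the fixture path is illustrative):
+
+ ```python
+ from zsgdp import parse_document
+
+ parsed = parse_document("tests/regression/fixtures/my_fixture.input.md", "/tmp/fixture_out")
+ print(parsed.quality_report.score, len(parsed.elements), len(parsed.tables), len(parsed.chunks))
+ print(parsed.to_markdown()[:500])
+ ```
+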
59
+ ## Performance baselines (opt-in)
60
+
61
+ A fixture may include a `performance` block with throughput floors:
62
+
63
+ ```json
64
+ {
65
+ "performance": {
66
+ "repeats": 2,
67
+ "max_elapsed_seconds": 2.0,
68
+ "min_pages_per_second": 0.5,
69
+ "always_enforce": false
70
+ }
71
+ }
72
+ ```
73
+
74
+ Keys:
75
+
76
+ - `repeats` (int, default 2): number of warm parses to time. The median
77
+ elapsed is compared against the floor so a single cold-import outlier
78
+ does not flag.
79
+ - `max_elapsed_seconds`: parse must finish under this in median.
80
+ - `min_pages_per_second`: median pages/sec must meet or beat this.
81
+ - `always_enforce` (bool, default false): when true, perf is always checked.
82
+
83
+ Otherwise perf is gated on `ZSGDP_REGRESSION_PERF=1` so slow CI runners
84
+ don't get noisy. Floors should be **catastrophic-regression guards** — set
85
+ them ~50–100x slacker than your local median, not tight perf bars. The
86
+ point is to catch "parsing a tiny markdown doc now takes 30 seconds,"
87
+ not to track 5% perf shifts.
88
+
89
+ To set a baseline for a new fixture: parse it 5 times locally, take the
90
+ median, multiply by ~10–80x for the `max_elapsed_seconds` floor.
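+
+ A minimal sketch of that procedure (synchronous `parse_document` calls, as
+ in `examples/parse_pdf.py`; the fixture path is illustrative):
+
+ ```python
+ import statistics
+ import tempfile
+ import time
+ from pathlib import Path
+
+ from zsgdp import parse_document
+
+ times = []
+ for _ in range(5):
+     with tempfile.TemporaryDirectory() as tmp:
+         started = time.perf_counter()
+         parse_document("tests/regression/fixtures/my_fixture.input.md", Path(tmp) / "out")
+         times.append(time.perf_counter() - started)
+ # ~10-80x slack over the warm median; pick the multiplier per fixture.
+ print(f"max_elapsed_seconds floor: {statistics.median(times) * 50:.2f}")
+ ```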
91
+
92
+ ## When a regression fires
93
+
94
+ The failure message points at the specific tolerance that broke. Don't blindly
95
+ loosen the tolerance — investigate whether the regression is real first
96
+ (parser-version bump, repair-loop drift, chunk planner change). If the new
97
+ behavior is intentional and better, regenerate the snapshot.
tests/regression/__init__.py ADDED
File without changes
tests/regression/fixtures/markdown_basic.expected.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "name": "markdown_basic",
3
+ "tolerances": {
4
+ "quality_score_min": 0.9,
5
+ "blocking_failures": false,
6
+ "element_count_range": [4, 8],
7
+ "table_count": 1,
8
+ "figure_count": 0,
9
+ "chunk_count_min": 4,
10
+ "must_contain_markdown": [
11
+ "# Quarterly Report",
12
+ "Apples grow on trees in the orchard",
13
+ "| Region | Q1 | Q2 |",
14
+ "Submarines navigate beneath the ocean"
15
+ ],
16
+ "must_not_contain_markdown": ["TODO", "FIXME"],
17
+ "must_contain_quality_metrics": [
18
+ "document_text_coverage",
19
+ "parser_disagreement_rate",
20
+ "repair_resolution_rate"
21
+ ],
22
+ "parser_disagreement_rate_max": 0.5,
23
+ "repair_resolution_rate_min": 0.5
24
+ },
25
+ "performance": {
26
+ "_comment": "Floors are catastrophic-regression guards, not tight perf bars. Median of 2 warm runs (cold-import outlier dropped) was ~6ms locally; the 2.0s floor leaves roughly 300x headroom to absorb slow CI. Enable with ZSGDP_REGRESSION_PERF=1 or set always_enforce: true.",
27
+ "repeats": 2,
28
+ "max_elapsed_seconds": 2.0,
29
+ "min_pages_per_second": 0.5
30
+ }
31
+ }
tests/regression/fixtures/markdown_basic.input.md ADDED
@@ -0,0 +1,14 @@
1
+ # Quarterly Report
2
+
3
+ Apples grow on trees in the orchard during the autumn harvest season.
4
+
5
+ ## Revenue
6
+
7
+ | Region | Q1 | Q2 |
8
+ | --- | --- | --- |
9
+ | North America | 10 | 12 |
10
+ | Europe | 8 | 9 |
11
+
12
+ ## Outlook
13
+
14
+ Submarines navigate beneath the ocean using sonar pulses across waters.
tests/regression/test_regression.py ADDED
@@ -0,0 +1,255 @@
1
+ """Snapshot regression tests against fixtures in this directory.
2
+
3
+ Discovery: every <name>.expected.json under fixtures/ pairs with a sibling
4
+ <name>.input.<ext>. The runner parses the input, then asserts each tolerance
5
+ in the expected file. Tolerance keys are documented in fixtures/README.md.
6
+
7
+ Performance baselines are opt-in per fixture via a `performance` block in
8
+ the expected file. They run only when ZSGDP_REGRESSION_PERF=1 (or when the
9
+ performance block has `always_enforce: true`) so a slow CI runner does not
10
+ fail on transient noise. When enabled, the parse is run twice and the
11
+ median elapsed time is compared against the floor.
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ import json
17
+ import os
18
+ import statistics
19
+ import tempfile
20
+ import time
21
+ import unittest
22
+ import unittest.mock
23
+ from pathlib import Path
24
+ from typing import Any
25
+
26
+ from zsgdp.pipeline import parse_document
27
+
28
+ FIXTURE_DIR = Path(__file__).parent / "fixtures"
29
+
30
+
31
+ def _discover_fixtures() -> list[tuple[str, Path, Path]]:
32
+ pairs: list[tuple[str, Path, Path]] = []
33
+ if not FIXTURE_DIR.exists():
34
+ return pairs
35
+ for expected in sorted(FIXTURE_DIR.glob("*.expected.json")):
36
+ name = expected.name[: -len(".expected.json")]
37
+ candidates = sorted(FIXTURE_DIR.glob(f"{name}.input.*"))
38
+ if not candidates:
39
+ continue
40
+ pairs.append((name, candidates[0], expected))
41
+ return pairs
42
+
43
+
44
+ def _check_int_or_range(actual: int, exact: Any, range_value: Any, label: str) -> str | None:
45
+ if exact is not None and int(exact) != actual:
46
+ return f"{label}: expected {exact}, got {actual}"
47
+ if isinstance(range_value, (list, tuple)) and len(range_value) == 2:
48
+ lo, hi = int(range_value[0]), int(range_value[1])
49
+ if not (lo <= actual <= hi):
50
+ return f"{label}: expected in [{lo}, {hi}], got {actual}"
51
+ return None
52
+
53
+
54
+ def _evaluate(parsed, tolerances: dict[str, Any]) -> list[str]:
55
+ failures: list[str] = []
56
+ score = float(parsed.quality_report.score)
57
+ if "quality_score_min" in tolerances and score < float(tolerances["quality_score_min"]):
58
+ failures.append(f"quality_score: {score:.3f} < {tolerances['quality_score_min']}")
59
+ if "quality_score_max" in tolerances and score > float(tolerances["quality_score_max"]):
60
+ failures.append(f"quality_score: {score:.3f} > {tolerances['quality_score_max']}")
61
+
62
+ for label, count, exact_key, range_key in (
63
+ ("element_count", len(parsed.elements), "element_count", "element_count_range"),
64
+ ("table_count", len(parsed.tables), "table_count", "table_count_range"),
65
+ ("figure_count", len(parsed.figures), "figure_count", "figure_count_range"),
66
+ ):
67
+ message = _check_int_or_range(count, tolerances.get(exact_key), tolerances.get(range_key), label)
68
+ if message:
69
+ failures.append(message)
70
+
71
+ chunk_count = len(parsed.chunks)
72
+ if "chunk_count_min" in tolerances and chunk_count < int(tolerances["chunk_count_min"]):
73
+ failures.append(f"chunk_count: {chunk_count} < {tolerances['chunk_count_min']}")
74
+ if "chunk_count_max" in tolerances and chunk_count > int(tolerances["chunk_count_max"]):
75
+ failures.append(f"chunk_count: {chunk_count} > {tolerances['chunk_count_max']}")
76
+
77
+ if "blocking_failures" in tolerances:
78
+ actual = parsed.quality_report.has_blocking_failures
79
+ expected = bool(tolerances["blocking_failures"])
80
+ if actual != expected:
81
+ failures.append(f"blocking_failures: expected {expected}, got {actual}")
82
+
83
+ md = parsed.to_markdown()
84
+ for needle in tolerances.get("must_contain_markdown", []) or []:
85
+ if str(needle) not in md:
86
+ failures.append(f"must_contain_markdown: {needle!r} not found")
87
+ for needle in tolerances.get("must_not_contain_markdown", []) or []:
88
+ if str(needle) in md:
89
+ failures.append(f"must_not_contain_markdown: {needle!r} present")
90
+
91
+ metrics = parsed.quality_report.metrics
92
+ for key in tolerances.get("must_contain_quality_metrics", []) or []:
93
+ if key not in metrics:
94
+ failures.append(f"must_contain_quality_metrics: {key!r} missing")
95
+
96
+ if "parser_disagreement_rate_max" in tolerances:
97
+ rate = float(metrics.get("parser_disagreement_rate", 0.0))
98
+ if rate > float(tolerances["parser_disagreement_rate_max"]):
99
+ failures.append(
100
+ f"parser_disagreement_rate: {rate:.3f} > {tolerances['parser_disagreement_rate_max']}"
101
+ )
102
+ if "repair_resolution_rate_min" in tolerances:
103
+ rate = float(metrics.get("repair_resolution_rate", 1.0))
104
+ if rate < float(tolerances["repair_resolution_rate_min"]):
105
+ failures.append(
106
+ f"repair_resolution_rate: {rate:.3f} < {tolerances['repair_resolution_rate_min']}"
107
+ )
108
+
109
+ return failures
110
+
111
+
112
+ def _perf_enforcement_enabled(performance: dict[str, Any]) -> bool:
113
+ if performance.get("always_enforce"):
114
+ return True
115
+ return os.environ.get("ZSGDP_REGRESSION_PERF", "").strip().lower() in {"1", "true", "yes"}
116
+
117
+
118
+ def _measure_parse(input_path: Path, *, config_path: Path | None, selected_parsers, repeats: int) -> tuple[Any, list[float]]:
119
+ """Parse the input N times, returning (last_parsed, list_of_elapsed_seconds).
120
+
121
+ Uses a fresh temp output directory for each run so disk caching effects
122
+ are roughly equal across runs. The last parsed document is returned for
123
+ tolerance evaluation; per-run elapsed times feed the perf assertion.
124
+ """
125
+
126
+ elapsed: list[float] = []
127
+ parsed = None
128
+ for _ in range(max(1, repeats)):
129
+ with tempfile.TemporaryDirectory() as tmp:
130
+ started = time.perf_counter()
131
+ parsed = parse_document(
132
+ input_path,
133
+ Path(tmp) / "out",
134
+ config_path=config_path if config_path else None,
135
+ selected_parsers=selected_parsers,
136
+ )
137
+ elapsed.append(time.perf_counter() - started)
138
+ return parsed, elapsed
139
+
140
+
141
+ def _evaluate_performance(parsed, performance: dict[str, Any], elapsed_seconds: list[float]) -> list[str]:
142
+ failures: list[str] = []
143
+ if not elapsed_seconds:
144
+ return failures
145
+
146
+ median_elapsed = statistics.median(elapsed_seconds)
147
+ page_count = max(len(parsed.pages), 1)
148
+ median_pages_per_second = page_count / median_elapsed if median_elapsed > 0 else float("inf")
149
+
150
+ max_elapsed = performance.get("max_elapsed_seconds")
151
+ if max_elapsed is not None and median_elapsed > float(max_elapsed):
152
+ failures.append(
153
+ f"performance.max_elapsed_seconds: median {median_elapsed:.2f}s > {max_elapsed}s "
154
+ f"(runs={len(elapsed_seconds)})"
155
+ )
156
+
157
+ min_pps = performance.get("min_pages_per_second")
158
+ if min_pps is not None and median_pages_per_second < float(min_pps):
159
+ failures.append(
160
+ f"performance.min_pages_per_second: median {median_pages_per_second:.2f} < {min_pps} "
161
+ f"(runs={len(elapsed_seconds)})"
162
+ )
163
+
164
+ return failures
165
+
166
+
167
+ class RegressionFixturesTest(unittest.TestCase):
168
+ def test_regression_fixtures_match_snapshots(self):
169
+ fixtures = _discover_fixtures()
170
+ if not fixtures:
171
+ self.skipTest("No regression fixtures present.")
172
+
173
+ all_failures: list[str] = []
174
+ for name, input_path, expected_path in fixtures:
175
+ with self.subTest(fixture=name):
176
+ expected = json.loads(expected_path.read_text(encoding="utf-8"))
177
+ tolerances = expected.get("tolerances") or {}
178
+ performance = expected.get("performance") or {}
179
+ config_rel = expected.get("config")
180
+ config_path = Path(config_rel) if config_rel else None
181
+ if config_path and not config_path.is_absolute():
182
+ config_path = Path(__file__).resolve().parents[2] / config_path
183
+ selected_parsers = expected.get("selected_parsers")
184
+
185
+ perf_enabled = bool(performance) and _perf_enforcement_enabled(performance)
186
+ repeats = int(performance.get("repeats", 2)) if perf_enabled else 1
187
+
188
+ parsed, elapsed = _measure_parse(
189
+ input_path,
190
+ config_path=config_path,
191
+ selected_parsers=selected_parsers,
192
+ repeats=repeats,
193
+ )
194
+
195
+ failures = _evaluate(parsed, tolerances)
196
+ if perf_enabled:
197
+ failures.extend(_evaluate_performance(parsed, performance, elapsed))
198
+ if failures:
199
+ all_failures.append(f"[{name}] " + "; ".join(failures))
200
+
201
+ if all_failures:
202
+ self.fail("\n".join(all_failures))
203
+
204
+
205
+ class PerformanceEvaluatorTests(unittest.TestCase):
206
+ """Unit tests for the perf-evaluation helpers, separate from fixture discovery."""
207
+
208
+ def test_max_elapsed_floor_fires_when_too_slow(self):
209
+ from types import SimpleNamespace
210
+
211
+ parsed = SimpleNamespace(pages=[{"page_num": 1}])
212
+ failures = _evaluate_performance(parsed, {"max_elapsed_seconds": 0.1}, [0.5, 0.5])
213
+ self.assertEqual(len(failures), 1)
214
+ self.assertIn("max_elapsed_seconds", failures[0])
215
+
216
+ def test_min_pages_per_second_fires_when_too_slow(self):
217
+ from types import SimpleNamespace
218
+
219
+ parsed = SimpleNamespace(pages=[{"page_num": 1}])
220
+ # 1 page in 10s => 0.1 pps, floor 1.0 => fail.
221
+ failures = _evaluate_performance(parsed, {"min_pages_per_second": 1.0}, [10.0, 10.0])
222
+ self.assertEqual(len(failures), 1)
223
+ self.assertIn("min_pages_per_second", failures[0])
224
+
225
+ def test_passing_floors_yield_no_failures(self):
226
+ from types import SimpleNamespace
227
+
228
+ parsed = SimpleNamespace(pages=[{"page_num": 1}, {"page_num": 2}])
229
+ # 2 pages in 0.5s => 4 pps; floor 1.0 pps and max 2s.
230
+ failures = _evaluate_performance(
231
+ parsed,
232
+ {"max_elapsed_seconds": 2.0, "min_pages_per_second": 1.0},
233
+ [0.5, 0.5, 0.5],
234
+ )
235
+ self.assertEqual(failures, [])
236
+
237
+ def test_median_strips_cold_outlier(self):
238
+ from types import SimpleNamespace
239
+
240
+ parsed = SimpleNamespace(pages=[{"page_num": 1}])
241
+ # First run cold (5s), next two warm (0.1s). Median = 0.1s; floor 1s passes.
242
+ failures = _evaluate_performance(parsed, {"max_elapsed_seconds": 1.0}, [5.0, 0.1, 0.1])
243
+ self.assertEqual(failures, [])
244
+
245
+ def test_perf_enforcement_gating(self):
246
+ with unittest.mock.patch.dict("os.environ", {"ZSGDP_REGRESSION_PERF": "0"}, clear=False):
247
+ self.assertFalse(_perf_enforcement_enabled({"max_elapsed_seconds": 1.0}))
248
+ self.assertTrue(_perf_enforcement_enabled({"always_enforce": True}))
249
+
250
+ with unittest.mock.patch.dict("os.environ", {"ZSGDP_REGRESSION_PERF": "1"}, clear=False):
251
+ self.assertTrue(_perf_enforcement_enabled({"max_elapsed_seconds": 1.0}))
252
+
253
+
254
+ if __name__ == "__main__":
255
+ unittest.main()
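The fixtures these tests discover live as sibling input/expected pairs, as _discover_fixtures() implies. A minimal expected-file sketch follows; the key names are exactly the ones the test above reads, every key is optional, and the values are illustrative only:

    {
      "config": "configs/default.yaml",
      "selected_parsers": ["text"],
      "tolerances": {
        "must_contain_quality_metrics": ["parser_disagreement_rate"],
        "parser_disagreement_rate_max": 0.25,
        "repair_resolution_rate_min": 0.75
      },
      "performance": {
        "repeats": 3,
        "max_elapsed_seconds": 5.0,
        "min_pages_per_second": 0.5,
        "always_enforce": false
      }
    }

Note the gating: the performance floors stay dormant unless always_enforce is true in the fixture or ZSGDP_REGRESSION_PERF=1 is exported, so ordinary test runs are not timing-sensitive.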
tests/test_ablation_runner.py ADDED
@@ -0,0 +1,133 @@
+ """Tests for parser-contribution metrics and the ablation runner."""
+
+ from __future__ import annotations
+
+ import json
+ import tempfile
+ import unittest
+ from pathlib import Path
+
+ from zsgdp.benchmarks.ablation_runner import ABLATION_METRIC_KEYS, run_parser_ablations
+ from zsgdp.benchmarks.parser_quality import run_parser_benchmark
+
+
+ class TestParserContribution(unittest.TestCase):
+     def test_contribution_counts_appear_in_summary(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             src = tmp / "in"
+             src.mkdir()
+             (src / "doc.md").write_text("# Doc\n\nA paragraph.\n", encoding="utf-8")
+
+             summary = run_parser_benchmark(src, tmp / "out", dataset_name="custom_folder")
+
+             doc = summary["documents"][0]
+             self.assertIn("parser_contribution_counts", doc)
+             self.assertIn("parser_contribution_fractions", doc)
+             self.assertGreater(sum(doc["parser_contribution_counts"].values()), 0)
+             # The sum of fractions should be ~1.0 across parsers.
+             total_fraction = sum(doc["parser_contribution_fractions"].values())
+             self.assertAlmostEqual(total_fraction, 1.0, places=6)
+
+             top_summary = summary["parser_contribution_summary"]
+             self.assertGreater(top_summary["total"], 0)
+             self.assertEqual(set(top_summary["counts"]), set(top_summary["fractions"]))
+
+     def test_text_parser_dominates_markdown_doc(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             src = tmp / "in"
+             src.mkdir()
+             (src / "doc.md").write_text("# Doc\n\nPara one.\n\nPara two.\n", encoding="utf-8")
+
+             summary = run_parser_benchmark(src, tmp / "out", dataset_name="custom_folder")
+
+             top_counts = summary["parser_contribution_summary"]["counts"]
+             self.assertIn("text", top_counts)
+             text_count = top_counts["text"]
+             other_count = sum(value for parser, value in top_counts.items() if parser != "text")
+             self.assertGreaterEqual(text_count, other_count)
+
+
+ class TestRunParserAblations(unittest.TestCase):
+     def test_two_arms_plus_merged(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             src = tmp / "in"
+             src.mkdir()
+             (src / "doc.md").write_text("# Doc\n\nPara one.\n\nPara two.\n", encoding="utf-8")
+             out = tmp / "out"
+
+             comparison = run_parser_ablations(
+                 src,
+                 out,
+                 parsers=["text", "pymupdf"],
+                 dataset_name="custom_folder",
+             )
+
+             self.assertEqual(comparison["arm_count"], 3)
+             arms = sorted(row["arm"] for row in comparison["rows"])
+             self.assertEqual(arms, ["merged", "pymupdf", "text"])
+             self.assertTrue((out / "arm_text").exists())
+             self.assertTrue((out / "arm_pymupdf").exists())
+             self.assertTrue((out / "arm_merged").exists())
+             self.assertTrue((out / "ablation_comparison.csv").exists())
+             self.assertTrue((out / "ablation_summary.json").exists())
+
+             # Each arm record carries the canonical metric keys (subset of those present).
+             for row in comparison["rows"]:
+                 self.assertIn("mean_quality_score", row)
+
+     def test_no_merged_when_disabled(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             src = tmp / "in"
+             src.mkdir()
+             (src / "doc.md").write_text("# Doc\n\nPara.\n", encoding="utf-8")
+
+             comparison = run_parser_ablations(
+                 src,
+                 tmp / "out",
+                 parsers=["text", "pymupdf"],
+                 dataset_name="custom_folder",
+                 include_merged=False,
+             )
+             self.assertEqual(comparison["arm_count"], 2)
+             self.assertNotIn("merged", {row["arm"] for row in comparison["rows"]})
+
+     def test_single_parser_ablation_skips_merged_arm(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             src = tmp / "in"
+             src.mkdir()
+             (src / "doc.md").write_text("# Doc\n\nPara.\n", encoding="utf-8")
+
+             comparison = run_parser_ablations(
+                 src,
+                 tmp / "out",
+                 parsers=["text"],
+                 dataset_name="custom_folder",
+             )
+             # Single parser + include_merged defaults true, but len(parsers) == 1
+             # so merged would be redundant and is skipped.
+             self.assertEqual(comparison["arm_count"], 1)
+             self.assertEqual(comparison["rows"][0]["arm"], "text")
+
+     def test_empty_parsers_raises(self):
+         with self.assertRaises(ValueError):
+             run_parser_ablations(".", "./out", parsers=[])
+
+     def test_metric_keys_constant_matches_summary_shape(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             src = tmp / "in"
+             src.mkdir()
+             (src / "doc.md").write_text("# Doc\n\nPara.\n", encoding="utf-8")
+
+             summary = run_parser_benchmark(src, tmp / "out", dataset_name="custom_folder")
+             for key in ABLATION_METRIC_KEYS:
+                 self.assertIn(key, summary, f"benchmark summary missing key {key}")
+
+
+ if __name__ == "__main__":
+     unittest.main()
tests/test_app.py ADDED
@@ -0,0 +1,141 @@
+ import tempfile
+ import unittest
+ from pathlib import Path
+ from unittest.mock import patch
+
+ try:
+     import app as space_app
+ except RuntimeError as exc:
+     space_app = None
+     APP_IMPORT_ERROR = str(exc)
+ else:
+     APP_IMPORT_ERROR = ""
+
+
+ class _UploadedFile:
+     def __init__(self, name: str):
+         self.name = name
+
+
+ class AppTests(unittest.TestCase):
+     def test_parse_uploaded_document_returns_artifact_validation(self):
+         if space_app is None:
+             self.skipTest(APP_IMPORT_ERROR)
+
+         with tempfile.TemporaryDirectory() as tmp:
+             input_path = Path(tmp) / "sample.md"
+             input_path.write_text("# Report\n\nHello from the Space UI.\n", encoding="utf-8")
+
+             outputs = space_app.parse_uploaded_document(_UploadedFile(str(input_path)), "Default lightweight")
+
+             self.assertEqual(len(outputs), 11)
+             summary = outputs[1]
+             artifact_validation = outputs[8]
+             archive_path = outputs[9]
+             individual_files = outputs[10]
+             self.assertTrue(summary["artifact_manifest_valid"])
+             self.assertTrue(artifact_validation["valid"])
+             self.assertTrue(Path(archive_path).exists())
+             # Per-artifact downloads.
+             self.assertIsInstance(individual_files, list)
+             self.assertGreater(len(individual_files), 0)
+             names = [Path(p).name for p in individual_files]
+             # Core artifacts every parse should produce.
+             for required in ("parsed_document.json", "document.md", "chunks.jsonl", "artifact_manifest.json"):
+                 self.assertIn(required, names)
+             # Each path actually exists on disk so Gradio can serve it.
+             for path in individual_files:
+                 self.assertTrue(Path(path).exists(), f"missing: {path}")
+             # The archive zip is a separate artifact and must NOT appear in the
+             # per-artifact list (zip is the bundled-everything view).
+             self.assertNotIn(Path(archive_path).name, names)
+             # Summary records the per-artifact count.
+             self.assertEqual(summary["individual_artifact_count"], len(individual_files))
+
+
+ class UploadGuardTests(unittest.TestCase):
+     def test_oversized_upload_rejected_with_clear_message(self):
+         if space_app is None:
+             self.skipTest(APP_IMPORT_ERROR)
+
+         with tempfile.TemporaryDirectory() as tmp:
+             input_path = Path(tmp) / "huge.md"
+             input_path.write_text("# Big\n\n" + "x" * 4096, encoding="utf-8")
+
+             with patch.object(space_app, "MAX_UPLOAD_BYTES", 1024):
+                 outputs = space_app.parse_uploaded_document(
+                     _UploadedFile(str(input_path)), "Default lightweight"
+                 )
+
+             summary = outputs[1]
+             self.assertTrue(summary.get("rejected"))
+             self.assertIn("MB", summary["error"])
+
+     def test_high_page_count_rejected(self):
+         if space_app is None:
+             self.skipTest(APP_IMPORT_ERROR)
+
+         with tempfile.TemporaryDirectory() as tmp:
+             input_path = Path(tmp) / "doc.md"
+             input_path.write_text("# Doc\n\nSomething small.\n", encoding="utf-8")
+
+             class _FakeProfile:
+                 page_count = 1000
+
+             with patch.object(space_app, "MAX_PAGE_COUNT", 50), patch.object(
+                 space_app, "profile_document", return_value=_FakeProfile()
+             ):
+                 outputs = space_app.parse_uploaded_document(
+                     _UploadedFile(str(input_path)), "Default lightweight"
+                 )
+
+             summary = outputs[1]
+             self.assertTrue(summary.get("rejected"))
+             self.assertIn("pages", summary["error"])
+
+     def test_missing_upload_path_rejected(self):
+         if space_app is None:
+             self.skipTest(APP_IMPORT_ERROR)
+
+         outputs = space_app.parse_uploaded_document(
+             _UploadedFile("/tmp/zsgdp-does-not-exist.md"), "Default lightweight"
+         )
+         summary = outputs[1]
+         self.assertTrue(summary.get("rejected"))
+         self.assertIn("missing", summary["error"].lower())
+
+     def test_error_paths_return_full_tuple_width(self):
+         # Drift guard: every return path (success + error) must yield 11 outputs
+         # so the Gradio click handler doesn't error on shape mismatch.
+         if space_app is None:
+             self.skipTest(APP_IMPORT_ERROR)
+
+         # No upload at all.
+         outputs = space_app.parse_uploaded_document(None, "Default lightweight")
+         self.assertEqual(len(outputs), 11)
+         self.assertEqual(outputs[10], [])
+
+         # Missing-file rejection.
+         outputs = space_app.parse_uploaded_document(
+             _UploadedFile("/tmp/zsgdp-does-not-exist-xyz.md"), "Default lightweight"
+         )
+         self.assertEqual(len(outputs), 11)
+         self.assertEqual(outputs[10], [])
+
+     def test_normal_upload_passes_guards(self):
+         if space_app is None:
+             self.skipTest(APP_IMPORT_ERROR)
+
+         with tempfile.TemporaryDirectory() as tmp:
+             input_path = Path(tmp) / "ok.md"
+             input_path.write_text("# OK\n\nA normal document.\n", encoding="utf-8")
+             outputs = space_app.parse_uploaded_document(
+                 _UploadedFile(str(input_path)), "Default lightweight"
+             )
+
+         summary = outputs[1]
+         self.assertNotIn("rejected", summary)
+
+
+ if __name__ == "__main__":
+     unittest.main()
tests/test_artifacts.py ADDED
@@ -0,0 +1,82 @@
+ import json
+ import tempfile
+ import unittest
+ from pathlib import Path
+
+ from zsgdp.artifacts import MANIFEST_SCHEMA_VERSION, validate_artifact_manifest
+ from zsgdp.cli import main
+ from zsgdp.pipeline import parse_document
+ from zsgdp.schema import SCHEMA_VERSION
+
+
+ class ArtifactManifestTests(unittest.TestCase):
+     def test_parse_writes_valid_artifact_manifest(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             input_path = tmp_path / "sample.md"
+             output_dir = tmp_path / "out"
+             input_path.write_text("# Report\n\nHello world.\n", encoding="utf-8")
+
+             parsed = parse_document(input_path, output_dir)
+             manifest = json.loads((output_dir / "artifact_manifest.json").read_text(encoding="utf-8"))
+             validation = validate_artifact_manifest(output_dir)
+
+             self.assertEqual(manifest["doc_id"], parsed.doc_id)
+             self.assertEqual(manifest["counts"]["chunks"], len(parsed.chunks))
+             self.assertTrue(any(record["path"] == "parsed_document.json" for record in manifest["files"]))
+             self.assertTrue(validation["valid"])
+             self.assertEqual(validation["checked_count"], manifest["artifact_count"])
+
+     def test_manifest_records_schema_versions(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             input_path = tmp_path / "sample.md"
+             output_dir = tmp_path / "out"
+             input_path.write_text("# Report\n\nHello.\n", encoding="utf-8")
+
+             parsed = parse_document(input_path, output_dir)
+             manifest = json.loads((output_dir / "artifact_manifest.json").read_text(encoding="utf-8"))
+
+             # Manifest format version is its own integer; parsed-document
+             # schema version is a string echoed from the dataclass.
+             self.assertEqual(manifest["schema_version"], MANIFEST_SCHEMA_VERSION)
+             self.assertEqual(manifest["parsed_document_schema_version"], SCHEMA_VERSION)
+             self.assertEqual(parsed.schema_version, SCHEMA_VERSION)
+
+             # Validation echoes both versions so callers can gate on them.
+             validation = validate_artifact_manifest(output_dir)
+             self.assertEqual(validation["manifest_schema_version"], MANIFEST_SCHEMA_VERSION)
+             self.assertEqual(validation["parsed_document_schema_version"], SCHEMA_VERSION)
+
+     def test_validate_artifact_manifest_detects_checksum_mismatch(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             input_path = tmp_path / "sample.md"
+             output_dir = tmp_path / "out"
+             input_path.write_text("# Report\n\nHello world.\n", encoding="utf-8")
+             parse_document(input_path, output_dir)
+
+             (output_dir / "document.md").write_text("tampered\n", encoding="utf-8")
+             validation = validate_artifact_manifest(output_dir)
+
+             self.assertFalse(validation["valid"])
+             self.assertTrue(any("SHA-256 mismatch: document.md" == error for error in validation["errors"]))
+
+     def test_validate_artifacts_cli_writes_report(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             input_path = tmp_path / "sample.md"
+             output_dir = tmp_path / "out"
+             report_path = tmp_path / "validation.json"
+             input_path.write_text("# Report\n\nHello world.\n", encoding="utf-8")
+             parse_document(input_path, output_dir)
+
+             code = main(["validate-artifacts", "--parsed", str(output_dir), "--output", str(report_path)])
+
+             self.assertEqual(code, 0)
+             self.assertTrue(report_path.exists())
+             self.assertTrue(json.loads(report_path.read_text(encoding="utf-8"))["valid"])
+
+
+ if __name__ == "__main__":
+     unittest.main()
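Taken together, these assertions pin down the manifest's top-level shape. A sketch of artifact_manifest.json as the tests constrain it; the field set is inferred from the assertions above, the per-file checksum field is implied by the "SHA-256 mismatch" error string rather than shown directly, and real manifests may carry additional fields:

    {
      "doc_id": "…",
      "schema_version": 1,
      "parsed_document_schema_version": "…",
      "artifact_count": 4,
      "counts": {"chunks": 3},
      "files": [
        {"path": "parsed_document.json", "sha256": "…"},
        {"path": "document.md", "sha256": "…"}
      ]
    }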
tests/test_benchmark.py ADDED
@@ -0,0 +1,55 @@
+ import tempfile
+ import unittest
+ from pathlib import Path
+
+ from zsgdp.benchmarks.parser_quality import run_parser_benchmark
+ from zsgdp.cli import main
+
+
+ class BenchmarkTests(unittest.TestCase):
+     def test_run_parser_benchmark_writes_results(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             docs = tmp_path / "docs"
+             out = tmp_path / "bench"
+             docs.mkdir()
+             (docs / "one.md").write_text("# One\n\nHello world", encoding="utf-8")
+
+             summary = run_parser_benchmark(docs, out)
+
+             self.assertEqual(summary["document_count"], 1)
+             self.assertIn("fixed_token_baseline", summary["documents"][0]["chunk_strategy_counts"])
+             self.assertTrue(summary["chunk_strategy_leaderboard"])
+             self.assertIn("structure_quality", summary)
+             self.assertIn("chunking_quality", summary)
+             self.assertIn("throughput", summary)
+             self.assertIn("ablation_plan", summary)
+             self.assertTrue((out / "results.json").exists())
+             self.assertTrue((out / "leaderboard.csv").exists())
+             self.assertTrue((out / "parser_runs.csv").exists())
+             self.assertTrue((out / "chunk_runs.csv").exists())
+             self.assertTrue((out / "structure_runs.csv").exists())
+             self.assertTrue((out / "chunk_quality.csv").exists())
+             self.assertTrue((out / "throughput_runs.csv").exists())
+             self.assertTrue((out / "ablations.json").exists())
+
+     def test_benchmark_cli(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             docs = tmp_path / "docs"
+             out = tmp_path / "bench"
+             docs.mkdir()
+             (docs / "one.md").write_text("# One\n\nHello world", encoding="utf-8")
+
+             code = main(["benchmark", "--input", str(docs), "--output", str(out), "--parsers", "text"])
+
+             self.assertEqual(code, 0)
+             self.assertTrue((out / "leaderboard.csv").exists())
+             self.assertTrue((out / "chunk_runs.csv").exists())
+             self.assertTrue((out / "structure_runs.csv").exists())
+             self.assertTrue((out / "chunk_quality.csv").exists())
+             self.assertTrue((out / "throughput_runs.csv").exists())
+
+
+ if __name__ == "__main__":
+     unittest.main()
tests/test_chunking.py ADDED
@@ -0,0 +1,286 @@
+ import unittest
+
+ from zsgdp.chunking import build_agentic_chunks
+ from zsgdp.config import load_config
+ from zsgdp.schema import DocumentProfile, Element, FigureObject, PageProfile, ParsedDocument, QualityReport, TableObject
+ from zsgdp.verify import verify_chunks
+
+
+ class ChunkingTests(unittest.TestCase):
+     def test_agentic_chunking_builds_parent_child_chunks(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.md",
+             file_type="markdown",
+             page_count=1,
+             extension=".md",
+             pages=[PageProfile(page_num=1, digital_text_chars=120, digital_text_quality=1.0)],
+         )
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="sample.md",
+             file_type="markdown",
+             quality_report=QualityReport(score=0.95),
+         )
+         parsed.elements.extend(
+             [
+                 Element("e1", "d1", 1, "title", markdown="# Report", reading_order=1, source_parser="text"),
+                 Element("e2", "d1", 1, "paragraph", text=" ".join(["alpha"] * 80), reading_order=2, source_parser="text"),
+             ]
+         )
+
+         chunks = build_agentic_chunks(parsed, profile, load_config())
+
+         self.assertTrue(any(chunk.content_type == "parent" for chunk in chunks))
+         self.assertTrue(any(chunk.parent_chunk_id for chunk in chunks))
+         self.assertEqual(parsed.provenance["chunking"]["plan"]["target_tokens"], 512)
+
+     def test_chunk_readiness_adds_metrics(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.md",
+             file_type="markdown",
+             page_count=1,
+             extension=".md",
+             pages=[PageProfile(page_num=1, digital_text_chars=120, digital_text_quality=1.0)],
+         )
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="sample.md",
+             file_type="markdown",
+             quality_report=QualityReport(score=0.95),
+         )
+         parsed.elements.append(
+             Element("e1", "d1", 1, "paragraph", text=" ".join(["alpha"] * 80), reading_order=1, source_parser="text")
+         )
+         parsed.chunks = build_agentic_chunks(parsed, profile, load_config())
+
+         report = verify_chunks(parsed, load_config())
+
+         self.assertEqual(report.metrics["chunk_count"], len(parsed.chunks))
+         self.assertIn("fixed_token_baseline", report.metrics["chunk_strategy_counts"])
+         self.assertIn("recursive_structure", report.metrics["chunk_strategy_counts"])
+
+     def test_fixed_token_baseline_chunks_are_emitted_with_provenance(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.md",
+             file_type="markdown",
+             page_count=2,
+             extension=".md",
+             pages=[
+                 PageProfile(page_num=1, digital_text_chars=120, digital_text_quality=1.0),
+                 PageProfile(page_num=2, digital_text_chars=120, digital_text_quality=1.0),
+             ],
+         )
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="sample.md",
+             file_type="markdown",
+             quality_report=QualityReport(score=0.95),
+         )
+         parsed.elements.extend(
+             [
+                 Element("e1", "d1", 1, "paragraph", text=" ".join(["alpha"] * 18), reading_order=1, source_parser="text"),
+                 Element("e2", "d1", 2, "paragraph", text=" ".join(["beta"] * 18), reading_order=1, source_parser="text"),
+             ]
+         )
+         config = load_config(overrides={"chunking": {"target_tokens": 10, "overlap_ratio": 0.2}})
+
+         chunks = build_agentic_chunks(parsed, profile, config)
+         baseline_chunks = [chunk for chunk in chunks if chunk.strategy == "fixed_token_baseline"]
+
+         self.assertGreaterEqual(len(baseline_chunks), 4)
+         self.assertEqual(baseline_chunks[0].element_ids, ["e1"])
+         self.assertEqual(baseline_chunks[-1].page_end, 2)
+         self.assertEqual(parsed.provenance["chunking"]["fixed_token_baseline_count"], len(baseline_chunks))
+
+     def test_figure_without_caption_still_gets_visual_chunk(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[PageProfile(page_num=1, digital_text_chars=20, digital_text_quality=1.0)],
+         )
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             quality_report=QualityReport(score=0.90),
+         )
+         parsed.elements.append(Element("e1", "d1", 1, "paragraph", text="hello world", reading_order=1, source_parser="pymupdf"))
+         parsed.figures.append(
+             FigureObject(
+                 figure_id="f1",
+                 page_num=1,
+                 image_path="/tmp/figure.png",
+                 confidence=0.5,
+                 source_parser="pymupdf",
+             )
+         )
+
+         parsed.chunks = build_agentic_chunks(parsed, profile, load_config())
+         report = verify_chunks(parsed, load_config())
+
+         self.assertTrue(any(chunk.figure_ids == ["f1"] for chunk in parsed.chunks))
+         self.assertEqual(report.metrics["figure_chunk_coverage"], 1.0)
+
+     def test_table_chunk_keeps_multimodal_metadata(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[PageProfile(page_num=1, digital_text_chars=20, digital_text_quality=1.0)],
+         )
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             quality_report=QualityReport(score=0.90),
+         )
+         parsed.elements.append(Element("e1", "d1", 1, "paragraph", text="hello world", reading_order=1, source_parser="pymupdf"))
+         parsed.tables.append(
+             TableObject(
+                 table_id="t1",
+                 page_nums=[1],
+                 bbox=[(1.0, 2.0, 3.0, 4.0)],
+                 markdown="| A | B |\n| --- | --- |\n| 1 | 2 |",
+                 natural_language_rendering="Table with columns A, B. Rows: 1: B=2.",
+                 confidence=0.82,
+                 source_parser="pymupdf",
+                 provenance={"crop_path": "/tmp/table.png", "source_parsers": ["pymupdf", "docling"]},
+             )
+         )
+
+         parsed.chunks = build_agentic_chunks(parsed, profile, load_config())
+         table_chunk = next(chunk for chunk in parsed.chunks if chunk.strategy == "table_object")
+
+         self.assertEqual(table_chunk.text, "Table with columns A, B. Rows: 1: B=2.")
+         self.assertEqual(table_chunk.metadata["markdown"], "| A | B |\n| --- | --- |\n| 1 | 2 |")
+         self.assertEqual(table_chunk.metadata["bbox"], [(1.0, 2.0, 3.0, 4.0)])
+         self.assertEqual(table_chunk.metadata["crop_path"], "/tmp/table.png")
+         self.assertEqual(table_chunk.metadata["source_parsers"], ["pymupdf", "docling"])
+
+     def test_vision_guided_chunking_exports_visual_regions(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[PageProfile(page_num=1, digital_text_chars=20, digital_text_quality=1.0)],
+         )
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             quality_report=QualityReport(score=0.90),
+         )
+         parsed.elements.append(Element("e1", "d1", 1, "paragraph", text="hello world", reading_order=1, source_parser="pymupdf"))
+         parsed.tables.append(TableObject(table_id="t1", page_nums=[1], bbox=[(1.0, 2.0, 3.0, 4.0)], markdown="| A | B |\n| --- | --- |\n| 1 | 2 |"))
+         parsed.figures.append(FigureObject(figure_id="f1", page_num=1, bbox=(5.0, 6.0, 7.0, 8.0), source_parser="pymupdf"))
+         config = load_config(overrides={"chunking": {"vision_guided": True}})
+
+         parsed.chunks = build_agentic_chunks(parsed, profile, config)
+
+         visual_chunks = [chunk for chunk in parsed.chunks if chunk.content_type in {"table", "figure"}]
+         self.assertTrue(all(chunk.requires_visual_context for chunk in visual_chunks))
+         self.assertEqual(len(parsed.provenance["chunking"]["vision_regions"]), 2)
+         self.assertEqual(parsed.provenance["chunking"]["vision_regions"][0]["region_id"], "t1")
+
+     def test_advanced_chunking_flags_emit_strategy_chunks(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=2,
+             extension=".pdf",
+             pages=[
+                 PageProfile(page_num=1, digital_text_chars=200, digital_text_quality=1.0),
+                 PageProfile(page_num=2, digital_text_chars=200, digital_text_quality=1.0),
+             ],
+         )
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             quality_report=QualityReport(score=0.92),
+         )
+         parsed.elements.extend(
+             [
+                 Element("e1", "d1", 1, "heading", markdown="## Revenue", reading_order=1, source_parser="pymupdf"),
+                 Element(
+                     "e2",
+                     "d1",
+                     1,
+                     "paragraph",
+                     text="Revenue increased by 12 percent in Q1. Gross margin improved due to pricing.",
+                     reading_order=2,
+                     source_parser="pymupdf",
+                 ),
+                 Element("e3", "d1", 2, "heading", markdown="## Safety", reading_order=1, source_parser="pymupdf"),
+                 Element(
+                     "e4",
+                     "d1",
+                     2,
+                     "paragraph",
+                     text="Safety inspections found three unresolved risks. Corrective actions are due in June.",
+                     reading_order=2,
+                     source_parser="pymupdf",
+                 ),
+             ]
+         )
+         parsed.tables.append(
+             TableObject(
+                 table_id="t1",
+                 page_nums=[1],
+                 markdown="| Metric | Value |\n| --- | --- |\n| Revenue | 12% |",
+                 natural_language_rendering="Table t1 reports revenue growth of 12 percent.",
+                 source_parser="pymupdf",
+             )
+         )
+         parsed.figures.append(
+             FigureObject(
+                 figure_id="f1",
+                 page_num=2,
+                 caption="Risk trend chart shows open safety findings.",
+                 source_parser="pymupdf",
+             )
+         )
+         config = load_config(
+             overrides={
+                 "chunking": {
+                     "contextual_retrieval": True,
+                     "semantic_chunking": True,
+                     "late_chunking": True,
+                     "vision_guided": True,
+                     "agentic_proposition_chunking": True,
+                 }
+             }
+         )
+
+         parsed.chunks = build_agentic_chunks(parsed, profile, config)
+         strategies = {chunk.strategy for chunk in parsed.chunks}
+
+         self.assertIn("semantic", strategies)
+         self.assertIn("late", strategies)
+         self.assertIn("contextual_retrieval", strategies)
+         self.assertIn("vision_guided", strategies)
+         self.assertIn("agentic_proposition", strategies)
+         self.assertGreater(parsed.provenance["chunking"]["semantic_chunk_count"], 0)
+         self.assertGreater(parsed.provenance["chunking"]["late_chunk_count"], 0)
+         self.assertGreater(parsed.provenance["chunking"]["contextual_retrieval_chunk_count"], 0)
+         semantic_chunk = next(chunk for chunk in parsed.chunks if chunk.strategy == "semantic")
+         self.assertEqual(semantic_chunk.metadata["execution_mode"], "lexical_similarity_proxy")
+         contextual_chunk = next(chunk for chunk in parsed.chunks if chunk.strategy == "contextual_retrieval")
+         self.assertIn("source_chunk_id", contextual_chunk.metadata)
+         late_chunk = next(chunk for chunk in parsed.chunks if chunk.strategy == "late")
+         self.assertTrue(late_chunk.metadata["requires_token_level_embeddings"])
+
+
+ if __name__ == "__main__":
+     unittest.main()
tests/test_cli_help.py ADDED
@@ -0,0 +1,91 @@
+ """Tests guarding CLI help text — examples must render and stay clean."""
+
+ from __future__ import annotations
+
+ import io
+ import unittest
+ from contextlib import redirect_stdout
+
+ from zsgdp.cli import _epilog, main
+
+
+ def _capture_help(argv: list[str]) -> str:
+     """Run `zsgdp <argv> --help` and return captured stdout. SystemExit is normal."""
+
+     buffer = io.StringIO()
+     with redirect_stdout(buffer):
+         try:
+             main(argv + ["--help"])
+         except SystemExit:
+             pass
+     return buffer.getvalue()
+
+
+ class EpilogFormatterTests(unittest.TestCase):
+     def test_epilog_dedents_indented_source_string(self):
+         rendered = _epilog(
+             """
+             zsgdp parse --input ./a --output ./b
+             zsgdp parse --input ./c --output ./d
+             """
+         )
+         # No double-indentation; first non-blank line begins with two spaces only.
+         lines = rendered.splitlines()
+         self.assertEqual(lines[0], "Examples:")
+         self.assertTrue(lines[1].startswith("  zsgdp parse"))
+         # No source-indent leak.
+         self.assertNotIn("    zsgdp", rendered)
+
+     def test_epilog_preserves_blank_lines_as_separators(self):
+         rendered = _epilog(
+             """
+             line one
+
+             line two
+             """
+         )
+         self.assertIn("\n\n", rendered)
+
+
+ class SubcommandHelpTests(unittest.TestCase):
+     def test_top_level_help_lists_examples_section(self):
+         text = _capture_help([])
+         self.assertIn("Examples:", text)
+         self.assertIn("zsgdp parse", text)
+         self.assertIn("docs/space_smoke.md", text)
+
+     def test_parse_help_has_examples(self):
+         text = _capture_help(["parse"])
+         self.assertIn("Examples:", text)
+         self.assertIn("zsgdp parse --input", text)
+         self.assertIn("--config configs/docling.yaml", text)
+
+     def test_benchmark_help_covers_three_dataset_modes(self):
+         text = _capture_help(["benchmark"])
+         self.assertIn("Examples:", text)
+         self.assertIn("--dataset omnidocbench", text)
+         self.assertIn("--dataset doclaynet", text)
+
+     def test_benchmark_ablate_shows_merged_arm_pattern(self):
+         text = _capture_help(["benchmark-ablate"])
+         self.assertIn("--parser docling --parser pymupdf", text)
+         self.assertIn("--no-merged", text)
+
+     def test_run_gpu_tasks_documents_dry_run_vs_execute(self):
+         text = _capture_help(["run-gpu-tasks"])
+         self.assertIn("Dry-run", text)
+         self.assertIn("--execute", text)
+
+     def test_combine_benchmarks_shows_label_pairing(self):
+         text = _capture_help(["combine-benchmarks"])
+         self.assertIn("--label omnidocbench", text)
+         self.assertIn("--label doclaynet", text)
+
+     def test_preflight_help_documents_skip_flags(self):
+         text = _capture_help(["preflight"])
+         self.assertIn("--benchmark", text)
+         self.assertIn("--skip-unit", text)
+
+
+ if __name__ == "__main__":
+     unittest.main()
tests/test_conflict_detection.py ADDED
@@ -0,0 +1,89 @@
+ import tempfile
+ import unittest
+ from pathlib import Path
+
+ from zsgdp.export import export_parsed_document
+ from zsgdp.merge.conflict_detection import build_candidate_conflict_report, detect_candidate_conflicts
+ from zsgdp.merge.merge_candidates import merge_candidates
+ from zsgdp.schema import DocumentProfile, Element, PageProfile, ParseCandidate, TableObject
+
+
+ class ConflictDetectionTests(unittest.TestCase):
+     def test_conflict_report_flags_reading_order_and_table_structure(self):
+         candidates = [_candidate("docling", ["Alpha", "Beta", "Gamma"], 3), _candidate("pymupdf", ["Gamma", "Beta", "Alpha"], 2)]
+
+         report = build_candidate_conflict_report(candidates)
+         issues = detect_candidate_conflicts(candidates)
+
+         conflict_types = {conflict["type"] for conflict in report["conflicts"]}
+         self.assertIn("reading_order_disagreement", conflict_types)
+         self.assertIn("table_structure_disagreement", conflict_types)
+         self.assertTrue(issues)
+         self.assertTrue(all(issue.issue_type == "parser_disagreement" for issue in issues))
+
+     def test_merge_stores_and_exports_conflict_report(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[PageProfile(page_num=1, digital_text_chars=30)],
+         )
+         parsed = merge_candidates(
+             [_candidate("docling", ["Alpha", "Beta", "Gamma"], 3), _candidate("pymupdf", ["Gamma", "Beta", "Alpha"], 2)],
+             profile,
+         )
+
+         with tempfile.TemporaryDirectory() as tmp:
+             output_dir = Path(tmp) / "out"
+             export_parsed_document(parsed, output_dir)
+
+             self.assertTrue((output_dir / "conflict_report.json").exists())
+
+         self.assertIn("conflict_report", parsed.provenance)
+         self.assertGreater(parsed.provenance["conflict_report"]["conflict_count"], 0)
+
+
+ def _candidate(parser_name: str, ordered_text: list[str], table_columns: int) -> ParseCandidate:
+     elements = [
+         Element(
+             element_id=f"{parser_name}_e{index}",
+             doc_id="d1",
+             page_num=1,
+             type="paragraph",
+             text=text,
+             reading_order=index,
+             confidence=0.8,
+             source_parser=parser_name,
+         )
+         for index, text in enumerate(ordered_text, start=1)
+     ]
+     return ParseCandidate(
+         parser_name=parser_name,
+         doc_id="d1",
+         source_path="sample.pdf",
+         file_type="pdf",
+         pages=[{"page_num": 1, "source_parser": parser_name}],
+         elements=elements,
+         tables=[
+             TableObject(
+                 table_id=f"{parser_name}_t1",
+                 page_nums=[1],
+                 markdown=_table_markdown(table_columns),
+                 confidence=0.8,
+                 source_parser=parser_name,
+             )
+         ],
+         confidence=0.8,
+     )
+
+
+ def _table_markdown(columns: int) -> str:
+     if columns == 3:
+         return "| Region | Q1 | Q2 |\n| --- | --- | --- |\n| NA | 10 | 12 |"
+     return "| Region | Q1 |\n| --- | --- |\n| NA | 10 |"
+
+
+ if __name__ == "__main__":
+     unittest.main()
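A conflict_report.json sketch consistent with these assertions; only the fields the tests actually touch are shown, and the entries are illustrative (real reports likely carry more detail per conflict):

    {
      "conflict_count": 2,
      "conflicts": [
        {"type": "reading_order_disagreement"},
        {"type": "table_structure_disagreement"}
      ]
    }

The merge test also shows the same report object being embedded under provenance["conflict_report"], so downstream callers can gate on conflict_count without re-reading the exported file.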
tests/test_cross_dataset.py ADDED
@@ -0,0 +1,123 @@
+ """Tests for cross-dataset benchmark comparison."""
+
+ from __future__ import annotations
+
+ import json
+ import tempfile
+ import unittest
+ from pathlib import Path
+
+ from zsgdp.benchmarks.cross_dataset import (
+     combine_benchmark_summaries,
+     write_cross_dataset_outputs,
+ )
+
+
+ def _summary(dataset_name: str, *, layout_f1: float, leaderboard: list[dict] | None = None) -> dict:
+     return {
+         "dataset_name": dataset_name,
+         "dataset_root": f"/tmp/{dataset_name}",
+         "document_count": 5,
+         "mean_quality_score": 0.9,
+         "mean_layout_f1": layout_f1,
+         "mean_retrieval_recall_at_5": 0.7,
+         "mean_table_structure_score": 0.6,
+         "mean_formula_cer": 0.2,
+         "per_parser_gt_leaderboard": leaderboard or [],
+     }
+
+
+ class TestCombineBenchmarkSummaries(unittest.TestCase):
+     def test_two_runs_produce_two_rows(self):
+         runs = [
+             ("docs_a", _summary("docs_a", layout_f1=0.5)),
+             ("docs_b", _summary("docs_b", layout_f1=0.8)),
+         ]
+         comparison = combine_benchmark_summaries(runs)
+         self.assertEqual(comparison["run_count"], 2)
+         self.assertEqual(comparison["labels"], ["docs_a", "docs_b"])
+         self.assertEqual([row["label"] for row in comparison["dataset_summary"]], ["docs_a", "docs_b"])
+         layouts = {row["label"]: row["mean_layout_f1"] for row in comparison["dataset_summary"]}
+         self.assertEqual(layouts, {"docs_a": 0.5, "docs_b": 0.8})
+
+     def test_parser_matrix_aligns_parsers_across_runs(self):
+         leaderboard_a = [
+             {"parser": "docling", "mean_layout_class_aware_f1": 0.9, "document_count": 3},
+             {"parser": "pymupdf", "mean_layout_class_aware_f1": 0.4, "document_count": 3},
+         ]
+         leaderboard_b = [
+             {"parser": "docling", "mean_layout_class_aware_f1": 0.7, "document_count": 5},
+             # marker only appears in run B.
+             {"parser": "marker", "mean_layout_class_aware_f1": 0.6, "document_count": 5},
+         ]
+         runs = [
+             ("a", _summary("a", layout_f1=0.5, leaderboard=leaderboard_a)),
+             ("b", _summary("b", layout_f1=0.7, leaderboard=leaderboard_b)),
+         ]
+         comparison = combine_benchmark_summaries(runs)
+
+         matrix = comparison["parser_matrix"]
+         parsers = sorted(row["parser"] for row in matrix)
+         self.assertEqual(parsers, ["docling", "marker", "pymupdf"])
+
+         by_parser = {row["parser"]: row for row in matrix}
+         # Docling appears in both runs.
+         self.assertEqual(by_parser["docling"]["a__mean_layout_class_aware_f1"], 0.9)
+         self.assertEqual(by_parser["docling"]["b__mean_layout_class_aware_f1"], 0.7)
+         # Marker missing in run A -> None, present in B.
+         self.assertIsNone(by_parser["marker"]["a__mean_layout_class_aware_f1"])
+         self.assertEqual(by_parser["marker"]["b__mean_layout_class_aware_f1"], 0.6)
+         # PyMuPDF missing in run B -> None.
+         self.assertIsNone(by_parser["pymupdf"]["b__mean_layout_class_aware_f1"])
+
+     def test_duplicate_labels_raise(self):
+         with self.assertRaises(ValueError):
+             combine_benchmark_summaries(
+                 [
+                     ("same", _summary("a", layout_f1=0.5)),
+                     ("same", _summary("b", layout_f1=0.7)),
+                 ]
+             )
+
+     def test_summary_loaded_from_path(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             (tmp / "results.json").write_text(json.dumps(_summary("from_path", layout_f1=0.42)))
+
+             comparison = combine_benchmark_summaries([("a", tmp)])
+             self.assertEqual(comparison["dataset_summary"][0]["mean_layout_f1"], 0.42)
+
+     def test_missing_metric_yields_none_not_zero(self):
+         # A summary missing mean_formula_cer (e.g. from an older run) is reported as None.
+         sparse_summary = {"dataset_name": "old_run", "document_count": 1}
+         comparison = combine_benchmark_summaries([("old", sparse_summary)])
+         row = comparison["dataset_summary"][0]
+         self.assertEqual(row["document_count"], 1)
+         self.assertIsNone(row["mean_layout_f1"])
+         self.assertIsNone(row["mean_formula_cer"])
+
+
+ class TestWriteCrossDatasetOutputs(unittest.TestCase):
+     def test_writes_json_and_csvs(self):
+         leaderboard = [{"parser": "docling", "mean_layout_class_aware_f1": 0.9, "document_count": 3}]
+         comparison = combine_benchmark_summaries(
+             [("a", _summary("a", layout_f1=0.5, leaderboard=leaderboard))]
+         )
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             write_cross_dataset_outputs(comparison, tmp)
+
+             self.assertTrue((tmp / "cross_dataset_comparison.json").exists())
+             self.assertTrue((tmp / "dataset_summary.csv").exists())
+             self.assertTrue((tmp / "parser_matrix.csv").exists())
+
+             ds_csv = (tmp / "dataset_summary.csv").read_text()
+             self.assertIn("mean_layout_f1", ds_csv.splitlines()[0])
+             self.assertIn("a", ds_csv.splitlines()[1])
+
+             matrix_csv = (tmp / "parser_matrix.csv").read_text()
+             self.assertIn("a__mean_layout_class_aware_f1", matrix_csv.splitlines()[0])
+
+
+ if __name__ == "__main__":
+     unittest.main()
tests/test_datasets.py ADDED
@@ -0,0 +1,152 @@
+ """Dataset loader tests."""
+
+ from __future__ import annotations
+
+ import json
+ import tempfile
+ import unittest
+ from pathlib import Path
+
+ from zsgdp.benchmarks.datasets import (
+     DatasetDocument,
+     get_dataset_loader,
+     iter_dataset,
+     list_dataset_loaders,
+     register_dataset_loader,
+ )
+
+
+ class TestDatasetRegistry(unittest.TestCase):
+     def test_built_in_loaders_registered(self):
+         loaders = list_dataset_loaders()
+         self.assertIn("custom_folder", loaders)
+         self.assertIn("omnidocbench", loaders)
+         self.assertIn("doclaynet", loaders)
+
+     def test_custom_alias_resolves_to_custom_folder(self):
+         loader_default = get_dataset_loader("default")
+         loader_alias = get_dataset_loader("custom")
+         loader_canonical = get_dataset_loader("custom_folder")
+         self.assertIs(loader_default, loader_canonical)
+         self.assertIs(loader_alias, loader_canonical)
+
+     def test_unknown_loader_raises(self):
+         with self.assertRaises(KeyError):
+             get_dataset_loader("not_a_real_dataset")
+
+
+ class TestCustomFolderLoader(unittest.TestCase):
+     def test_yields_files_with_no_ground_truth(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             root = Path(tmp)
+             (root / "a.md").write_text("# A\n")
+             (root / "b.md").write_text("# B\n")
+             (root / "subdir").mkdir()
+             (root / "subdir" / "ignored.md").write_text("# nope\n")
+
+             documents = list(iter_dataset("custom_folder", root))
+
+             ids = sorted(document.doc_id for document in documents)
+             self.assertEqual(ids, ["a", "b"])
+             for document in documents:
+                 self.assertIsNone(document.ground_truth)
+                 self.assertEqual(document.dataset_id, "custom_folder")
+                 self.assertTrue(document.path.exists())
+
+     def test_missing_root_raises(self):
+         with self.assertRaises(FileNotFoundError):
+             list(iter_dataset("custom_folder", "/tmp/this-path-should-not-exist-zsgdp"))
+
+
+ class TestOmniDocBenchLoader(unittest.TestCase):
+     def test_pairs_pdf_with_sibling_json(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             root = Path(tmp)
+             (root / "doc1.pdf").write_bytes(b"%PDF-1.4\n%%EOF\n")
+             (root / "doc1.json").write_text(json.dumps({"reading_order": ["e1", "e2"]}))
+             (root / "doc2.pdf").write_bytes(b"%PDF-1.4\n%%EOF\n")  # no GT
+
+             documents = list(iter_dataset("omnidocbench", root))
+
+             by_id = {document.doc_id: document for document in documents}
+             self.assertEqual(set(by_id), {"doc1", "doc2"})
+
+             self.assertIsNotNone(by_id["doc1"].ground_truth)
+             self.assertEqual(by_id["doc1"].ground_truth["reading_order"], ["e1", "e2"])
+             self.assertTrue(by_id["doc1"].metadata["has_ground_truth"])
+
+             self.assertIsNone(by_id["doc2"].ground_truth)
+             self.assertFalse(by_id["doc2"].metadata["has_ground_truth"])
+
+     def test_no_pdfs_raises(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             with self.assertRaises(FileNotFoundError):
+                 list(iter_dataset("omnidocbench", tmp))
+
+
+ class TestDocLayNetLoader(unittest.TestCase):
+     def test_yields_one_document_per_image_with_filtered_annotations(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             root = Path(tmp)
+             (root / "page1.png").write_bytes(b"\x89PNG\r\n\x1a\n")
+             (root / "page2.png").write_bytes(b"\x89PNG\r\n\x1a\n")
+             (root / "annotations.json").write_text(
+                 json.dumps(
+                     {
+                         "images": [
+                             {"id": 1, "file_name": "page1.png", "width": 800, "height": 1100},
+                             {"id": 2, "file_name": "page2.png", "width": 800, "height": 1100},
+                         ],
+                         "annotations": [
+                             {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 100, 50]},
+                             {"id": 11, "image_id": 1, "category_id": 2, "bbox": [0, 60, 100, 50]},
+                             {"id": 12, "image_id": 2, "category_id": 1, "bbox": [0, 0, 100, 50]},
+                         ],
+                         "categories": [
+                             {"id": 1, "name": "Title"},
+                             {"id": 2, "name": "Text"},
+                         ],
+                     }
+                 )
+             )
+
+             documents = list(iter_dataset("doclaynet", root))
+
+             by_id = {document.doc_id: document for document in documents}
+             self.assertEqual(set(by_id), {"page1.png", "page2.png"})
+
+             self.assertEqual(len(by_id["page1.png"].ground_truth["annotations"]), 2)
+             self.assertEqual(len(by_id["page2.png"].ground_truth["annotations"]), 1)
+             self.assertEqual(by_id["page1.png"].ground_truth["categories"][1]["name"], "Title")
+
+     def test_missing_annotations_raises(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             root = Path(tmp)
+             (root / "page1.png").write_bytes(b"\x89PNG\r\n\x1a\n")
+             with self.assertRaises(FileNotFoundError):
+                 list(iter_dataset("doclaynet", root))
+
+
+ class TestRegisterDatasetLoader(unittest.TestCase):
+     def test_register_and_use_custom_loader(self):
+         marker = []
+
+         def fake_loader(root: Path):
+             marker.append(root)
+             yield DatasetDocument(dataset_id="fake", doc_id="x", path=root)
+
+         register_dataset_loader("zsgdp_test_fake", fake_loader)
+         try:
+             documents = list(iter_dataset("zsgdp_test_fake", Path("/tmp/whatever")))
+         finally:
+             from zsgdp.benchmarks.datasets import _LOADERS
+
+             _LOADERS.pop("zsgdp_test_fake", None)
+
+         self.assertEqual(len(documents), 1)
+         self.assertEqual(documents[0].dataset_id, "fake")
+         self.assertEqual(marker, [Path("/tmp/whatever")])
+
+
+ if __name__ == "__main__":
+     unittest.main()
tests/test_deployment.py ADDED
@@ -0,0 +1,43 @@
+ import json
+ import tempfile
+ import unittest
+ from pathlib import Path
+
+ from zsgdp.cli import main
+ from zsgdp.deployment import check_huggingface_space
+
+
+ class DeploymentReadinessTests(unittest.TestCase):
+     def test_space_check_accepts_current_project(self):
+         report = check_huggingface_space(Path.cwd())
+
+         self.assertTrue(report["valid"])
+         self.assertEqual(report["target"], "huggingface_spaces")
+         self.assertEqual(report["space_name"], "zeroshotGPU")
+         self.assertEqual(report["gpu_models_target"], "zeroshotGPU")
+         self.assertEqual(report["failure_count"], 0)
+         self.assertTrue(any(check["status"] == "warn" for check in report["checks"]))
+
+     def test_space_check_cli_writes_report(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             output_path = Path(tmp) / "space_report.json"
+
+             code = main(["space-check", "--root", str(Path.cwd()), "--output", str(output_path)])
+
+             self.assertEqual(code, 0)
+             self.assertTrue(output_path.exists())
+             self.assertTrue(json.loads(output_path.read_text(encoding="utf-8"))["valid"])
+
+     def test_space_check_reports_missing_files(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             root = Path(tmp)
+
+             report = check_huggingface_space(root)
+
+             self.assertFalse(report["valid"])
+             self.assertGreater(report["failure_count"], 0)
+             self.assertTrue(any(check["id"] == "required_file" and check["status"] == "fail" for check in report["checks"]))
+
+
+ if __name__ == "__main__":
+     unittest.main()
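A space-check report sketch matching the fields these tests assert on; values are illustrative, and real reports include more check entries than shown:

    {
      "valid": true,
      "target": "huggingface_spaces",
      "space_name": "zeroshotGPU",
      "gpu_models_target": "zeroshotGPU",
      "failure_count": 0,
      "checks": [
        {"id": "required_file", "status": "pass"},
        {"id": "…", "status": "warn"}
      ]
    }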
tests/test_docling_parser.py ADDED
@@ -0,0 +1,39 @@
+ import unittest
+
+ from zsgdp.parsers.docling_parser import _export_markdown, normalize_docling_markdown
+ from zsgdp.schema import DocumentProfile, PageProfile
+
+
+ class FakeDoclingDocument:
+     def export_to_markdown(self):
+         return "# Report\n\n| A | B |\n| --- | --- |\n| 1 | 2 |"
+
+
+ class DoclingParserTests(unittest.TestCase):
+     def test_export_markdown_uses_docling_method(self):
+         self.assertEqual(_export_markdown(FakeDoclingDocument()), "# Report\n\n| A | B |\n| --- | --- |\n| 1 | 2 |")
+
+     def test_normalize_docling_markdown_emits_schema(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[PageProfile(page_num=1, digital_text_chars=20)],
+         )
+
+         candidate = normalize_docling_markdown(
+             markdown="# Report\n\n| A | B |\n| --- | --- |\n| 1 | 2 |",
+             profile=profile,
+             source_path="sample.pdf",
+         )
+
+         self.assertEqual(candidate.parser_name, "docling")
+         self.assertEqual(len(candidate.elements), 2)
+         self.assertEqual(len(candidate.tables), 1)
+         self.assertEqual(candidate.pages[0]["source_parser"], "docling")
+
+
+ if __name__ == "__main__":
+     unittest.main()
tests/test_embedding_retriever.py ADDED
@@ -0,0 +1,190 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Tests for the embedding-based retriever and the build_retriever factory."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import math
6
+ import tempfile
7
+ import unittest
8
+ from pathlib import Path
9
+ from unittest.mock import patch
+
+ from zsgdp.benchmarks.embedding_retriever import (
+     EmbeddingRetriever,
+     build_retriever,
+ )
+ from zsgdp.benchmarks.parser_quality import run_parser_benchmark
+ from zsgdp.benchmarks.retrieval import LexicalRetriever, run_retrieval_for_document
+ from zsgdp.schema import Chunk, ParsedDocument, QualityReport
+
+
+ def _chunk(chunk_id: str, text: str) -> Chunk:
+     return Chunk(
+         chunk_id=chunk_id,
+         doc_id="d1",
+         page_start=1,
+         page_end=1,
+         section_path=[],
+         content_type="prose",
+         text=text,
+         token_count=len(text.split()),
+     )
+
+
+ def _hashing_embedder(dim: int = 32):
+     """Deterministic toy embedder: tokens hashed into a fixed-dim vector.
+
+     Uses a process-stable hash (hashlib.md5) instead of builtins.hash(), which
+     is randomized per Python process and would make ranking non-deterministic
+     across test runs.
+     """
+
+     import hashlib
+
+     def stable_hash(token: str) -> int:
+         return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
+
+     def encode(texts):
+         out = []
+         for text in texts:
+             vector = [0.0] * dim
+             for token in text.lower().split():
+                 vector[stable_hash(token) % dim] += 1.0
+             out.append(vector)
+         return out
+
+     return encode
+
+
+ class TestEmbeddingRetriever(unittest.TestCase):
+     def test_finds_distinctive_chunk_with_injected_embedder(self):
+         chunks = [
+             _chunk("c1", "Apples grow on trees in the orchard."),
+             _chunk("c2", "Cars drive on highways across the country."),
+             _chunk("c3", "Boats sail on rivers and oceans."),
+         ]
+         retriever = EmbeddingRetriever(embedder=_hashing_embedder())
+         retriever.index(chunks)
+
+         ranking = retriever.query("apples orchard", top_k=3)
+         self.assertEqual(ranking[0], "c1")
+
+     def test_empty_index_returns_empty(self):
+         retriever = EmbeddingRetriever(embedder=_hashing_embedder())
+         self.assertEqual(retriever.query("anything", top_k=3), [])
+
+     def test_zero_norm_vector_skipped(self):
+         retriever = EmbeddingRetriever(embedder=lambda texts: [[0.0, 0.0, 0.0]] * len(texts))
+         retriever.index([_chunk("c1", "anything")])
+         # Query embedder also returns zero vector, normalization fails -> empty.
+         self.assertEqual(retriever.query("anything", top_k=3), [])
+
+     def test_embedder_returning_wrong_count_raises(self):
+         bad = lambda texts: [[1.0]]  # always returns one vector
+         retriever = EmbeddingRetriever(embedder=bad)
+         with self.assertRaises(RuntimeError):
+             retriever.index([_chunk("c1", "a"), _chunk("c2", "b")])
+
+     def test_lazy_load_path_raises_if_sentence_transformers_missing(self):
+         retriever = EmbeddingRetriever(model_id="fake/model")
+         # Force the import to fail by patching builtins.__import__.
+         import builtins
+
+         real_import = builtins.__import__
+
+         def fake_import(name, *args, **kwargs):
+             if name == "sentence_transformers":
+                 raise ImportError("not installed")
+             return real_import(name, *args, **kwargs)
+
+         with patch("builtins.__import__", side_effect=fake_import):
+             with self.assertRaises(RuntimeError) as ctx:
+                 retriever.index([_chunk("c1", "anything")])
+         self.assertIn("sentence-transformers", str(ctx.exception))
+
+
+ class TestBuildRetriever(unittest.TestCase):
+     def test_default_returns_lexical(self):
+         retriever = build_retriever({})
+         self.assertIsInstance(retriever, LexicalRetriever)
+
+     def test_explicit_lexical_backend(self):
+         retriever = build_retriever({"benchmarks": {"retriever": {"backend": "lexical"}}})
+         self.assertIsInstance(retriever, LexicalRetriever)
+
+     def test_embedding_backend_uses_gpu_models_embedding_default(self):
+         config = {
+             "benchmarks": {"retriever": {"backend": "embedding"}},
+             "gpu": {"models": {"embedding": {"model_id": "custom/model", "task": "retrieval.query"}}},
+         }
+         retriever = build_retriever(config)
+         self.assertIsInstance(retriever, EmbeddingRetriever)
+         self.assertEqual(retriever._model_id, "custom/model")
+         self.assertEqual(retriever._task, "retrieval.query")
+
+     def test_explicit_model_id_overrides_gpu_default(self):
+         config = {
+             "benchmarks": {"retriever": {"backend": "embedding", "model_id": "explicit/model"}},
+             "gpu": {"models": {"embedding": {"model_id": "ignored/model"}}},
+         }
+         retriever = build_retriever(config)
+         self.assertEqual(retriever._model_id, "explicit/model")
+
+     def test_unknown_backend_raises(self):
+         with self.assertRaises(ValueError):
+             build_retriever({"benchmarks": {"retriever": {"backend": "magic"}}})
+
+
+ class TestRunRetrievalWithEmbedding(unittest.TestCase):
+     def test_run_retrieval_for_document_accepts_embedding_retriever(self):
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="/tmp/d1.md",
+             file_type="markdown",
+             chunks=[
+                 _chunk("c1", "Apples grow on trees in the orchard during autumn."),
+                 _chunk("c2", "Submarines navigate beneath the ocean using sonar."),
+             ],
+             quality_report=QualityReport(),
+         )
+         retriever = EmbeddingRetriever(embedder=_hashing_embedder())
+         run = run_retrieval_for_document(parsed, retriever=retriever)
+         self.assertTrue(run["evaluated"])
+         self.assertGreater(run["query_count"], 0)
+         for result in run["results"]:
+             truth = result["truths"][0]
+             self.assertEqual(result["retrieved"][0], truth)
+
+
+ class TestBenchmarkOptInToEmbeddingBackend(unittest.TestCase):
+     def test_benchmark_uses_embedding_when_config_says_so(self):
+         # Patch build_retriever to return an EmbeddingRetriever with our toy embedder
+         # so the benchmark exercises the opt-in code path without loading a real model.
+         toy = EmbeddingRetriever(embedder=_hashing_embedder())
+
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp = Path(tmp)
+             src = tmp / "in"
+             src.mkdir()
+             (src / "doc.md").write_text(
+                 "# Doc\n\n"
+                 "Apples grow on trees in the orchard during autumn season.\n\n"
+                 "Submarines navigate beneath the ocean using sonar pulses across waters.\n",
+                 encoding="utf-8",
+             )
+
+             with patch("zsgdp.benchmarks.parser_quality.load_config") as load_config:
+                 load_config.return_value = {
+                     "benchmarks": {"retriever": {"backend": "embedding"}},
+                 }
+                 with patch(
+                     "zsgdp.benchmarks.embedding_retriever.build_retriever",
+                     return_value=toy,
+                 ) as build_call:
+                     summary = run_parser_benchmark(src, tmp / "out", dataset_name="custom_folder")
+
+             self.assertGreaterEqual(build_call.call_count, 1)
+             self.assertTrue(summary["documents"][0]["retrieval_evaluated"])
+
+
+ if __name__ == "__main__":
+     unittest.main()
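
The hashing embedder doubles as a spec for what the retriever must do with whatever vectors come back: normalize, rank by cosine similarity, skip zero-norm vectors, and return chunk ids. A self-contained sketch of that contract, with illustrative names rather than the actual zsgdp internals:

```python
import math


def rank_by_cosine(query_vec, indexed, top_k=3):
    """Rank (chunk_id, vector) pairs by cosine similarity to query_vec.

    Zero-norm vectors are skipped on both sides, matching the behaviour
    the tests above assert for degenerate embeddings.
    """
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    q_norm = norm(query_vec)
    if q_norm == 0.0:
        return []  # query embedding carries no signal
    scored = []
    for chunk_id, vec in indexed:
        v_norm = norm(vec)
        if v_norm == 0.0:
            continue  # skip degenerate chunk embeddings
        cosine = sum(a * b for a, b in zip(query_vec, vec)) / (q_norm * v_norm)
        scored.append((cosine, chunk_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]
```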
tests/test_env_loading.py ADDED
@@ -0,0 +1,110 @@
+ """Tests for .env loading and HF_TOKEN resolution."""
+
+ from __future__ import annotations
+
+ import os
+ import tempfile
+ import unittest
+ from pathlib import Path
+ from unittest.mock import patch
+
+ from zsgdp.config import hf_token, load_env_file
+
+
+ class LoadEnvFileTests(unittest.TestCase):
+     def test_loads_simple_key_value(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             env = Path(tmp) / ".env"
+             env.write_text("HF_TOKEN=hf_test_value_123\nOTHER=foo\n", encoding="utf-8")
+
+             with patch.dict("os.environ", {}, clear=False):
+                 os.environ.pop("HF_TOKEN", None)
+                 os.environ.pop("OTHER", None)
+                 applied = load_env_file(env)
+
+             self.assertEqual(applied["HF_TOKEN"], "hf_test_value_123")
+             self.assertEqual(applied["OTHER"], "foo")
+
+     def test_skips_comments_and_blank_lines(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             env = Path(tmp) / ".env"
+             env.write_text(
+                 "# top comment\n\nFOO=bar\n # indented\n\nBAZ=qux\n",
+                 encoding="utf-8",
+             )
+             with patch.dict("os.environ", {}, clear=False):
+                 os.environ.pop("FOO", None)
+                 os.environ.pop("BAZ", None)
+                 applied = load_env_file(env)
+
+             self.assertEqual(set(applied), {"FOO", "BAZ"})
+
+     def test_quoted_values_unquoted(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             env = Path(tmp) / ".env"
+             env.write_text('A="quoted value"\nB=\'single\'\nC=plain\n', encoding="utf-8")
+             with patch.dict("os.environ", {}, clear=False):
+                 for key in ("A", "B", "C"):
+                     os.environ.pop(key, None)
+                 applied = load_env_file(env)
+
+             self.assertEqual(applied["A"], "quoted value")
+             self.assertEqual(applied["B"], "single")
+             self.assertEqual(applied["C"], "plain")
+
+     def test_export_prefix_stripped(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             env = Path(tmp) / ".env"
+             env.write_text("export FOO=bar\n", encoding="utf-8")
+             with patch.dict("os.environ", {}, clear=False):
+                 os.environ.pop("FOO", None)
+                 applied = load_env_file(env)
+
+             self.assertEqual(applied["FOO"], "bar")
+
+     def test_existing_env_wins_unless_override(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             env = Path(tmp) / ".env"
+             env.write_text("FOO=from_file\n", encoding="utf-8")
+
+             with patch.dict("os.environ", {"FOO": "from_env"}, clear=False):
+                 applied = load_env_file(env)
+                 # Default behaviour: don't override.
+                 self.assertNotIn("FOO", applied)
+                 self.assertEqual(os.environ["FOO"], "from_env")
+
+                 # With override=True, file wins.
+                 applied = load_env_file(env, override=True)
+                 self.assertEqual(applied["FOO"], "from_file")
+                 self.assertEqual(os.environ["FOO"], "from_file")
+
+     def test_missing_file_returns_empty_no_error(self):
+         self.assertEqual(load_env_file(Path("/tmp/zsgdp_does_not_exist.env")), {})
+
+
+ class HFTokenResolverTests(unittest.TestCase):
+     def test_picks_up_hf_token(self):
+         with patch.dict(
+             "os.environ",
+             {"HF_TOKEN": "primary", "HUGGING_FACE_HUB_TOKEN": "secondary"},
+             clear=False,
+         ):
+             self.assertEqual(hf_token(), "primary")
+
+     def test_falls_through_alternative_names(self):
+         with patch.dict("os.environ", {}, clear=True):
+             os.environ["HUGGINGFACE_TOKEN"] = "fallback"
+             self.assertEqual(hf_token(), "fallback")
+
+     def test_recognises_hf_access_token_alias(self):
+         with patch.dict("os.environ", {}, clear=True):
+             os.environ["HF_ACCESS_TOKEN"] = "from_alias"
+             self.assertEqual(hf_token(), "from_alias")
+
+     def test_returns_none_when_unset(self):
+         with patch.dict("os.environ", {}, clear=True):
+             self.assertIsNone(hf_token())
+
+
+ if __name__ == "__main__":
+     unittest.main()
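
Collectively these tests specify the .env parser tightly: comments and blank lines skipped, `export` prefixes tolerated, matching quotes stripped, existing environment winning unless `override=True`, and a missing file returning an empty mapping. A compliant sketch (the real `zsgdp.config.load_env_file` may differ in detail):

```python
import os
from pathlib import Path


def load_env_file_sketch(path: Path, override: bool = False) -> dict:
    """Apply KEY=VALUE pairs from a .env file; return what was applied."""
    applied = {}
    if not path.exists():
        return applied  # missing file is not an error
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments (even indented) are skipped
        if line.startswith("export "):
            line = line[len("export "):]  # shell-style prefix is tolerated
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]  # strip matching quotes
        if key in os.environ and not override:
            continue  # existing environment wins by default
        os.environ[key] = value
        applied[key] = value
    return applied
```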
tests/test_external_parser_adapters.py ADDED
@@ -0,0 +1,69 @@
+ import unittest
+ from unittest.mock import patch
+
+ from zsgdp.config import load_config
+ from zsgdp.normalize.normalize_unstructured import normalize_unstructured_parts
+ from zsgdp.parsers.external import MinerUParser, OlmOCRParser, PaddleOCRParser
+ from zsgdp.schema import DocumentProfile, PageProfile
+
+
+ class ExternalParserAdapterTests(unittest.TestCase):
+     def test_command_backed_parsers_normalize_markdown(self):
+         cases = [
+             (MinerUParser, "mineru"),
+             (OlmOCRParser, "olmocr"),
+             (PaddleOCRParser, "paddleocr"),
+         ]
+         profile = _profile()
+
+         for parser_class, parser_name in cases:
+             with self.subTest(parser=parser_name), patch.object(parser_class, "available", return_value=True), patch(
+                 "zsgdp.parsers.external.run_external_parser_to_markdown",
+                 return_value="# Report\n\n| A | B |\n| --- | --- |\n| 1 | 2 |",
+             ):
+                 candidate = parser_class().parse("sample.pdf", profile, load_config())
+
+             self.assertEqual(candidate.parser_name, parser_name)
+             self.assertEqual(candidate.elements[0].source_parser, parser_name)
+             self.assertEqual(len(candidate.tables), 1)
+             self.assertEqual(candidate.provenance["requested_pages"], [1])
+
+     def test_unstructured_normalizer_preserves_page_and_title_metadata(self):
+         class Metadata:
+             page_number = 2
+
+         class Title:
+             category = "Title"
+             metadata = Metadata()
+
+             def __str__(self):
+                 return "Executive Summary"
+
+         class Narrative:
+             category = "NarrativeText"
+             metadata = Metadata()
+
+             def __str__(self):
+                 return "The document parser keeps provenance."
+
+         candidate = normalize_unstructured_parts(parts=[Title(), Narrative()], profile=_profile(), source_path="sample.pdf")
+
+         self.assertEqual(candidate.parser_name, "unstructured")
+         self.assertEqual(candidate.elements[0].page_num, 2)
+         self.assertEqual(candidate.elements[0].type, "title")
+         self.assertEqual(candidate.elements[0].markdown, "# Executive Summary")
+
+
+ def _profile():
+     return DocumentProfile(
+         doc_id="d1",
+         source_path="sample.pdf",
+         file_type="pdf",
+         page_count=1,
+         extension=".pdf",
+         pages=[PageProfile(page_num=1, digital_text_chars=20)],
+     )
+
+
+ if __name__ == "__main__":
+     unittest.main()
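
The nested Title/Narrative classes spell out the duck-typed surface the normalizer consumes: a `category` string, a `metadata.page_number`, and `str(part)` for the text. A minimal sketch of that mapping (illustrative helper, not the zsgdp API, which builds full ParseCandidate objects):

```python
def part_to_markdown(part):
    """Map one unstructured-style part to (page_num, markdown).

    Only the duck-typed attributes exercised above are consulted.
    """
    page_num = getattr(part.metadata, "page_number", 1)
    text = str(part)
    if part.category == "Title":
        return page_num, f"# {text}"  # titles become headings
    return page_num, text  # narrative text passes through
```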
tests/test_gpu_runner.py ADDED
@@ -0,0 +1,185 @@
+ import json
+ import tempfile
+ import unittest
+ from pathlib import Path
+ from unittest.mock import patch
+
+ from zsgdp.cli import main
+ from zsgdp.config import load_config
+ from zsgdp.gpu.batching import batch_gpu_tasks
+ from zsgdp.gpu.runner import dry_run_gpu_tasks, load_gpu_tasks, run_gpu_task_manifest
+ from zsgdp.gpu.worker import GPUWorker
+ from zsgdp.utils import write_jsonl
+
+
+ class GPURunnerTests(unittest.TestCase):
+     def test_batch_gpu_tasks_groups_by_task_type_and_batch_size(self):
+         tasks = [
+             {"task_id": "a", "task_type": "figure_description", "priority": 1},
+             {"task_id": "b", "task_type": "figure_description", "priority": 2},
+             {"task_id": "c", "task_type": "table_vlm_repair", "priority": 3},
+         ]
+
+         batches = batch_gpu_tasks(tasks, max_batch_size=1)
+
+         self.assertEqual(len(batches), 3)
+         self.assertEqual(batches[0]["task_count"], 1)
+         self.assertEqual({batch["task_type"] for batch in batches}, {"figure_description", "table_vlm_repair"})
+
+     def test_worker_reports_missing_image_path(self):
+         worker = GPUWorker(load_config())
+
+         result = worker.run(
+             {
+                 "task_id": "gt1",
+                 "task_type": "figure_description",
+                 "doc_id": "d1",
+                 "page_nums": [1],
+                 "image_path": "/tmp/does-not-exist.png",
+             }
+         )
+
+         self.assertEqual(result["status"], "blocked_missing_inputs")
+         self.assertIn("image_path", result["readiness"]["missing_inputs"])
+
+     def test_run_gpu_task_manifest_writes_report(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             image_path = tmp_path / "figure.png"
+             image_path.write_bytes(b"fake")
+             tasks_path = tmp_path / "gpu_tasks.jsonl"
+             report_path = tmp_path / "report.json"
+             write_jsonl(
+                 tasks_path,
+                 [
+                     {
+                         "task_id": "gt1",
+                         "task_type": "figure_description",
+                         "doc_id": "d1",
+                         "page_nums": [1],
+                         "image_path": str(image_path),
+                         "priority": 60,
+                     }
+                 ],
+             )
+
+             report = run_gpu_task_manifest(tmp_path, config=load_config(), output_path=report_path)
+
+             self.assertEqual(report["task_count"], 1)
+             self.assertEqual(report["ready_count"], 1)
+             self.assertTrue(report_path.exists())
+             self.assertEqual(json.loads(report_path.read_text(encoding="utf-8"))["batch_count"], 1)
+
+     def test_dry_run_gpu_tasks_accepts_in_memory_tasks(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             image_path = Path(tmp) / "figure.png"
+             image_path.write_bytes(b"fake")
+
+             report = dry_run_gpu_tasks(
+                 [
+                     {
+                         "task_id": "gt1",
+                         "task_type": "figure_description",
+                         "doc_id": "d1",
+                         "page_nums": [1],
+                         "image_path": str(image_path),
+                         "priority": 60,
+                     }
+                 ],
+                 config=load_config(),
+             )
+
+             self.assertEqual(report["ready_count"], 1)
+             self.assertEqual(report["blocked_count"], 0)
+
+     def test_execute_gpu_tasks_dispatches_transformers_client(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             image_path = Path(tmp) / "figure.png"
+             image_path.write_bytes(b"fake")
+             task = {
+                 "task_id": "gt1",
+                 "task_type": "figure_description",
+                 "doc_id": "d1",
+                 "page_nums": [1],
+                 "image_path": str(image_path),
+                 "priority": 60,
+                 "backend": "transformers",
+                 "model_role": "vlm",
+                 "model_id": "local-test-model",
+             }
+
+             with patch("zsgdp.gpu.worker.TransformersClient") as client_class:
+                 client_class.return_value.execute_task.return_value = {"status": "executed", "text": "Figure description."}
+                 report = dry_run_gpu_tasks([task], config=load_config(), dry_run=False)
+
+             self.assertFalse(report["dry_run"])
+             self.assertEqual(report["executed_count"], 1)
+             self.assertEqual(report["failed_count"], 0)
+             self.assertEqual(report["batches"][0]["status"], "execute_complete")
+             client_class.return_value.execute_task.assert_called_once()
+
+     def test_load_gpu_tasks_accepts_file_path(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tasks_path = Path(tmp) / "tasks.jsonl"
+             write_jsonl(tasks_path, [{"task_id": "gt1"}])
+
+             self.assertEqual(load_gpu_tasks(tasks_path)[0]["task_id"], "gt1")
+
+     def test_run_gpu_tasks_cli(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             tasks_path = tmp_path / "gpu_tasks.jsonl"
+             report_path = tmp_path / "report.json"
+             write_jsonl(
+                 tasks_path,
+                 [
+                     {
+                         "task_id": "gt1",
+                         "task_type": "figure_description",
+                         "doc_id": "d1",
+                         "page_nums": [1],
+                         "image_path": str(tmp_path / "missing.png"),
+                         "priority": 60,
+                     }
+                 ],
+             )
+
+             code = main(["run-gpu-tasks", "--input", str(tasks_path), "--output", str(report_path)])
+
+             self.assertEqual(code, 0)
+             self.assertTrue(report_path.exists())
+
+     def test_run_gpu_tasks_cli_execute(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             tmp_path = Path(tmp)
+             image_path = tmp_path / "figure.png"
+             image_path.write_bytes(b"fake")
+             tasks_path = tmp_path / "gpu_tasks.jsonl"
+             report_path = tmp_path / "report.json"
+             write_jsonl(
+                 tasks_path,
+                 [
+                     {
+                         "task_id": "gt1",
+                         "task_type": "figure_description",
+                         "doc_id": "d1",
+                         "page_nums": [1],
+                         "image_path": str(image_path),
+                         "priority": 60,
+                         "backend": "transformers",
+                         "model_role": "vlm",
+                         "model_id": "local-test-model",
+                     }
+                 ],
+             )
+
+             with patch("zsgdp.gpu.worker.TransformersClient") as client_class:
+                 client_class.return_value.execute_task.return_value = {"status": "executed", "text": "done"}
+                 code = main(["run-gpu-tasks", "--input", str(tasks_path), "--output", str(report_path), "--execute"])
+
+             self.assertEqual(code, 0)
+             self.assertEqual(json.loads(report_path.read_text(encoding="utf-8"))["executed_count"], 1)
+
+
+ if __name__ == "__main__":
+     unittest.main()
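
The first test pins down the batching contract: group by task_type, split each group at max_batch_size, and report a task_count per batch. A simplified stand-in for `zsgdp.gpu.batching.batch_gpu_tasks` (the priority ordering within a group is an assumption here):

```python
from collections import defaultdict


def batch_tasks(tasks, max_batch_size):
    """Group tasks by task_type, then split each group into capped batches."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task["task_type"]].append(task)
    batches = []
    for task_type, group in groups.items():
        group.sort(key=lambda t: t.get("priority", 0), reverse=True)  # assumed ordering
        for start in range(0, len(group), max_batch_size):
            chunk = group[start:start + max_batch_size]
            batches.append({"task_type": task_type, "task_count": len(chunk), "tasks": chunk})
    return batches
```

With the three tasks above and `max_batch_size=1`, this yields three single-task batches spanning both task types, matching the assertions.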
tests/test_gpu_runtime.py ADDED
@@ -0,0 +1,47 @@
+ import unittest
+ from unittest.mock import patch
+
+ from zsgdp.config import load_config
+ from zsgdp.gpu import GPUModelConfig, collect_gpu_runtime_status
+
+
+ class GPURuntimeTests(unittest.TestCase):
+     def test_model_config_reads_gpu_section(self):
+         config = load_config(overrides={"gpu": {"backend": "vllm", "provider": "huggingface_spaces", "space_name": "zeroshotGPU", "max_batch_size": 8}})
+
+         model_config = GPUModelConfig.from_config(config)
+
+         self.assertEqual(model_config.backend, "vllm")
+         self.assertEqual(model_config.provider, "huggingface_spaces")
+         self.assertEqual(model_config.space_name, "zeroshotGPU")
+         self.assertEqual(model_config.max_batch_size, 8)
+
+     def test_collect_runtime_detects_space_environment(self):
+         config = load_config()
+
+         with patch.dict("os.environ", {"SPACE_ID": "user/zeroshotGPU", "SPACE_HARDWARE": "l4x1"}, clear=False):
+             status = collect_gpu_runtime_status(config).to_dict()
+
+         self.assertEqual(status["provider"], "huggingface_spaces")
+         self.assertEqual(status["space_name"], "zeroshotGPU")
+         self.assertEqual(status["gpu_models_target"], "zeroshotGPU")
+         self.assertTrue(status["running_on_huggingface_space"])
+         self.assertEqual(status["space_id"], "user/zeroshotGPU")
+         self.assertEqual(status["hardware"], "l4x1")
+         self.assertIn(status["device"], {"cpu", "cuda", "mps"})
+         self.assertIn("torch_available", status)
+         self.assertEqual(status["configured_models"]["vlm"]["model_id"], "Qwen/Qwen2.5-VL-3B-Instruct")
+         self.assertEqual(status["configured_models"]["embedding"]["model_id"], "jinaai/jina-embeddings-v3")
+
+     def test_collect_runtime_reports_local_note(self):
+         config = load_config()
+
+         with patch.dict("os.environ", {"SPACE_ID": "", "SPACE_HOST": "", "SPACE_HARDWARE": ""}, clear=False):
+             status = collect_gpu_runtime_status(config)
+
+         self.assertFalse(status.running_on_huggingface_space)
+         self.assertTrue(any("local run" in note for note in status.notes))
+
+
+ if __name__ == "__main__":
+     unittest.main()
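
Space detection here is driven entirely by environment variables Hugging Face sets inside a running Space (SPACE_ID, SPACE_HARDWARE). A sketch of the inference the assertions imply (the real status collector also folds in the configured space_name and torch device probing):

```python
import os


def detect_space():
    """Infer Space context from env vars set inside a running HF Space."""
    space_id = os.environ.get("SPACE_ID", "")
    return {
        "running_on_huggingface_space": bool(space_id),
        "space_id": space_id or None,
        "hardware": os.environ.get("SPACE_HARDWARE") or None,
        # "user/zeroshotGPU" -> "zeroshotGPU"
        "space_name": space_id.rsplit("/", 1)[-1] if space_id else None,
    }
```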
tests/test_gpu_tasks.py ADDED
@@ -0,0 +1,99 @@
+ import unittest
+
+ from zsgdp.config import load_config
+ from zsgdp.gpu import plan_gpu_tasks
+ from zsgdp.routing import RouteDecision
+ from zsgdp.routing.budgets import Budget
+ from zsgdp.schema import DocumentProfile, FigureObject, PageProfile, ParsedDocument, TableObject
+
+
+ class GPUTaskTests(unittest.TestCase):
+     def test_plan_gpu_tasks_includes_route_ocr_table_and_figure(self):
+         config = load_config(overrides={"chunking": {"vision_guided": True}})
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[
+                 PageProfile(page_num=1, scanned_score=0.8, digital_text_chars=0, digital_text_quality=0.0),
+             ],
+         )
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             pages=[
+                 {
+                     "page_num": 1,
+                     "parser_pages": [
+                         {"rendered_page": {"image_path": "/tmp/page.png"}},
+                     ],
+                 }
+             ],
+         )
+         parsed.tables.append(
+             TableObject(
+                 table_id="t1",
+                 page_nums=[1],
+                 bbox=[(1.0, 2.0, 3.0, 4.0)],
+                 markdown="| A | B |\n| --- | --- |\n| 1 | 2 |",
+                 provenance={"crop_path": "/tmp/table.png"},
+             )
+         )
+         parsed.figures.append(FigureObject(figure_id="f1", page_num=1, image_path="/tmp/figure.png"))
+         routes = [
+             RouteDecision(
+                 page_id=1,
+                 experts=["pymupdf", "vlm_figure_repair"],
+                 reason="figure-heavy page",
+                 budget=Budget(),
+                 labels=["figure_heavy"],
+             )
+         ]
+
+         tasks = plan_gpu_tasks(profile, parsed, config, routes)
+
+         task_types = [task["task_type"] for task in tasks]
+         self.assertIn("vlm_route_repair", task_types)
+         self.assertIn("ocr_page", task_types)
+         self.assertIn("table_vlm_repair", task_types)
+         self.assertIn("figure_description", task_types)
+         self.assertEqual(tasks[0]["task_type"], "vlm_route_repair")
+         self.assertTrue(all(task["provider"] == "huggingface_spaces" for task in tasks))
+         self.assertTrue(all(task["space_name"] == "zeroshotGPU" for task in tasks))
+         self.assertTrue(all(task["model_id"] for task in tasks))
+         self.assertEqual(_task_by_type(tasks, "ocr_page")["model_role"], "ocr")
+         self.assertEqual(_task_by_type(tasks, "table_vlm_repair")["model_role"], "table")
+         self.assertEqual(_task_by_type(tasks, "figure_description")["model_role"], "vlm")
+         self.assertEqual(_task_by_type(tasks, "figure_description")["model_id"], "Qwen/Qwen2.5-VL-3B-Instruct")
+
+     def test_plan_gpu_tasks_respects_max_vlm_calls(self):
+         config = load_config(overrides={"gpu": {"max_vlm_calls_per_doc": 1}, "chunking": {"vision_guided": True}})
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[PageProfile(page_num=1, scanned_score=0.8)],
+         )
+         parsed = ParsedDocument(doc_id="d1", source_path="sample.pdf", file_type="pdf")
+         parsed.figures.append(FigureObject(figure_id="f1", page_num=1, image_path="/tmp/figure.png"))
+
+         tasks = plan_gpu_tasks(profile, parsed, config)
+
+         self.assertEqual(len(tasks), 1)
+         self.assertEqual(tasks[0]["task_type"], "ocr_page")
+
+
+ def _task_by_type(tasks, task_type):
+     for task in tasks:
+         if task["task_type"] == task_type:
+             return task
+     raise AssertionError(f"Missing task type: {task_type}")
+
+
+ if __name__ == "__main__":
+     unittest.main()
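
The budget test shows discretionary VLM work being dropped once max_vlm_calls_per_doc is spent, while OCR work survives. One plausible way to enforce such a cap, offered as a guess at the mechanism rather than the planner's actual code (which also assigns providers, model roles, and model ids):

```python
def cap_vlm_tasks(tasks, max_vlm_calls):
    """Keep at most max_vlm_calls tasks whose model_role is 'vlm'."""
    kept, vlm_used = [], 0
    for task in tasks:  # assumed already sorted by descending priority
        if task.get("model_role") == "vlm":
            if vlm_used >= max_vlm_calls:
                continue  # budget spent: drop discretionary VLM work
            vlm_used += 1
        kept.append(task)
    return kept
```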
tests/test_layout_f1.py ADDED
@@ -0,0 +1,190 @@
+ """Tests for layout F1 metric and ground-truth adapters."""
+
+ from __future__ import annotations
+
+ import unittest
+
+ from zsgdp.benchmarks.ground_truth import (
+     canonical_category,
+     doclaynet_layout_truths,
+     omnidocbench_layout_truths,
+     parsed_layout_predictions,
+ )
+ from zsgdp.schema import Element, FigureObject, ParsedDocument, QualityReport, TableObject
+ from zsgdp.verify.layout_f1 import compute_layout_f1
+
+
+ def _item(bbox, category="paragraph", page_num=1):
+     return {"bbox": bbox, "category": category, "page_num": page_num}
+
+
+ class TestComputeLayoutF1(unittest.TestCase):
+     def test_perfect_match_yields_f1_1(self):
+         predictions = [_item((0, 0, 100, 50)), _item((0, 60, 100, 110), "table")]
+         truths = [_item((0, 0, 100, 50)), _item((0, 60, 100, 110), "table")]
+         result = compute_layout_f1(predictions, truths)
+         self.assertEqual(result["class_aware"]["f1"], 1.0)
+         self.assertEqual(result["class_agnostic"]["f1"], 1.0)
+         self.assertEqual(result["class_aware"]["tp"], 2)
+
+     def test_zero_match_yields_f1_0(self):
+         predictions = [_item((0, 0, 50, 50))]
+         truths = [_item((1000, 1000, 1100, 1100))]
+         result = compute_layout_f1(predictions, truths)
+         self.assertEqual(result["class_aware"]["f1"], 0.0)
+         self.assertEqual(result["class_aware"]["fp"], 1)
+         self.assertEqual(result["class_aware"]["fn"], 1)
+
+     def test_iou_below_threshold_misses(self):
+         # 50% overlap by area in one axis only -> IoU < 0.5
+         predictions = [_item((0, 0, 100, 100))]
+         truths = [_item((60, 0, 160, 100))]
+         result = compute_layout_f1(predictions, truths, iou_threshold=0.5)
+         self.assertEqual(result["class_aware"]["tp"], 0)
+
+     def test_class_aware_vs_agnostic(self):
+         # Same bbox, different category -> agnostic matches, aware doesn't.
+         predictions = [_item((0, 0, 100, 100), "paragraph")]
+         truths = [_item((0, 0, 100, 100), "title")]
+         result = compute_layout_f1(predictions, truths)
+         self.assertEqual(result["class_aware"]["tp"], 0)
+         self.assertEqual(result["class_agnostic"]["tp"], 1)
+
+     def test_per_category_breakdown(self):
+         predictions = [_item((0, 0, 100, 100), "title"), _item((0, 200, 100, 300), "table")]
+         truths = [_item((0, 0, 100, 100), "title")]
+         result = compute_layout_f1(predictions, truths)
+         per_category = result["per_category"]
+         self.assertEqual(per_category["title"]["tp"], 1)
+         self.assertEqual(per_category["table"]["fp"], 1)
+
+     def test_empty_inputs_are_vacuously_correct(self):
+         self.assertEqual(compute_layout_f1([], [])["class_aware"]["f1"], 1.0)
+
+     def test_predictions_only_yields_zero(self):
+         result = compute_layout_f1([_item((0, 0, 10, 10))], [])
+         self.assertEqual(result["class_aware"]["fp"], 1)
+         self.assertEqual(result["class_aware"]["f1"], 0.0)
+
+     def test_page_num_must_match(self):
+         predictions = [_item((0, 0, 100, 100), "table", page_num=1)]
+         truths = [_item((0, 0, 100, 100), "table", page_num=2)]
+         result = compute_layout_f1(predictions, truths)
+         self.assertEqual(result["class_aware"]["tp"], 0)
+
+
+ class TestDocLayNetAdapter(unittest.TestCase):
+     def test_xywh_converted_and_categories_normalized(self):
+         ground_truth = {
+             "image": {"id": 5, "file_name": "p.png", "page_no": 5},
+             "annotations": [
+                 {"image_id": 5, "category_id": 1, "bbox": [10, 20, 50, 60]},
+                 {"image_id": 5, "category_id": 2, "bbox": [100, 0, 40, 30]},
+             ],
+             "categories": {1: {"name": "Title"}, 2: {"name": "Section-header"}},
+         }
+         truths = doclaynet_layout_truths(ground_truth)
+         self.assertEqual(len(truths), 2)
+         self.assertEqual(truths[0]["bbox"], (10.0, 20.0, 60.0, 80.0))
+         self.assertEqual(truths[0]["category"], "title")
+         self.assertEqual(truths[0]["page_num"], 5)
+         self.assertEqual(truths[1]["category"], "heading")
+
+     def test_invalid_annotations_dropped(self):
+         ground_truth = {
+             "image": {"id": 1, "file_name": "p.png"},
+             "annotations": [
+                 {"image_id": 1, "category_id": 1, "bbox": [0, 0, 0, 0]},
+                 {"image_id": 1, "category_id": 1},
+             ],
+             "categories": {1: {"name": "Text"}},
+         }
+         self.assertEqual(doclaynet_layout_truths(ground_truth), [])
+
+
+ class TestOmniDocBenchAdapter(unittest.TestCase):
+     def test_picks_layout_dets_first(self):
+         ground_truth = {
+             "layout_dets": [
+                 {"bbox": [0, 0, 100, 50], "category": "title", "page_num": 1},
+                 {"bbox": [0, 100, 100, 150], "category": "Table", "page": 1},
+             ]
+         }
+         truths = omnidocbench_layout_truths(ground_truth)
+         self.assertEqual(len(truths), 2)
+         self.assertEqual(truths[0]["category"], "title")
+         self.assertEqual(truths[1]["category"], "table")
+
+     def test_pages_nested_records(self):
+         ground_truth = {
+             "pages": [
+                 {"page_num": 1, "elements": [{"bbox": [0, 0, 10, 10], "category": "paragraph"}]},
+                 {"page_num": 2, "elements": [{"bbox": [0, 0, 10, 10], "category": "table"}]},
+             ]
+         }
+         truths = omnidocbench_layout_truths(ground_truth)
+         self.assertEqual(len(truths), 2)
+         self.assertEqual(truths[0]["page_num"], 1)
+         self.assertEqual(truths[1]["page_num"], 2)
+
+     def test_unknown_shape_returns_empty(self):
+         self.assertEqual(omnidocbench_layout_truths({"weird": "shape"}), [])
+         self.assertEqual(omnidocbench_layout_truths(None), [])
+
+
+ class TestParsedPredictions(unittest.TestCase):
+     def test_extracts_bboxes_from_elements_tables_figures(self):
+         parsed = ParsedDocument(
+             doc_id="d1",
+             source_path="/tmp/d1.pdf",
+             file_type="pdf",
+             elements=[
+                 Element(
+                     element_id="e1",
+                     doc_id="d1",
+                     page_num=1,
+                     type="title",
+                     text="Title",
+                     bbox=(0.0, 0.0, 100.0, 30.0),
+                 ),
+                 Element(
+                     element_id="e2",
+                     doc_id="d1",
+                     page_num=1,
+                     type="paragraph",
+                     text="No bbox",
+                 ),
+             ],
+             tables=[
+                 TableObject(
+                     table_id="t1",
+                     page_nums=[1],
+                     bbox=[(0.0, 100.0, 200.0, 200.0)],
+                 )
+             ],
+             figures=[
+                 FigureObject(
+                     figure_id="f1",
+                     page_num=2,
+                     bbox=(50.0, 50.0, 150.0, 250.0),
+                 )
+             ],
+             quality_report=QualityReport(),
+         )
+         predictions = parsed_layout_predictions(parsed)
+         categories = sorted(prediction["category"] for prediction in predictions)
+         self.assertEqual(categories, ["figure", "table", "title"])
+         self.assertEqual(len(predictions), 3)
+
+
+ class TestCanonicalCategory(unittest.TestCase):
+     def test_canonical_mapping(self):
+         self.assertEqual(canonical_category("Picture"), "figure")
+         self.assertEqual(canonical_category("Section-header"), "heading")
+         self.assertEqual(canonical_category("Page-footer"), "footer")
+         self.assertEqual(canonical_category("formula"), "formula")
+         self.assertEqual(canonical_category("Mystery"), "mystery")
+
+
+ if __name__ == "__main__":
+     unittest.main()
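
The matching rule these tests encode is standard detection scoring: a prediction matches a truth only when pages agree, IoU clears the threshold, and (in the class-aware variant) categories agree; F1 then follows from TP/FP/FN, with the empty-vs-empty case defined as vacuously perfect. A self-contained sketch of that computation:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layout_f1_sketch(predictions, truths, iou_threshold=0.5, class_aware=True):
    """Greedy one-to-one matching, then precision/recall/F1 from TP/FP/FN."""
    unmatched = list(truths)
    tp = 0
    for pred in predictions:
        for truth in unmatched:
            if pred["page_num"] != truth["page_num"]:
                continue  # boxes on different pages can never match
            if class_aware and pred["category"] != truth["category"]:
                continue
            if iou(pred["bbox"], truth["bbox"]) >= iou_threshold:
                unmatched.remove(truth)
                tp += 1
                break
    fp, fn = len(predictions) - tp, len(unmatched)
    if tp == fp == fn == 0:
        return {"tp": 0, "fp": 0, "fn": 0, "f1": 1.0}  # vacuously correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "f1": f1}
```

For the sub-threshold case above, boxes (0, 0, 100, 100) and (60, 0, 160, 100) intersect in 4000 of 16000 union units, so IoU = 0.25 < 0.5 and no match is recorded.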
tests/test_logging.py ADDED
@@ -0,0 +1,125 @@
+ """Tests for the logging configuration and structured log emission."""
+
+ from __future__ import annotations
+
+ import io
+ import json
+ import logging
+ import tempfile
+ import unittest
+ from pathlib import Path
+ from unittest.mock import patch
+
+ from zsgdp.logging_config import configure_logging, get_logger
+ from zsgdp.pipeline import parse_document
+
+
+ class ConfigureLoggingTests(unittest.TestCase):
+     def setUp(self) -> None:
+         # Reset between tests so each one configures cleanly.
+         root = logging.getLogger("zsgdp")
+         for handler in list(root.handlers):
+             root.removeHandler(handler)
+
+     def test_idempotent_configuration(self):
+         stream = io.StringIO()
+         configure_logging(level="INFO", json_format=False, stream=stream)
+         configure_logging(level="INFO", json_format=False, stream=stream)
+         root = logging.getLogger("zsgdp")
+         # Idempotent: still exactly one handler attached.
+         self.assertEqual(len(root.handlers), 1)
+
+     def test_text_format_emits_human_readable(self):
+         stream = io.StringIO()
+         configure_logging(level="INFO", json_format=False, stream=stream)
+         get_logger("zsgdp.test").info("hello", extra={"doc_id": "d1"})
+         output = stream.getvalue()
+         self.assertIn("INFO", output)
+         self.assertIn("zsgdp.test", output)
+         self.assertIn("hello", output)
+
+     def test_json_format_emits_one_line_records(self):
+         stream = io.StringIO()
+         configure_logging(level="INFO", json_format=True, stream=stream)
+         get_logger("zsgdp.test").info("event", extra={"doc_id": "abc", "count": 3})
+         output = stream.getvalue().strip()
+         record = json.loads(output)
+         self.assertEqual(record["level"], "INFO")
+         self.assertEqual(record["logger"], "zsgdp.test")
+         self.assertEqual(record["message"], "event")
+         self.assertEqual(record["doc_id"], "abc")
+         self.assertEqual(record["count"], 3)
+
+     def test_default_level_is_warning(self):
+         stream = io.StringIO()
+         with patch.dict("os.environ", {"ZSGDP_LOG_LEVEL": "", "ZSGDP_LOG_JSON": ""}, clear=False):
+             configure_logging(stream=stream)
+         get_logger("zsgdp.test").info("hidden_at_default_level")
+         self.assertNotIn("hidden_at_default_level", stream.getvalue())
+         get_logger("zsgdp.test").warning("visible_at_default_level")
+         self.assertIn("visible_at_default_level", stream.getvalue())
+
+     def test_get_logger_namespacing(self):
+         self.assertEqual(get_logger().name, "zsgdp")
+         self.assertEqual(get_logger("zsgdp.foo").name, "zsgdp.foo")
+         # Bare names get prefixed.
+         self.assertEqual(get_logger("foo").name, "zsgdp.foo")
+
+
+ class PipelineLogEmissionTests(unittest.TestCase):
+     def test_parse_emits_start_and_end_records(self):
+         # Reset handlers so assertLogs works against the named logger.
+         root = logging.getLogger("zsgdp")
+         for handler in list(root.handlers):
+             root.removeHandler(handler)
+         root.setLevel(logging.DEBUG)
+         root.propagate = True
+
+         with tempfile.TemporaryDirectory() as tmp:
+             input_path = Path(tmp) / "doc.md"
+             input_path.write_text("# Doc\n\nHello.\n", encoding="utf-8")
+
+             with self.assertLogs("zsgdp.pipeline", level="INFO") as captured:
+                 parse_document(input_path, Path(tmp) / "out")
+
+         messages = [record.message for record in captured.records]
+         self.assertIn("parse_start", messages)
+         self.assertIn("parse_end", messages)
+
+         # Find a parse_end record and assert structured fields are present.
+         parse_end = next(record for record in captured.records if record.message == "parse_end")
+         self.assertTrue(hasattr(parse_end, "doc_id"))
+         self.assertTrue(hasattr(parse_end, "elapsed_seconds"))
+         self.assertTrue(hasattr(parse_end, "quality_score"))
+         self.assertTrue(hasattr(parse_end, "element_count"))
+
+
+ class RepairLogEmissionTests(unittest.TestCase):
+     def test_repair_emits_iteration_record(self):
+         root = logging.getLogger("zsgdp")
+         for handler in list(root.handlers):
+             root.removeHandler(handler)
+         root.setLevel(logging.DEBUG)
+         root.propagate = True
+
+         with tempfile.TemporaryDirectory() as tmp:
+             input_path = Path(tmp) / "report.md"
+             # Malformed table forces a repair iteration.
+             input_path.write_text(
+                 "# Report\n\n| A | B |\n| --- | --- |\n| 1 | 2 | 3 |\n",
+                 encoding="utf-8",
+             )
+
+             with self.assertLogs("zsgdp.repair.controller", level="INFO") as captured:
+                 parse_document(input_path, Path(tmp) / "out")
+
+         repair_records = [r for r in captured.records if r.message == "repair_iteration"]
+         self.assertGreaterEqual(len(repair_records), 1)
+         # Each record carries the iteration index.
+         for record in repair_records:
+             self.assertTrue(hasattr(record, "iteration"))
+             self.assertTrue(hasattr(record, "status"))
+
+
+ if __name__ == "__main__":
+     unittest.main()
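
The JSON test fixes the record shape: one JSON object per line carrying level, logger, and message, plus anything passed via `extra=`. A minimal formatter with that shape (the real zsgdp.logging_config presumably adds more, such as timestamps; the class name here is illustrative):

```python
import json
import logging

# Attributes present on every stock LogRecord; anything else came in via extra=.
_STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))


class JsonLineFormatter(logging.Formatter):
    """Emit one JSON object per record, folding in extra= fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in _STANDARD and key not in payload:
                payload[key] = value  # fields supplied via extra={...}
        return json.dumps(payload)
```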
tests/test_markdown_normalizer.py ADDED
@@ -0,0 +1,63 @@
+ import unittest
+
+ from zsgdp.normalize.markdown import markdown_to_blocks, normalize_markdown_candidate, normalize_markdown_table
+
+
+ class MarkdownNormalizerTests(unittest.TestCase):
+     def test_markdown_to_blocks_preserves_pages_tables_and_images(self):
+         markdown = """# Report
+
+ Intro paragraph.
+
+ | Region | Q1 |
+ | --- | --- |
+ | NA | 10 |
+
+ <!-- page:2 -->
+
+ ## Figure Section
+
+ ![Chart caption](chart.png)
+ """
+
+         candidate = normalize_markdown_candidate(
+             markdown=markdown,
+             doc_id="d1",
+             source_path="sample.md",
+             file_type="markdown",
+             parser_name="test",
+         )
+
+         self.assertEqual([page["page_num"] for page in candidate.pages], [1, 2])
+         self.assertEqual(len(candidate.tables), 1)
+         self.assertEqual(candidate.tables[0].page_nums, [1])
+         self.assertEqual(len(candidate.figures), 1)
+         self.assertEqual(candidate.figures[0].page_num, 2)
+         self.assertEqual(candidate.figures[0].image_path, "chart.png")
+
+     def test_normalize_markdown_table_repairs_separator(self):
+         table = "| A | B |\n| --- | --- |\n| 1 | 2 |"
+
+         self.assertEqual(normalize_markdown_table(table), "| A | B |\n| --- | --- |\n| 1 | 2 |")
+
+     def test_normalize_plain_aligned_table(self):
+         table = "Region  Q1  Q2\nNorth America  10  12\nEurope  8  7"
+
+         self.assertEqual(
+             normalize_markdown_table(table),
+             "| Region | Q1 | Q2 |\n| --- | --- | --- |\n| North America | 10 | 12 |\n| Europe | 8 | 7 |",
+         )
+
+     def test_markdown_to_blocks_detects_plain_aligned_table(self):
+         blocks = markdown_to_blocks("# Report\n\nRegion  Q1  Q2\nNorth America  10  12\nEurope  8  7")
+
+         self.assertEqual(blocks[1].block_type, "table")
+
+     def test_markdown_to_blocks_classifies_caption(self):
+         blocks = markdown_to_blocks("Figure 1 Revenue trend")
+
+         self.assertEqual(blocks[0].block_type, "caption")
+
+
+ if __name__ == "__main__":
+     unittest.main()
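
Note the aligned-table fixtures use runs of two or more spaces as column separators (single spaces were collapsed by the diff viewer); that is what lets "North America" survive as one cell. The repair then amounts to splitting each row on those runs and emitting pipe rows with a separator after the header. A sketch under that assumption (the real normalizer also has to decide the block is a table at all):

```python
import re


def aligned_table_to_markdown(text: str) -> str:
    """Convert a whitespace-aligned plain table into a markdown pipe table."""
    rows = [re.split(r"\s{2,}", line.strip()) for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    out = ["| " + " | ".join(header) + " |"]
    out.append("| " + " | ".join(["---"] * len(header)) + " |")
    out.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(out)
```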
tests/test_marker_parser.py ADDED
@@ -0,0 +1,73 @@
+ import tempfile
+ import unittest
+ from pathlib import Path
+ from unittest.mock import patch
+
+ from zsgdp.config import load_config
+ from zsgdp.parsers.external import MarkerParser, _read_external_markdown, _read_marker_markdown, normalize_marker_markdown
+ from zsgdp.schema import DocumentProfile, PageProfile
+
+
+ class MarkerParserTests(unittest.TestCase):
+     def test_normalize_marker_markdown_emits_common_schema(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[PageProfile(page_num=1, digital_text_chars=20)],
+         )
+
+         candidate = normalize_marker_markdown(
+             markdown="# Report\n\n| A | B |\n| --- | --- |\n| 1 | 2 |\n\n![Chart](chart.png)",
+             profile=profile,
+             source_path="sample.pdf",
+         )
+
+         self.assertEqual(candidate.parser_name, "marker")
+         self.assertEqual(len(candidate.tables), 1)
+         self.assertEqual(len(candidate.figures), 1)
+         self.assertEqual(candidate.pages[0]["source_parser"], "marker")
+
+     def test_marker_parser_runs_markdown_through_normalizer(self):
+         profile = DocumentProfile(
+             doc_id="d1",
+             source_path="sample.pdf",
+             file_type="pdf",
+             page_count=1,
+             extension=".pdf",
+             pages=[PageProfile(page_num=1, digital_text_chars=20)],
+         )
+
+         with patch.object(MarkerParser, "available", return_value=True), patch(
+             "zsgdp.parsers.external.run_marker_to_markdown",
+             return_value="# Report\n\nBody.",
+         ):
+             candidate = MarkerParser().parse("sample.pdf", profile, load_config())
+
+         self.assertEqual(candidate.parser_name, "marker")
+         self.assertEqual(candidate.elements[0].source_parser, "marker")
+         self.assertEqual(candidate.provenance["requested_pages"], [1])
+
+     def test_read_marker_markdown_prefers_markdown_file(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             root = Path(tmp)
+             nested = root / "sample"
+             nested.mkdir()
+             (nested / "other.md").write_text("# Other", encoding="utf-8")
+             (nested / "markdown.md").write_text("# Preferred", encoding="utf-8")
+
+             markdown = _read_marker_markdown(root)
+
+             self.assertEqual(markdown, "# Preferred")
+
+     def test_read_external_markdown_falls_back_to_stdout(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             markdown = _read_external_markdown(Path(tmp), parser_name="mineru", stdout="# From stdout")
+
+             self.assertEqual(markdown, "# From stdout")
+
+
+ if __name__ == "__main__":
+     unittest.main()
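
The two `_read_*` tests define the output-resolution order for command-backed parsers: prefer a markdown.md anywhere under the tool's output tree, fall back to any other .md file, and finally to captured stdout. A sketch of that chain with a simplified signature (the real helpers take a parser_name and may order candidates differently):

```python
from pathlib import Path


def read_tool_markdown(output_dir: Path, stdout: str = "") -> str:
    """Resolve external-tool output: markdown.md, then any *.md, then stdout."""
    candidates = sorted(output_dir.rglob("*.md"))
    preferred = [path for path in candidates if path.name == "markdown.md"]
    if preferred:
        return preferred[0].read_text(encoding="utf-8")
    if candidates:
        return candidates[0].read_text(encoding="utf-8")
    return stdout  # nothing written to disk: fall back to captured stdout
```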
tests/test_merge.py ADDED
@@ -0,0 +1,134 @@
+ import unittest
+
+ from zsgdp.merge.dedupe import dedupe_elements, dedupe_tables
+ from zsgdp.schema import Element, TableObject
+
+
+ class MergeDedupeTests(unittest.TestCase):
+     def test_merges_docling_heading_with_pymupdf_bbox(self):
+         elements = [
+             Element(
+                 element_id="docling_p1_e1",
+                 doc_id="d1",
+                 page_num=1,
+                 type="heading",
+                 text="## Revenue Summary",
+                 markdown="## Revenue Summary",
+                 reading_order=1,
+                 confidence=0.88,
+                 source_parser="docling",
+             ),
+             Element(
+                 element_id="pymupdf_p1_e1",
+                 doc_id="d1",
+                 page_num=1,
+                 type="paragraph",
+                 text="Revenue Summary",
+                 bbox=(72.0, 100.0, 200.0, 124.0),
+                 reading_order=1,
+                 confidence=0.86,
+                 source_parser="pymupdf",
+             ),
+         ]
+
+         deduped = dedupe_elements(elements)
+
+         self.assertEqual(len(deduped), 1)
+         self.assertEqual(deduped[0].source_parser, "docling")
+         self.assertEqual(deduped[0].bbox, (72.0, 100.0, 200.0, 124.0))
+         self.assertEqual(deduped[0].provenance["bbox_source_parser"], "pymupdf")
+
+     def test_drops_paragraph_duplicate_of_structured_table(self):
+         elements = [
+             Element(
+                 element_id="docling_p1_e1",
+                 doc_id="d1",
+                 page_num=1,
+                 type="paragraph",
+                 text="Region Q1 Q2 North America 10 12 Europe 8 7",
+                 reading_order=1,
+                 confidence=0.88,
+                 source_parser="docling",
+             ),
+             Element(
+                 element_id="pymupdf_p1_e1",
+                 doc_id="d1",
+                 page_num=1,
+                 type="table",
+                 markdown="| Region | Q1 | Q2 |\n| --- | --- | --- |\n| North America | 10 | 12 |\n| Europe | 8 | 7 |",
+                 reading_order=1,
+                 confidence=0.72,
+                 source_parser="pymupdf",
+             ),
+         ]
+
+         deduped = dedupe_elements(elements)
+
+         self.assertEqual(len(deduped), 1)
+         self.assertEqual(deduped[0].type, "table")
+
+     def test_merges_duplicate_table_elements_and_keeps_better_grid(self):
+         elements = [
+             Element(
+                 element_id="docling_p1_e3",
+                 doc_id="d1",
+                 page_num=1,
+                 type="table",
+                 markdown="| Region | Q1 | Q2 North America | 10 | 12 Europe | 8 | 7 |\n| --- | --- | --- | --- | --- | --- | --- |",
+                 reading_order=3,
+                 confidence=0.88,
+                 source_parser="docling",
+             ),
+             Element(
+                 element_id="pymupdf_p1_e3",
+                 doc_id="d1",
+                 page_num=1,
+                 type="table",
+                 bbox=(72.0, 144.0, 237.0, 186.0),
+                 markdown="| Region | Q1 | Q2 |\n| --- | --- | --- |\n| North America | 10 | 12 |\n| Europe | 8 | 7 |",
+                 reading_order=3,
+                 confidence=0.72,
+                 source_parser="pymupdf",
+             ),
+         ]
+
+         deduped = dedupe_elements(elements)
+
+         self.assertEqual(len(deduped), 1)
+         self.assertEqual(deduped[0].source_parser, "pymupdf")
+         self.assertEqual(deduped[0].confidence, 0.88)
+         self.assertIn("| North America | 10 | 12 |", deduped[0].markdown or "")
+         self.assertEqual(deduped[0].bbox, (72.0, 144.0, 237.0, 186.0))
+
+     def test_merges_duplicate_tables_and_keeps_better_grid_assets(self):
+         tables = [
+             TableObject(
+                 table_id="docling_t1",
+                 page_nums=[1],
+                 markdown="| Region | Q1 | Q2 North America | 10 | 12 Europe | 8 | 7 |\n| --- | --- | --- | --- | --- | --- | --- |",
+                 confidence=0.84,
+                 source_parser="docling",
+             ),
+             TableObject(
+                 table_id="pymupdf_t1",
+                 page_nums=[1],
+                 bbox=[(72.0, 144.0, 237.0, 186.0)],
+                 markdown="| Region | Q1 | Q2 |\n| --- | --- | --- |\n| North America | 10 | 12 |\n| Europe | 8 | 7 |",
+                 confidence=0.72,
+                 source_parser="pymupdf",
+                 provenance={"crop_path": "/tmp/table.png"},
+             ),
+         ]
+
+         deduped = dedupe_tables(tables)
+
+         self.assertEqual(len(deduped), 1)
+         self.assertEqual(deduped[0].source_parser, "pymupdf")
+         self.assertEqual(deduped[0].confidence, 0.84)
+         self.assertEqual(deduped[0].bbox, [(72.0, 144.0, 237.0, 186.0)])
+         self.assertEqual(deduped[0].provenance["crop_path"], "/tmp/table.png")
+         self.assertEqual(deduped[0].provenance["source_parsers"], ["pymupdf", "docling"])
+
+
+ if __name__ == "__main__":
+     unittest.main()
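
"Better grid" in these fixtures separates a header-and-separator-only blob (the deliberately fused docling markdown) from a table with real body rows. A crude scoring heuristic, offered as one plausible reading rather than the actual dedupe logic, which also merges bbox, confidence, and provenance from the losing candidate as the assertions show:

```python
def grid_score(markdown: str) -> tuple[int, int]:
    """Crude table-grid quality: (has_body_rows, total_rows); higher wins."""
    rows = [line for line in markdown.splitlines() if line.strip().startswith("|")]
    body_rows = max(0, len(rows) - 2)  # rows beyond header + separator
    return (1 if body_rows else 0, len(rows))
```

By this score the pymupdf candidate, at (1, 4), beats docling's fused two-line grid at (0, 2).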
tests/test_parser_disagreement.py ADDED
@@ -0,0 +1,177 @@
+ """Tests for parser-disagreement and repair-success metrics."""
+
+ from __future__ import annotations
+
+ import tempfile
+ import unittest
+ from pathlib import Path
+
+ from zsgdp.merge.conflict_detection import build_candidate_conflict_report
+ from zsgdp.pipeline import parse_document
+ from zsgdp.schema import DocumentProfile, Element, ParseCandidate, PageProfile, TableObject
+ from zsgdp.verify.parser_disagreement import compute_parser_disagreement
+ from zsgdp.verify.repair_success import compute_repair_success
+
+
+ def _profile() -> DocumentProfile:
+     return DocumentProfile(
+         doc_id="d1",
+         source_path="/tmp/d1.md",
+         file_type="markdown",
+         page_count=1,
+         extension=".md",
+         pages=[PageProfile(page_num=1, digital_text_chars=400, digital_text_quality=0.9)],
+     )
+
+
+ def _candidate(name: str, *, text: str, table_count: int = 0) -> ParseCandidate:
+     elements = [
+         Element(
+             element_id=f"{name}_e1",
+             doc_id="d1",
+             page_num=1,
+             type="paragraph",
+             text=text,
+             reading_order=1,
+             source_parser=name,
+         )
+     ]
+     tables: list[TableObject] = []
+     for index in range(table_count):
+         tables.append(
+             TableObject(
+                 table_id=f"{name}_t{index + 1}",
+                 page_nums=[1],
+                 markdown="| A | B |\n| --- | --- |\n| 1 | 2 |",
+                 source_parser=name,
+             )
+         )
+     return ParseCandidate(
+         parser_name=name,
+         doc_id="d1",
+         source_path="/tmp/d1.md",
+         file_type="markdown",
+         elements=elements,
+         tables=tables,
+         figures=[],
+         pages=[{"page_num": 1, "source_parser": name}],
+         confidence=0.8,
+     )
+
+
+ class TestParserDisagreement(unittest.TestCase):
+     def test_disagreement_rate_uses_pair_count_denominator(self):
+         candidates = [
+             _candidate("docling", text="A" * 800, table_count=4),
+             _candidate("pymupdf", text="A" * 100, table_count=0),
+         ]
+         report = build_candidate_conflict_report(candidates)
+         parser_metrics = {
+             "docling": {"parser": "docling", "failed": False},
+             "pymupdf": {"parser": "pymupdf", "failed": False},
+         }
+
+         result = compute_parser_disagreement(report, parser_metrics)
+
+         self.assertEqual(result["candidate_count"], 2)
+         self.assertEqual(result["parser_pair_count"], 1)
+         self.assertGreater(result["conflict_count"], 0)
+         self.assertGreater(result["disagreement_rate"], 0.0)
+         self.assertIn("text_coverage_gap", result["disagreement_by_type"])
+         self.assertIn("docling|pymupdf", result["disagreement_by_parser_pair"])
+
+     def test_disagreement_rate_zero_when_single_parser(self):
+         result = compute_parser_disagreement(
+             {"conflicts": []},
+             {"docling": {"parser": "docling", "failed": False}},
+         )
+
+         self.assertEqual(result["candidate_count"], 1)
+         self.assertEqual(result["parser_pair_count"], 0)
+         self.assertEqual(result["disagreement_rate"], 0.0)
+
+     def test_failed_parsers_excluded_from_pair_count(self):
+         result = compute_parser_disagreement(
+             {"conflicts": []},
+             {
+                 "docling": {"parser": "docling", "failed": False},
+                 "marker": {"parser": "marker", "failed": True, "error": "boom"},
+                 "pymupdf": {"parser": "pymupdf", "failed": False},
+             },
+         )
+
+         self.assertEqual(result["candidate_count"], 2)
+         self.assertEqual(result["parser_pair_count"], 1)
+
+
+ class TestRepairSuccess(unittest.TestCase):
+     def test_resolution_rate_when_blocking_issue_resolved(self):
+         pre = {"score": 0.5, "issues": [{"issue_type": "invalid_table", "blocking": True, "page_num": 1, "region_id": "t1"}]}
+         post = {"score": 0.9, "issues": []}
+         history = [{"iteration": 1, "before_score": 0.5, "after_score": 0.9, "actions": [{"action": "repair_table"}]}]
+
+         result = compute_repair_success(pre, post, history)
+
+         self.assertEqual(result["pre_repair_blocking_count"], 1)
+         self.assertEqual(result["post_repair_blocking_count"], 0)
+         self.assertEqual(result["resolved_blocking_count"], 1)
+         self.assertEqual(result["repair_resolution_rate"], 1.0)
+         self.assertEqual(result["repair_regression_rate"], 0.0)
+         self.assertEqual(result["iteration_count"], 1)
+         self.assertAlmostEqual(result["score_delta"], 0.4, places=6)
+
+     def test_regression_rate_counts_new_blocking_issues(self):
+         pre = {"score": 0.7, "issues": [{"issue_type": "invalid_table", "blocking": True, "region_id": "t1"}]}
+         post = {
+             "score": 0.6,
+             "issues": [
+                 {"issue_type": "invalid_table", "blocking": True, "region_id": "t1"},
+                 {"issue_type": "missing_text_coverage", "blocking": True, "page_num": 2},
+             ],
+         }
+         history = [{"iteration": 1, "before_score": 0.7, "after_score": 0.6, "actions": []}]
+
+         result = compute_repair_success(pre, post, history)
+
+         self.assertEqual(result["resolved_blocking_count"], 0)
+         self.assertEqual(result["regressed_blocking_count"], 1)
+         self.assertEqual(result["repair_regression_rate"], 1.0)
+         self.assertEqual(result["repair_resolution_rate"], 0.0)
+
+     def test_vacuous_success_when_no_pre_repair_blocking_issues(self):
+         result = compute_repair_success(
+             {"score": 1.0, "issues": []},
+             {"score": 1.0, "issues": []},
+             [],
+         )
+
+         self.assertEqual(result["repair_resolution_rate"], 1.0)
+         self.assertEqual(result["repair_regression_rate"], 0.0)
+         self.assertEqual(result["iteration_count"], 0)
+
+
+ class TestRepairSuccessIntegration(unittest.TestCase):
+     def test_pipeline_records_resolution_for_iterative_table_repair(self):
+         with tempfile.TemporaryDirectory() as tmp:
+             input_path = Path(tmp) / "report.md"
+             input_path.write_text(
+                 "# Report\n\n| A | B |\n| --- | --- |\n| 1 | 2 | 3 |\n",
+                 encoding="utf-8",
+             )
+
+             parsed = parse_document(input_path, Path(tmp) / "out")
+
+         metrics = parsed.quality_report.metrics
+         self.assertIn("repair_resolution_rate", metrics)
+         self.assertIn("repair_regression_rate", metrics)
+         self.assertIn("parser_disagreement_rate", metrics)
+
+         success = parsed.provenance["repair_success"]
+         self.assertGreaterEqual(success["pre_repair_issue_count"], 1)
+         self.assertGreaterEqual(success["resolved_any_count"], 1)
+         self.assertGreaterEqual(success["repair_resolution_rate_any"], 0.0)
+         self.assertGreater(success["iteration_count"], 0)
+
+
+ if __name__ == "__main__":
+     unittest.main()
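
The disagreement metric normalizes conflict count by the number of comparable parser pairs, where failed parsers form no pairs, so a single surviving parser scores 0.0 by definition. A sketch of that denominator logic (field names follow the test fixtures; the real function also buckets conflicts by type and by parser pair):

```python
from itertools import combinations


def disagreement_rate(conflicts, parser_metrics):
    """conflicts: list of conflict records; parser_metrics: name -> {"failed": bool}."""
    active = [name for name, m in parser_metrics.items() if not m.get("failed")]
    pair_count = len(list(combinations(active, 2)))  # C(n, 2) comparable pairs
    if pair_count == 0:
        return 0.0  # one parser (or none): nothing to disagree with
    return len(conflicts) / pair_count
```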