---
title: Picarones
emoji: 📜
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# Picarones

OCR Post-Correction Benchmarking Platform for Existing ALTO XML Corpora

Plateforme d'évaluation de pipelines de post-correction OCR sur corpus ALTO XML



Picarones is an open-source platform designed for an institutional context — heritage services, archives, digital libraries that already have a corpus in ALTO XML (output of a prior OCR pipeline) and want to rigorously evaluate post-correction strategies: alternative re-OCR, LLM correction, specialised ALTO mappers, ensemble voting, etc.

This is a benchmarking platform, not a production workshop. Picarones loads an existing ALTO corpus, runs the pipelines the researcher brings, measures every relevant metric, and produces a self-contained HTML report that is factual and reproducible. No prescriptions, no automatic verdicts: the report shows the numbers, the researcher decides.

Key capabilities: heritage-specific metrics (diplomatic CER, ligature and diacritic scores, medieval abbreviations, Roman numerals, foliation, fuzzy full-text searchability, philological marker fidelity), composable pipelines, a factual narrative synthesis at the top of the report, multi-engine Friedman/Nemenyi significance tests with a critical difference diagram, cost / speed / CO₂ Pareto analysis, per-junction error absorption, multi-run stability, controlled per-slot comparison, and several corpus import sources (IIIF, HuggingFace, HTR-United, eScriptorium, ZIP upload).

Version française ci-dessous.


## Use case

An institution (archive, digital library, heritage service) has already OCR'd a corpus of several thousand pages, with output in ALTO XML carrying zone, line and word coordinates. That output has a decent but imperfect CER, with the typical defects: mangled historical ligatures, unexpanded abbreviations and badly recognised proper names.

The institution wants to rigorously compare several post-correction strategies on that existing corpus:

- alternative re-OCR (Pero OCR, Kraken, Mistral OCR…);
- LLM correction (GPT-4o, Claude, Mistral) in text-only or image+text mode;
- specialised ALTO mappers (line re-segmentation, abbreviation expansion, diplomatic normalisation);
- composed pipelines: alternative OCR → LLM correction → ALTO mapper.

Picarones loads the ALTO corpus, runs each pipeline, measures the relevant metrics (CER gain, recovered fuzzy searchability, preserved numerical sequences, errors introduced by the post-corrector — critical for LLMs that silently modernise) and produces a factual HTML report that is directly citable in a scientific publication: every number is traceable to its source payload, no prescription imposed.


## En français

Picarones est une plateforme open source conçue pour un contexte institutionnel — services patrimoniaux, archives, bibliothèques numériques qui disposent déjà d'un corpus en XML ALTO (issu d'une chaîne d'OCR historique) et qui veulent évaluer rigoureusement des pipelines de post-correction : re-OCR alternatif, correction LLM, mappeurs ALTO spécialisés, voting d'ensembles, etc.

C'est un banc d'essai, pas un atelier de production. Picarones charge un corpus ALTO existant, exécute les pipelines que le chercheur amène, mesure toutes les métriques pertinentes et produit un rapport HTML autonome factuel et reproductible. Pas de prescription, pas de classement automatique imposé : le rapport expose les chiffres, le chercheur arbitre.

Métriques spécifiques aux corpus patrimoniaux (CER diplomatique, scores de ligatures, abréviations médiévales, numéraux romains, foliotation, recherchabilité fuzzy plein-texte, fidélité aux marqueurs philologiques), pipelines composables, synthèse narrative factuelle au sommet du rapport, tests Friedman/Nemenyi multi-moteurs avec diagramme de différence critique, analyse Pareto coût/vitesse/CO₂, absorption d'erreur par jonction, stabilité multi-runs, comparaison contrôlée par slot, et plusieurs sources d'import (IIIF, HuggingFace, HTR-United, eScriptorium, upload ZIP).

### Cas d'usage type

Une institution (archive, bibliothèque numérique, service patrimonial) a déjà OCRisé un corpus de plusieurs milliers de pages — sortie au format ALTO XML avec coordonnées de zones, lignes et mots. Cette sortie a un CER honorable mais imparfait, des erreurs typiques sur les ligatures historiques, des abréviations non développées, des noms propres mal reconnus.

L'institution veut comparer rigoureusement plusieurs stratégies de post-correction sur ce corpus existant :

- ré-OCR avec un moteur alternatif (Pero OCR, Kraken, Mistral OCR…) ;
- correction LLM (GPT-4o, Claude, Mistral) en mode texte seul ou image+texte ;
- mappeurs ALTO spécialisés (re-segmentation des lignes, développement des abréviations, normalisation diplomatique) ;
- pipelines composés : OCR alternatif → correction LLM → mappeur ALTO.

Picarones charge le corpus ALTO, exécute chaque pipeline, mesure les métriques pertinentes (gain CER, recherchabilité fuzzy gagnée, séquences numériques préservées, erreurs introduites par le post-correcteur — critique pour les LLM qui modernisent silencieusement) et produit un rapport HTML factuel directement citable dans une publication scientifique : chaque chiffre est traçable au payload source, aucune prescription n'est imposée.


## Table of Contents

- [Features](#features)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Usage](#usage)
- [Supported Engines](#supported-engines)
- [Normalization Profiles](#normalization-profiles)
- [Error Taxonomy](#error-taxonomy)
- [Project Structure](#project-structure)
- [Environment Variables](#environment-variables)
- [CI/CD](#cicd)
- [Development](#development)
- [Roadmap](#roadmap)
- [Known Issues & Improvement Opportunities](#known-issues--improvement-opportunities)
- [Contributing](#contributing)
- [License](#license)


## Features

### Heritage-Specific Metrics

- CER (Character Error Rate) in four variants: raw, NFC-normalized, caseless, and diplomatic (historical equivalences: long s = s, u = v, i = j, etc.); a minimal sketch of the diplomatic variant follows this list
- WER, MER, WIL with history-aware tokenization (via jiwer)
- Unicode confusion matrix -- fingerprint each engine's character-level errors
- Ligature and diacritic scores -- track handling of fi, fl, ff, oe, ae, p-bar, and other medieval glyphs
- 10-class error taxonomy -- automatic classification of every error (visual confusion, abbreviation, segmentation, lacuna, over-normalization, etc.)
- Bootstrap 95% confidence intervals, Wilcoxon signed-rank tests, and the Friedman test + Nemenyi post-hoc with a Critical Difference Diagram (Demšar 2006) for rigorous multi-engine comparison
- Intrinsic difficulty score per document, independent of engine performance
- Line-level error distribution with Gini coefficient and percentile analysis
- VLM hallucination detection -- anchor score and length ratio to flag fabricated output
- Cost / speed / carbon Pareto front (local vs cloud, per-token pricing model)
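
To make the diplomatic variant concrete, here is a minimal sketch of the idea, assuming a toy equivalence table (not the exact table Picarones ships); jiwer is the library the platform already uses for CER/WER:

```python
# Diplomatic CER: fold expected historical variation (long s, u/v, i/j)
# into equivalence classes before measuring, so only genuine OCR errors
# count. The mapping below is illustrative, not Picarones' own table.
import unicodedata
import jiwer

DIPLOMATIC = str.maketrans({"ſ": "s", "v": "u", "j": "i"})

def diplomatic(text: str) -> str:
    return unicodedata.normalize("NFC", text).translate(DIPLOMATIC)

gt, ocr = "iuſte & vraye", "juste & vraye"
print(f"raw CER:        {jiwer.cer(gt, ocr):.3f}")  # > 0: i/j and ſ/s differ
print(f"diplomatic CER: {jiwer.cer(diplomatic(gt), diplomatic(ocr)):.3f}")  # 0.000
```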

### OCR+LLM Pipelines

- Composable chains: tesseract -> gpt-4o, pero_ocr -> claude-sonnet, zero-shot VLM, etc.
- Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot
- Over-normalization detection -- does the LLM silently modernize historical spellings?
- Versioned prompt library for medieval French, early modern French, medieval Latin, medieval English, and early modern English -- both correction and zero-shot variants

### Corpus Import

| Source | Method |
|--------|--------|
| Local folder | `picarones run --corpus ./corpus/` |
| IIIF manifests (institutional repositories) | `picarones import iiif <manifest-url>` |
| Gallica API (SRU + OCR) | `GallicaClient` / `picarones import iiif` |
| HuggingFace Datasets | Web interface (`POST /api/huggingface/import`) |
| HTR-United catalogue | Web interface (`POST /api/htr-united/import`) |
| eScriptorium | `EScriptoriumClient` |
| ZIP upload (browser) | Web interface upload endpoint |

Supported corpus formats: plain text pairs (image + ground truth), ALTO XML, and PAGE XML.
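
For orientation, extracting the plain-text layer from an ALTO file takes only the standard library. This is a bare-bones sketch, not the project's loader (picarones/core/corpus.py also handles coordinates, PAGE XML and multiple ALTO schema versions):

```python
# Pull the text layer out of an ALTO XML file: each TextLine holds
# String elements whose CONTENT attribute carries the recognized word.
import xml.etree.ElementTree as ET

def alto_text(path: str) -> str:
    root = ET.parse(path).getroot()
    # ALTO files are namespaced; reuse whatever namespace the root declares.
    ns = root.tag.split("}")[0] + "}" if root.tag.startswith("{") else ""
    lines = []
    for line in root.iter(f"{ns}TextLine"):
        words = [s.get("CONTENT", "") for s in line.iter(f"{ns}String")]
        lines.append(" ".join(words))
    return "\n".join(lines)

print(alto_text("page_0001.xml"))
```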

### Interactive HTML Report

- Self-contained HTML file -- works offline, no server needed (Jinja2-templated since Sprint 17)
- Factual narrative synthesis at the top of the report (Sprint 19): 12 deterministic detectors extract salient facts (global leader, significant gap, stratum collapse, VLM hallucination flag, speed winner, cost outlier, Pareto alternative, ...) and render them as short sentences -- every number is traceable to the source payload, no LLM, no hallucination risk (see the sketch after this list)
- Critical Difference Diagram (CDD) rendered server-side as static SVG -- no JS required
- Cost / speed / carbon Pareto chart with toggleable axes and a highlighted Pareto front
- Contextual glossary: a "?" icon next to every metric header opens a side panel with definition, what it measures, usage, limits, and reference (25 bilingual entries)
- Advanced mode panel: visible-column picker, per-stratum filter, and an opt-in personal composite score (sliders default to 0, formula always visible, explicit warning that no universal weighting exists). State is persisted in the URL.
- Sortable ranking table, radar charts, histograms (powered by Chart.js)
- Gallery view with dynamic filters and color-coded CER badges
- GitHub-style colored diff with synchronized N-way scrolling
- Triple diff view for OCR+LLM: ground truth / raw OCR / post-correction
- Unicode character view: interactive confusion matrix explorer
- Export to CSV, JSON, ALTO XML, PAGE XML, and annotated images
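
Each narrative detector is plain deterministic code over the computed metrics. The sketch below shows the shape of the idea with invented field names; the real models live in picarones/core/narrative/facts.py and detectors.py:

```python
# One deterministic "fact" detector: no LLM anywhere, and every number
# in the rendered sentence is carried in the payload, hence traceable.
from dataclasses import dataclass

@dataclass
class Fact:
    kind: str
    importance: int   # the arbiter sorts on this
    payload: dict     # every figure quoted in the sentence lives here

def detect_global_leader(mean_cer: dict[str, float]) -> Fact | None:
    """Flag the engine with the lowest mean CER, if any ran."""
    if not mean_cer:
        return None
    engine, cer = min(mean_cer.items(), key=lambda kv: kv[1])
    return Fact("global_leader_cer", importance=100,
                payload={"engine": engine, "cer": round(cer, 4)})

fact = detect_global_leader({"tesseract": 0.081, "kraken": 0.064})
# A YAML template then renders, e.g.:
#   "kraken obtains the lowest mean CER (0.0640)."
```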

### Longitudinal Tracking & Robustness

- Optional SQLite database to record benchmark history across runs
- CER evolution curves over time, per engine
- Automatic regression detection between consecutive runs (see the sketch after this list)
- Robustness analysis: measure engine resilience to noise, blur, rotation, resolution reduction, and binarization
- Critical degradation threshold identification
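
Regression detection reduces to comparing the last two recorded runs per engine. A minimal sketch with an illustrative tolerance; the real logic sits in picarones/core/history.py:

```python
# Flag engines whose CER worsened between two consecutive runs by more
# than a small tolerance (0.005 here is an illustrative default).
def detect_regressions(previous: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 0.005) -> list[str]:
    return [engine for engine, cer in current.items()
            if engine in previous and cer - previous[engine] > tolerance]

print(detect_regressions({"tesseract": 0.080}, {"tesseract": 0.112}))
# ['tesseract']  -- CER rose by 0.032, well past the tolerance
```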

### Web Interface

- FastAPI application with real-time Server-Sent Events (SSE) progress streaming
- Upload a corpus as a ZIP file directly from the browser
- Dynamic engine and normalization profile selectors
- Browse and re-download generated HTML reports
- Bilingual French/English interface
- Deployable on HuggingFace Spaces (Docker, port 7860)

## Quick Start

```bash
# Clone and install
git clone https://github.com/maribakulj/Picarones.git
cd Picarones
pip install -e .

# Install Tesseract (system binary, required for the Tesseract engine)
# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat

# macOS
brew install tesseract

# Generate a demo report (no OCR engine needed)
picarones demo --output demo_report.html

# List available engines
picarones engines

# Run a benchmark
picarones run --corpus ./corpus/ --engines tesseract --output results.json

# Generate an HTML report
picarones report --results results.json --output report.html

# Launch the web interface
picarones serve --port 8080
```

## Installation

### From Source

```bash
git clone https://github.com/maribakulj/Picarones.git
cd Picarones
pip install -e ".[dev,web]"    # includes test and web dependencies
```

System requirements: Python 3.11+; the Tesseract system binary is only needed for the Tesseract engine (see Quick Start).

### Docker

```bash
docker build -t picarones .
docker run -p 7860:7860 \
  -e MISTRAL_API_KEY=... \
  -e OPENAI_API_KEY=... \
  picarones
```

The Docker image is based on Python 3.11-slim, includes Tesseract 5 with language packs (fra, lat, eng, deu, ita, spa), and runs as a non-root user. A health check polls /health every 30 seconds.

The HuggingFace Space uses this same Docker image.

### Optional Extras

| Extra | Install command | What it adds |
|-------|-----------------|--------------|
| `dev` | `pip install -e ".[dev]"` | pytest, pytest-cov, httpx, FastAPI, uvicorn, python-multipart |
| `web` | `pip install -e ".[web]"` | FastAPI, uvicorn, python-multipart, httpx |
| `stats` | `pip install -e ".[stats]"` | scipy (exact Wilcoxon/Friedman/Nemenyi; otherwise a pure-Python fallback is used) |
| `llm` | `pip install -e ".[llm]"` | OpenAI, Anthropic, Mistral SDKs |
| `hf` | `pip install -e ".[hf]"` | HuggingFace Datasets |
| `pero` | `pip install -e ".[pero]"` | Pero OCR engine |
| `kraken` | `pip install -e ".[kraken]"` | Kraken engine |
| `ocr-cloud` | `pip install -e ".[ocr-cloud]"` | Google Vision, AWS (boto3), Azure Doc Intelligence |
| `all` | `pip install -e ".[all]"` | web + hf + llm + dev (no ocr-cloud) |

See INSTALL.md for detailed instructions on Linux, macOS, Windows, and Docker.


## Usage

### CLI Commands

| Command | Description |
|---------|-------------|
| `picarones run` | Run a full benchmark on a corpus |
| `picarones report` | Generate an HTML report from JSON results |
| `picarones demo` | Generate a demo report with synthetic data (no engine required) |
| `picarones metrics` | Calculate CER/WER between two text files |
| `picarones engines` | List all available OCR engines and LLM adapters |
| `picarones info` | Display version and system information |
| `picarones serve` | Launch the FastAPI web interface |
| `picarones history` | Query longitudinal benchmark history (SQLite) |
| `picarones robustness` | Run robustness analysis with degraded images |
| `picarones import iiif` | Import a corpus from an IIIF manifest (any institutional repository) |

HTR-United and HuggingFace imports are exposed through the web interface (`/api/htr-united/import`, `/api/huggingface/import`) rather than the CLI.

Examples:

```bash
# Benchmark with Tesseract, French language, PSM 6
picarones run --corpus ./manuscripts/ --engines tesseract --lang fra --psm 6 \
  --output results.json --verbose

# Compare two text files
picarones metrics --reference ground_truth.txt --hypothesis ocr_output.txt

# Import 10 pages from any IIIF manifest URL
picarones import iiif https://institution.example/iiif/xxx/manifest.json --pages 1-10

# HuggingFace and HTR-United imports are available via the web UI at
#   http://localhost:8000/  (endpoints POST /api/huggingface/import and /api/htr-united/import)

# View benchmark history with regression detection
picarones history --engine tesseract --regression

# Robustness demo (noise, blur, rotation, resolution)
picarones robustness --corpus ./gt/ --engine tesseract --demo

# Fail CI if CER exceeds threshold
picarones run --corpus ./corpus/ --engines tesseract --fail-if-cer-above 0.15
```

### Web Interface

```bash
picarones serve --host 0.0.0.0 --port 8080
```

API endpoints include:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Main single-page application |
| `/api/status` | GET | Version and application status |
| `/api/engines` | GET | Available OCR/LLM engines |
| `/api/normalization/profiles` | GET | Normalization profiles (read dynamically) |
| `/api/benchmark/start` | POST | Start a benchmark job (returns `job_id`) |
| `/api/benchmark/{job_id}/stream` | GET | SSE real-time progress stream |
| `/api/benchmark/{job_id}/cancel` | POST | Cancel a running benchmark |
| `/api/corpus/browse` | GET | Browse server-side corpus folders |
| `/api/htr-united/catalogue` | GET | Browse the HTR-United catalogue |
| `/api/huggingface/search` | GET | Search HuggingFace datasets |
| `/reports/{filename}` | GET | Download generated HTML reports |
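
A script can drive the same endpoints. Here is a sketch of starting a job and following its SSE stream with httpx; the JSON body for `/api/benchmark/start` is hypothetical (inspect the web UI's requests for the real schema), only `job_id` and the endpoint paths are documented above:

```python
# Start a benchmark job, then print raw SSE progress events.
import httpx

BASE = "http://localhost:8080"

job = httpx.post(
    f"{BASE}/api/benchmark/start",
    json={"corpus": "./corpus/", "engines": ["tesseract"]},  # hypothetical body
    timeout=30,
).json()

with httpx.stream("GET", f"{BASE}/api/benchmark/{job['job_id']}/stream",
                  timeout=None) as events:
    for line in events.iter_lines():
        if line.startswith("data:"):  # standard SSE framing
            print(line[len("data:"):].strip())
```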

### Pipeline Modes

Picarones supports three modes for OCR+LLM pipelines:

| Mode | Description | Model type |
|------|-------------|------------|
| `zero_shot` | The LLM receives the image directly and transcribes without prior OCR | VLM (vision) |
| `post_correction_texte` | OCR produces raw text, then the LLM corrects it | Text-only LLM |
| `post_correction_image_texte` | OCR produces raw text, then the LLM receives both image and text for correction | VLM (vision) |

Example: `ministral-3b-latest` is a text-only model and should use `post_correction_texte`. GPT-4o and Claude support all three modes.
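
Stripped of the adapter layer, `post_correction_texte` boils down to one chat call. A sketch with the OpenAI SDK directly; the prompt here is a toy, not one of the shipped templates in picarones/prompts/:

```python
# Text-only post-correction: feed the raw OCR output to an LLM with a
# correction instruction. Picarones wraps this in llm/openai_adapter.py
# and its versioned prompt library.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

raw_ocr = "Les anciens regiftres de la paroiffe"  # long s misread as f

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Correct this early modern French OCR output. Preserve the "
            "historical spelling; fix recognition errors only.\n\n" + raw_ocr
        ),
    }],
)
print(response.choices[0].message.content)
```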


## Supported Engines

| Engine | Type | Execution mode | Installation |
|--------|------|----------------|--------------|
| Tesseract 5 | Local CLI | CPU (ProcessPool) | `pip install pytesseract` + system binary |
| Pero OCR | Local Python | CPU (ProcessPool) | `pip install pero-ocr` |
| Kraken | Local Python | CPU (ProcessPool) | `pip install kraken` |
| Mistral OCR | Cloud API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
| Google Vision | Cloud API | IO (ThreadPool) | `GOOGLE_APPLICATION_CREDENTIALS` env var |
| Azure Doc Intelligence | Cloud API | IO (ThreadPool) | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
| GPT-4o (VLM) | LLM API | IO (ThreadPool) | `OPENAI_API_KEY` env var |
| Claude Sonnet (VLM) | LLM API | IO (ThreadPool) | `ANTHROPIC_API_KEY` env var |
| Mistral Large (LLM) | LLM API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
| Ollama (local LLM) | Local LLM | IO (ThreadPool) | `ollama serve` running locally |
| Custom engine | CLI or API | Configurable | YAML declaration, no code required |

Engines declare their `execution_mode` (`"io"` or `"cpu"`), allowing the runner to use a ThreadPoolExecutor for IO-bound engines and a ProcessPoolExecutor for CPU-bound engines simultaneously, as the sketch below illustrates.
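
A condensed sketch of what that declaration enables; the real orchestration in picarones/core/runner.py also handles streaming, cancellation and error capture, and the `recognize` method name is illustrative:

```python
# Partition engines by declared execution_mode and fan work out to the
# matching executor: processes for CPU-bound engines, threads for
# IO-bound API calls. (Process submission requires picklable engines.)
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def run_all(engines: list, pages: list[str]) -> list:
    cpu = [e for e in engines if e.execution_mode == "cpu"]
    io = [e for e in engines if e.execution_mode == "io"]
    with ProcessPoolExecutor() as procs, ThreadPoolExecutor() as threads:
        futures = [procs.submit(e.recognize, p) for e in cpu for p in pages]
        futures += [threads.submit(e.recognize, p) for e in io for p in pages]
        return [f.result() for f in futures]
```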


## Normalization Profiles

Picarones ships 11 built-in normalization profiles designed for historical text comparison. These reduce noise from expected orthographic variation so metrics reflect genuine OCR errors, not historical spelling differences. The canonical list is defined in `picarones/core/normalization.py` (`NORMALIZATION_PROFILES`) and is exposed dynamically via `/api/normalization/profiles`.

| Profile | Period | Key equivalences |
|---------|--------|------------------|
| `nfc` | Any | Unicode NFC normalization only |
| `caseless` | Any | NFC + case folding (`casefold`) |
| `minimal` | Any | NFC + long s (ſ → s) |
| `medieval_french` | 12th-15th c. | ſ=s, u=v, i=j, y=i, æ=ae, œ=oe, ꝑ=per, &=et |
| `early_modern_french` | 16th-18th c. | ſ=s, æ=ae, œ=oe |
| `medieval_latin` | 12th-15th c. | ſ=s, u=v, i=j, ꝑ=per, ꝓ=pro |
| `medieval_english` | 12th-15th c. | ſ=s, u=v, i=j, þ=th, ȝ=y, ꝑ=per, ꝓ=pro |
| `early_modern_english` | 16th-18th c. | ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y |
| `secretary_hand` | 16th-17th c. | Early Modern English + secretary hand visual confusions |
| `sans_ponctuation` | Any | NFC + strips `. , ; : ! ? ' " - – — ( ) [ ]` |
| `sans_apostrophes` | Any | NFC + strips straight (`'`) and typographic (`’`) apostrophes |

Custom profiles can be loaded from YAML files with user-defined diplomatic tables and/or exclude_chars sets; a hypothetical example follows.
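
The exact schema is owned by picarones/core/normalization.py; the profile below is therefore a guess at the shape, shown only to illustrate the two documented ingredients:

```python
# Hypothetical custom profile: a diplomatic table plus an exclude_chars
# set. Field names are assumptions, not the verified schema.
import yaml

PROFILE_YAML = """
name: occitan_medieval
diplomatic_table:
  "ſ": "s"
  "ꝑ": "per"
exclude_chars: ["¶", "·"]
"""

profile = yaml.safe_load(PROFILE_YAML)

def apply_profile(text: str, profile: dict) -> str:
    for src, dst in profile["diplomatic_table"].items():
        text = text.replace(src, dst)
    drop = set(profile["exclude_chars"])
    return "".join(c for c in text if c not in drop)

print(apply_profile("ſenhor ¶ ꝑ totz", profile))  # -> "senhor  per totz"
```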


## Error Taxonomy

Every character-level error is automatically classified into one of 10 categories:

| Class | Name | Description |
|-------|------|-------------|
| 1 | `visual_confusion` | Morphologically similar characters (rn/m, l/1, O/0, u/n) |
| 2 | `diacritic_error` | Missing, incorrect, or spurious diacritical mark |
| 3 | `case_error` | Case difference only (A/a) |
| 4 | `ligature_error` | Ligature not resolved or incorrectly resolved |
| 5 | `abbreviation_error` | Medieval abbreviation not expanded |
| 6 | `hapax` | Word not found in any reference lexicon |
| 7 | `segmentation_error` | Token fusion or fragmentation (words/lines) |
| 8 | `oov_character` | Character outside the engine's vocabulary |
| 9 | `lacuna` | Text present in ground truth but absent from OCR output |
| 10 | `over_normalization` | LLM silently modernized a historical spelling |
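
To make a few of these concrete, here is a toy classifier for three of the ten classes; the real one in picarones/core/taxonomy.py covers all ten and works on aligned character spans rather than isolated pairs:

```python
# Distinguish case errors, diacritic errors and known visual confusions.
import unicodedata

VISUAL_PAIRS = {("rn", "m"), ("l", "1"), ("O", "0"), ("u", "n")}

def strip_marks(s: str) -> str:
    """Remove combining marks so 'é' and 'e' compare equal."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def classify(gt: str, ocr: str) -> str:
    if gt.casefold() == ocr.casefold():
        return "case_error"
    if strip_marks(gt) == strip_marks(ocr):
        return "diacritic_error"
    if (gt, ocr) in VISUAL_PAIRS or (ocr, gt) in VISUAL_PAIRS:
        return "visual_confusion"
    return "unclassified"

print(classify("A", "a"))    # case_error
print(classify("é", "e"))    # diacritic_error
print(classify("rn", "m"))   # visual_confusion
```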

## Project Structure

```
picarones/
├── __init__.py                 # Version (1.0.0), package metadata
├── __main__.py                 # `python -m picarones`
├── cli.py                      # Click CLI: run, demo, report, metrics, engines, info,
│                               #   serve, import iiif, history, robustness
├── fixtures.py                 # Realistic synthetic test data (medieval documents)
├── i18n.py                     # Back-compat shim loading report/i18n/{fr,en}.json
│
├── core/
│   ├── corpus.py               # Corpus loading (folder, ALTO XML, PAGE XML)
│   ├── metrics.py              # CER, WER, MER, WIL (via jiwer)
│   ├── normalization.py        # Unicode normalization, 11 diplomatic/exclusion profiles
│   ├── statistics.py           # Bootstrap CI, Wilcoxon, Friedman, Nemenyi, CDD SVG
│   ├── runner.py               # Benchmark orchestrator (ThreadPool + ProcessPool)
│   ├── results.py              # DocumentResult, BenchmarkResults, JSON export
│   ├── confusion.py            # Unicode confusion matrix
│   ├── char_scores.py          # Ligature and diacritic scores
│   ├── taxonomy.py             # 10-class error taxonomy
│   ├── structure.py            # Structural analysis (blocks, lines, words)
│   ├── image_quality.py        # Image quality metrics (contrast, noise, resolution)
│   ├── difficulty.py           # Intrinsic difficulty score per document
│   ├── hallucination.py        # VLM hallucination detection
│   ├── line_metrics.py         # Line-level error distribution (Gini, percentiles)
│   ├── history.py              # SQLite longitudinal tracking
│   ├── robustness.py           # Robustness analysis (noise, blur, rotation, resolution)
│   ├── pricing.py              # Cost model, EngineCost, Pareto front
│   └── narrative/              # Factual narrative engine (Sprint 16-19)
│       ├── facts.py            # Fact model, 12 FactType, DetectorRegistry
│       ├── detectors.py        # 12 detectors (global_leader_cer, significant_gap,
│       │                       #   stratum_winner/collapse, error_profile_outlier,
│       │                       #   llm_hallucination_flag, robustness_fragile,
│       │                       #   speed_winner, confidence_warning,
│       │                       #   statistical_tie, pareto_alternative, cost_outlier)
│       ├── arbiter.py          # Sort by importance, dedup, anti-contradiction
│       ├── renderer.py         # YAML template rendering via str.format_map
│       └── templates/{fr,en}.yaml
│
├── data/
│   └── pricing.yaml            # Indicative cost table (OCR local/cloud + LLM)
│
├── engines/
│   ├── base.py                 # BaseOCREngine (execution_mode: "io" | "cpu")
│   ├── tesseract.py            # Tesseract 5 adapter (CPU)
│   ├── pero_ocr.py             # Pero OCR adapter (CPU)
│   ├── mistral_ocr.py          # Mistral OCR API (/v1/ocr endpoint)
│   ├── google_vision.py        # Google Cloud Vision adapter
│   └── azure_doc_intel.py      # Azure Document Intelligence adapter
│
├── llm/
│   ├── base.py                 # BaseLLMAdapter interface
│   ├── openai_adapter.py       # OpenAI / GPT-4o adapter
│   ├── anthropic_adapter.py    # Anthropic / Claude adapter
│   ├── mistral_adapter.py      # Mistral chat completions adapter
│   └── ollama_adapter.py       # Ollama local LLM adapter
│
├── pipelines/
│   ├── base.py                 # OCRLLMPipeline orchestrator
│   └── over_normalization.py   # Over-normalization detection
│
├── prompts/                    # 8 versioned prompt templates
│   ├── correction_medieval_french.txt
│   ├── correction_image_medieval_french.txt
│   ├── correction_imprime_ancien.txt
│   ├── correction_medieval_english.txt
│   ├── correction_early_modern_english.txt
│   ├── zero_shot_medieval_french.txt
│   ├── zero_shot_imprime_ancien.txt
│   └── zero_shot_medieval_english.txt
│
├── report/
│   ├── generator.py            # Orchestrates Jinja2 rendering (617 lines since Sprint 17)
│   ├── diff_utils.py           # Diff computation utilities
│   ├── templates/              # Jinja2 partials (Sprint 17)
│   │   ├── base.html.j2        # assembles everything via {% include %}
│   │   ├── _header.html, _footer.html, _styles.css, _app.js
│   │   ├── _critical_difference.html, _narrative_summary.html, _side_panels.html
│   │   └── view_ranking.html, view_gallery.html, view_document.html,
│   │       view_analyses.html, view_characters.html
│   ├── i18n/                   # FR/EN translations (Sprint 17 -- extracted from i18n.py)
│   │   ├── fr.json
│   │   └── en.json
│   ├── glossary/               # Contextual glossary (Sprint 21)
│   │   ├── fr.yaml             # 25 bilingual entries (definition, measures, usage,
│   │   └── en.yaml             #   limits, reference)
│   └── vendor/                 # Vendored Chart.js
│
├── web/
│   ├── app.py                  # FastAPI app (SSE, ZIP upload, dynamic endpoints)
│   └── static/                 # CSS assets
│
└── importers/
    ├── iiif.py                 # IIIF manifest importer
    ├── gallica.py              # Gallica API client (institutional digital library)
    ├── htr_united.py           # HTR-United catalogue importer
    ├── huggingface.py          # HuggingFace Datasets importer
    └── escriptorium.py         # eScriptorium client

docs/                           # User + developer documentation (Sprint 22)
├── case-studies/               # Two labelled case studies ("Cas d'école")
│   ├── 01-registres-paroissiaux.md
│   └── 02-edition-critique.md
├── user/
│   └── reading-a-report.md     # Anatomy, suggested reading order, advanced panel
└── developer/
    ├── index.md
    ├── narrative-engine.md
    ├── extending-glossary.md
    └── extending-i18n.md

tests/                          # 1242 tests (1 skipped: scipy optional)
.github/workflows/
├── ci.yml                      # CI: Python 3.11/3.12, Linux/macOS/Windows, ruff lint
└── sync_to_huggingface.yml     # Auto-sync to HuggingFace Space on push to main
Dockerfile                      # Multi-stage Docker build for HuggingFace Spaces
```

## Environment Variables

Configure API keys depending on which engines and LLM adapters you use:

```bash
# LLM APIs
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export MISTRAL_API_KEY="..."

# Cloud OCR APIs (optional)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="eu-west-1"
export AZURE_DOC_INTEL_ENDPOINT="https://..."
export AZURE_DOC_INTEL_KEY="..."
```

For deployment on HuggingFace Spaces, set these in Settings > Variables and secrets.


## CI/CD

### GitHub Actions (ci.yml)

- Triggers: push to `main`, `develop`, `feature/*`, `sprint/*`, `claude/*`; PRs to `main` and `develop`; manual dispatch
- Matrix: Python 3.11 + 3.12 on Linux, macOS, and Windows
- Jobs:
  1. Tests -- full pytest suite (1242 passing, 1 skipped when scipy is absent) with coverage uploaded to Codecov
  2. Demo -- end-to-end demo report generation with history and robustness
  3. Build -- wheel and sdist with twine validation
  4. Lint -- `ruff check picarones/ tests/` (E, W, F; ignores E501, E402). The ruff config lives in `pyproject.toml` under `[tool.ruff]`, so CI, `make lint` and direct invocation all produce the same result -- blocking on F401 / E741.

### HuggingFace Sync (sync_to_huggingface.yml)

- Automatically pushes `main` to the HuggingFace Space Ma-Ri-Ba-Ku/Picarones
- Requires the `HF_TOKEN` secret in the GitHub repository settings

## Development

```bash
# Install with dev + web dependencies
pip install -e ".[dev,web]"

# Run the test suite
pytest tests/ -q --tb=short

# Run with coverage
pytest tests/ --cov=picarones --cov-report=term-missing

# Generate a demo report
picarones demo --output demo_report.html

# Launch the web UI in development mode
picarones serve --port 8080

# Full refresh (useful in Codespaces)
git pull && pip install -e ".[dev,web]" && picarones demo --output demo.html
```

Test suite: pytest tests/ -> 1242 passed, 1 skipped (the skip is intentional when the optional scipy extra is not installed).

Key development conventions:

- Never use a bare `except Exception: pass` -- always log with `logger.warning()`
- Normalization profiles are read dynamically from `picarones/core/normalization.py` -- never hardcode them in endpoint handlers
- Engines declare their `execution_mode` (`"io"` or `"cpu"`) so the runner can select the appropriate executor
- `python-multipart` must remain in the dependencies (FastAPI checks for it at import time)

## Roadmap

| Sprint | Status | Deliverables |
|--------|--------|--------------|
| 1 | Done | Project structure, Tesseract, Pero OCR, CER/WER, CLI |
| 2 | Done | HTML report v1: Chart.js, colored diff, gallery |
| 3 | Done | OCR+LLM pipelines, GPT-4o, Claude, Mistral, Ollama |
| 4 | Done | Cloud OCR APIs, IIIF import, diplomatic normalization |
| 5 | Done | Advanced metrics: confusion matrix, ligatures, 9-class taxonomy |
| 6 | Done | FastAPI web interface, HTR-United, HuggingFace, bilingual UI |
| 7 | Done | HTML report v2: Wilcoxon, bootstrap, clustering, difficulty score |
| 8 | Done | eScriptorium, Gallica API, SQLite history, robustness analysis |
| 9 | Done | Documentation, packaging, Docker, CI/CD, PyInstaller, v1.0.0-Beta |
| 10 | Done | Line error distribution (Gini), VLM hallucination detection |
| 11 | Done | Internationalization FR/EN, English normalization profiles |
| 12 | Done | Browser ZIP upload, macOS file filtering, dynamic model selector |
| 13 | Done | pyproject.toml cleanup, runner parallelization, NDJSON streaming, Wilcoxon validation |
| 14 | Done | Robust engine filtering, corpus validation |
| 15 | Done | Fix empty OCR+LLM pipeline output (Mistral ContentChunk normalization, finish_reason logging) |
| 16 | Done | line_metrics + hallucination wired into runner/EngineReport; narrative engine foundations (core/narrative/ with Fact / DetectorRegistry); Pillow getdata -> tobytes, silent excepts -> explicit warnings |
| 17 | Done | Report refactor: generator.py 3690 -> 617 lines via Jinja2; monolithic HTML template split into 10 files under picarones/report/templates/; i18n migrated to report/i18n/{fr,en}.json; +16 non-regression tests |
| 18 | Done | Friedman test + Nemenyi post-hoc + Critical Difference Diagram (Demšar 2006); detect_statistical_tie enabled; SVG rendered server-side; +41 tests |
| 19 | Done | Factual narrative engine complete: 9 new detectors, arbiter (importance + anti-contradiction), YAML template renderer, _narrative_summary.html partial, anti-hallucination traceability test; +32 tests |
| 20 | Done | Cost model + Pareto view: core/pricing.py + data/pricing.yaml, compute_pareto_front, Chart.js Pareto chart with cost/speed/carbon toggles, pareto_alternative and cost_outlier detectors; +28 tests |
| 21 | Done | Contextual glossary (25 bilingual entries) + advanced-mode side panel (visible columns, strata filters, opt-in composite score, URL state persistence); +21 tests |
| 22 | Done | Case studies (docs/case-studies/), user guide (docs/user/reading-a-report.md), three developer guides (docs/developer/); +18 tests |

## Known Issues & Improvement Opportunities

This section captures the findings of the Sprint 22 audit. None of them block the current release (all 1242 tests pass, lint clean), but each represents a sensible next step.

### Architecture / refactor

- `picarones/web/app.py` is 3072 lines (FastAPI routes, corpus upload, SSE, ZIP flattening, HTML delivery and the model registry all in one module). Candidate split: `app_routes.py` / `app_corpus.py` / `app_jobs.py` / `app_models.py`.
- `picarones/core/statistics.py` is 1127 lines mixing bootstrap CI, Wilcoxon, Friedman, the Nemenyi table, the Pareto front and the CDD SVG. Splitting it into `statistics/bootstrap.py`, `statistics/tests.py`, `statistics/pareto.py` and `statistics/cdd_svg.py` would shorten import graphs and ease review.
- `picarones/cli.py` is 971 lines; each Click command could live in its own module under `picarones/cli/` and be registered via `cli.add_command(...)`.
- `picarones/core/runner.py` is 847 lines; the orchestrator is reasonable but edges past the 500-line guideline. Extracting the per-document worker and the partial-NDJSON writer would reduce mental load.
- `picarones/core/narrative/detectors.py` is 680 lines; all 12 detectors live together, and one file per FactType (or per importance tier) would make additions safer.

### Back-compat shim

- `picarones/i18n.py` is a 66-line shim that reads `picarones/report/i18n/{fr,en}.json`. Since Sprint 17 the JSON files are the source of truth, and only `picarones/report/generator.py:654` still imports through the shim. Either promote the shim to `picarones.report.i18n` (renaming the import) or delete the file and import the loader directly.

### Explicit engine declarations

- `MistralOCREngine`, `GoogleVisionEngine` and `AzureDocIntelEngine` inherit the implicit `execution_mode = "io"` default from `BaseOCREngine`. For clarity, and to protect against a future default flip, declare it explicitly (as `TesseractEngine` and `PeroOCREngine` already do for `"cpu"`); see the sketch below.
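
The fix is a one-line class attribute. A sketch, assuming the documented `BaseOCREngine` interface; the method shown is illustrative:

```python
# Explicit is better than inherited: pin the execution mode even though
# it currently matches the base-class default.
from picarones.engines.base import BaseOCREngine

class MistralOCREngine(BaseOCREngine):
    execution_mode = "io"  # explicit; survives a future default flip

    def recognize(self, image_path: str) -> str:
        ...  # call Mistral's /v1/ocr endpoint
```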

### Test coverage gaps

- No dedicated unit tests for `picarones/core/char_scores.py` (it is exercised only transitively).
- No unit tests for the cloud engine adapters themselves (`mistral_ocr.py`, `google_vision.py`, `azure_doc_intel.py`); they are only reached via integration fixtures.
- pytest installed as a uv tool doesn't see project dependencies automatically; document `pip install -e ".[dev,web,stats]"` in the pytest environment, or switch to an in-repo venv, to avoid "ModuleNotFoundError: No module named 'yaml'" surprises.

### Documentation

- CHANGELOG.md stops at Sprint 9 (2025-03). Sprints 10-22 are described in CLAUDE.md and this README but should be back-ported into CHANGELOG.md to follow the Keep a Changelog convention.
- SPECS.md predates the narrative engine, the Pareto view and the glossary; it is worth a pass.
- Some code comments and docstrings are still in French while user-facing strings are bilingual; harmonising module docstrings in English would make the project more contributor-friendly.

### CI / packaging

- sync_to_huggingface.yml uses `git push --force hf main` unconditionally. That is safe today, but worth documenting, because a non-main branch push would silently rewrite the Space.
- picarones.spec (PyInstaller) is still present but not exercised in CI; either add a build-exe job or mark the spec as community-maintained.

### Security (nothing critical)

- ZIP upload flattening in web/app.py rejects absolute paths and `..` traversal but does not check for symlinks inside archives. Python's `zipfile` doesn't extract symlinks, so the risk is theoretical; adding an explicit check on the Unix mode bits stored in `ZipInfo.external_attr` is a belt-and-braces improvement (see the sketch after this list).
- API keys are read from environment variables only (no hardcoded fallback) -- good.
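
The check itself is short. A sketch: the symlink file-type bits (0xA000) live in the high 16 bits of `external_attr`, which `stat.S_ISLNK` tests directly:

```python
# Reject uploaded archives that contain symlink entries before
# extracting anything.
import stat
import zipfile

def has_symlink(archive: zipfile.ZipFile) -> bool:
    return any(stat.S_ISLNK(info.external_attr >> 16)
               for info in archive.infolist())

with zipfile.ZipFile("corpus.zip") as zf:
    if has_symlink(zf):
        raise ValueError("archive contains symlinks; refusing to extract")
```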

## Contributing

See CONTRIBUTING.md for instructions on adding an OCR engine, an LLM adapter, or submitting a pull request.


## License

Apache License 2.0

Copyright 2024 Picarones contributors.