diff --git "a/GENERATION_PIPELINE_DOCUMENTATION.md" "b/GENERATION_PIPELINE_DOCUMENTATION.md"
new file mode 100644
--- /dev/null
+++ "b/GENERATION_PIPELINE_DOCUMENTATION.md"
@@ -0,0 +1,3267 @@
# DocGenie Generation Pipeline & API Documentation

**Version:** 1.0
**Last Updated:** February 7, 2026
**Purpose:** Comprehensive reference for the DocGenie synthetic document generation system

---

## Table of Contents

1. [Overview](#overview)
2. [Pipeline Architecture](#pipeline-architecture)
3. [Pipeline Stages (01-19)](#pipeline-stages-01-19)
4. [API Implementation](#api-implementation)
5. [Core Models & Utilities](#core-models--utilities)
6. [Configuration & Constants](#configuration--constants)
7. [Usage Examples](#usage-examples)
8. [Error Handling & Debugging](#error-handling--debugging)

---

## Overview

DocGenie is a 19-stage pipeline for generating synthetic document datasets with ground truth annotations. It supports multiple document understanding tasks:

- **Document Question Answering (QA)**
- **Key Information Extraction (KIE)**
- **Document Layout Analysis (DLA)**
- **Document Classification (CLS)**

### Key Features

- **LLM-Powered Generation**: Uses Claude, Gemini, and open-source models to generate diverse document content
- **Realistic Handwriting**: Diffusion-model-based handwriting synthesis with author-specific styles
- **Visual Element Integration**: Stamps, logos, barcodes, charts, and photos
- **Multi-Task Support**: Task-specific ground truth formatting and validation
- **Quality Assurance**: Comprehensive validation, OCR verification, and error tracking
- **Modular Design**: Each pipeline stage is independently executable with clear inputs and outputs

### Technology Stack

- **LLM APIs**: Claude (Anthropic), Gemini, DeepSeek, Qwen
- **PDF Rendering**: Playwright (Chromium), PyMuPDF
- **OCR**: Microsoft Azure OCR
- **Handwriting**: Custom diffusion model
- **Image Processing**: PIL, OpenCV
- **API Framework**: FastAPI
- **Data Processing**: Pandas, NumPy

---

## Pipeline Architecture

### High-Level Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                      DOCGENIE GENERATION PIPELINE                       │
└─────────────────────────────────────────────────────────────────────────┘

┌──────────────────────┐
│ PHASE 1: SELECTION   │
└──────────────────────┘
         ↓
[01] Select Seeds ────────────► seeds.csv, clusters.csv
     (Cluster-based diverse seed selection)

┌──────────────────────┐
│ PHASE 2: LLM GEN     │
└──────────────────────┘
         ↓
[02] Prompt LLM ──────────────► batch_results/ (JSON)
     │ (Claude API batched calls)
     ↓
[03] Process Response ────────► raw_html/, raw_annotations/
     (Extract HTML & GT from responses)

┌──────────────────────┐
│ PHASE 3: RENDERING   │
└──────────────────────┘
         ↓
[04] Render PDF Initial ──────► pdf_initial/, geometries/
     │ (HTML→PDF with geometry extraction)
     ↓
[05] Extract BBoxes ──────────► pdf_word_bboxes/, pdf_char_bboxes/
     │ (PyMuPDF text extraction)
     ↓
[06] Extract Layout ──────────► layout_element_definitions/
     (DLA/KIE-specific annotations)

┌──────────────────────┐
│ PHASE 4: EXTRACTION  │
└──────────────────────┘
         ↓
[07] Extract Handwriting ─────► handwriting_definitions/
     │ (Identify handwriting regions)
     ↓
[08] Extract Visual Elements ─► visual_element_definitions/
     (Stamp/logo/barcode placeholders)

┌──────────────────────┐
│ PHASE 5: GENERATION  │
└──────────────────────┘
         ↓
[09] Create Handwriting ──────► handwriting_images/
     │ (Diffusion model generation)
     ↓
[10] Create Visual Elements ──► visual_element_images/
     (Generate/select stamps, logos, etc.)

┌──────────────────────┐
│ PHASE 6: COMPOSITION │
└──────────────────────┘
         ↓
[11] Render PDF (2nd Pass) ───► pdf_without_handwriting_placeholder/
     │ (Remove handwriting placeholders)
     ↓
[12] Insert Handwriting ──────► pdf_with_handwriting/
     │ (Overlay handwriting images)
     ↓
[13] Insert Visual Elements ──► pdf_final/
     │ (Overlay stamps, logos, etc.)
     ↓
[14] Render Image ────────────► images/
     (PDF→PNG conversion)

┌──────────────────────┐
│ PHASE 7: FINALIZATION│
└──────────────────────┘
         ↓
[15] Perform OCR ─────────────► final_word_bboxes/, final_segment_bboxes/
     │ (Microsoft OCR)
     ↓
[16] Normalize BBoxes ────────► normalized_word_bboxes/, normalized_segment_bboxes/
     (Pixel→[0,1] coordinates)

┌──────────────────────┐
│ PHASE 8: VALIDATION  │
└──────────────────────┘
         ↓
[17] GT Preparation ──────────► verified_gt/
     │ (Fuzzy matching, BIO tagging)
     ↓
[18] Analyze ─────────────────► dataset_log.json
     │ (Statistics, cost analysis)
     ↓
[19] Create Debug Data ───────► debug/ subdirectories
     (Visualizations for inspection)
```

### Data Flow Between Stages

```
Seed Images ──┐
              ├──► [02] ──► HTML + GT ──► [04] ──► PDF + Geometries
Prompt Params ┘                            │
                                           ├──► [05] ──► BBoxes
                                           │      │
                                           │      ├──► [07] ──► HW Defs ──► [09] ──► HW Images ──┐
                                           │      │                                              │
                                           │      └──► [08] ──► VE Defs ──► [10] ──► VE Images ──┤
                                           │                                                     │
                                           └──► [11] ──► PDF (no HW) ──┬──► [12] ◄───────────────┤
                                                                       │    (Insert HW)          │
                                                                       └──► [13] ◄───────────────┘
                                                                            (Insert VE)
                                                                              ↓
                                                                            [14] ──► Image
                                                                              ↓
                                                                            [15] ──► OCR BBoxes
                                                                              ↓
                                                                            [16] ──► Normalized
                                                                              ↓
                                                                            [17] ──► Verified GT
```

---

## Pipeline Stages (01-19)

### Stage 01: Select Seeds

**File:** `pipeline_01_select_seeds.py`

**Purpose:** Select diverse seed images from the base dataset using clustering algorithms, to ensure variety in the generated documents.
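The cluster-then-sample idea behind this stage can be sketched as follows. This is a minimal, stdlib-only illustration; `kmeans`, `select_diverse_seeds`, and all parameter names are hypothetical stand-ins, not DocGenie's actual API.

```python
# Hypothetical sketch: cluster document embeddings with a tiny k-means,
# then sample an equal number of seed documents from each cluster.
import random

def kmeans(points, k, iters=20, rng=None):
    """Very small k-means; returns one cluster label per point."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def select_diverse_seeds(embeddings, n_clusters=4, per_cluster=2, rng=None):
    """Sample up to `per_cluster` document indices from each embedding cluster."""
    rng = rng or random.Random(0)
    labels = kmeans(embeddings, n_clusters, rng=rng)
    selected = []
    for c in range(n_clusters):
        members = [i for i, label in enumerate(labels) if label == c]
        selected += rng.sample(members, min(per_cluster, len(members)))
    return selected

# Example with synthetic 8-dim embeddings for 40 documents.
data_rng = random.Random(1)
embeddings = [tuple(data_rng.gauss(0, 1) for _ in range(8)) for _ in range(40)]
seeds = select_diverse_seeds(embeddings, n_clusters=4, per_cluster=2, rng=random.Random(2))
print(seeds)  # at most n_clusters * per_cluster distinct indices
```

Sampling per cluster rather than globally is what keeps rare document styles represented in the seed set even when one cluster dominates the base dataset.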
**Key Functions:**
- `main()`: Orchestrates the seed selection process
- `downscale_and_compress_seeds()`: Prepares seed images for efficient API transmission
- `plot_class_distribution()`: Visualizes class balance in selected seeds
- `visualize_cluster_histogram()`: Shows distribution across clusters

**Process:**
1. Load embeddings from the base dataset
2. Perform clustering (KMeans or other algorithms)
3. Sample N seeds per cluster
4. Downscale and compress images (JPEG, max dimension)
5. Save seed manifest and cluster assignments

**Inputs:**
- `SynDatasetDefinition` configuration
- Base dataset name (e.g., `docvqa`, `cord`, `publaynet`)
- Clustering parameters from constants

**Outputs:**
```
seeds.csv            # Selected seed document IDs per prompt call
clusters.csv         # Cluster assignments for all documents
seeds/ (directory)   # Preprocessed seed images (JPEG, compressed)
```

**Configuration Parameters:**
- `EMBEDDING_MODEL`: Specifies which embedding model was used
- `IMAGE_MAX_DIMENSION`: Max width/height for compression
- `JPEG_QUALITY`: Compression quality (0-100)

**Example Usage:**
```python
from docgenie.generation import pipeline_01_select_seeds

pipeline_01_select_seeds.main(
    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
    base_dataset="docvqa"
)
```

---

### Stage 02: Prompt LLM

**File:** `pipeline_02_prompt_llm.py`

**Purpose:** Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.

**Key Functions:**
- `main()`: Main orchestrator for LLM prompting
- `create_batched_messages()`: Constructs API-compatible message batches
- `track_batch_completion()`: Polls the API for batch status
- Cost calculation utilities in `pipeline_01/cost.py`

**Process:**
1. Load seed images and encode them as base64
2. Build prompts from the template with parameter injection
3. Create batched API requests (Claude Batch API for cost efficiency)
4. Submit batches and track completion
5. Save results for processing in stage 03

**Inputs:**
- Prompt template from `data/prompt_templates//`
- Seed images from stage 01
- API credentials from environment variables
- `SynDatasetDefinition` parameters

**Outputs:**
```
prompt_batches/      # Batch metadata (batch IDs, status)
message_results/     # JSON response files per batch
logs/                # Prompting logs and progress
```

**API Configuration:**
- **Claude:** Uses Batch API with prompt caching for cost efficiency
- **Batch Size:** Configurable via `BATCH_SIZE` constant
- **Polling Interval:** Configurable wait time between status checks
- **Model Selection:** Specified in `SynDatasetDefinition.llm_model`

**Cost Tracking:**
- Input/output token counts per request
- Cached token usage (for Claude)
- Total cost estimation per batch

**Example Configuration:**
```yaml
# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1      # Documents per prompt
language: "English"
doc_type: "business and administrative"
```

---

### Stage 03: Process Response

**File:** `pipeline_03_process_response.py`

**Purpose:** Extract and validate HTML documents and ground truth annotations from LLM responses.

**Key Functions:**
- `main()`: Main processor
- `extract_html_from_message()`: Regex-based HTML extraction from markdown code blocks
- `extract_gt_from_html()`: Parse JSON ground truth from `