diff --git "a/GENERATION_PIPELINE_DOCUMENTATION.md" "b/GENERATION_PIPELINE_DOCUMENTATION.md"
new file mode 100644--- /dev/null
+++ "b/GENERATION_PIPELINE_DOCUMENTATION.md"
@@ -0,0 +1,3267 @@
+# DocGenie Generation Pipeline & API Documentation
+
+**Version:** 1.0  
+**Last Updated:** February 7, 2026  
+**Purpose:** Comprehensive reference for the DocGenie synthetic document generation system
+
+---
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [Pipeline Architecture](#pipeline-architecture)
+3. [Pipeline Stages (01-19)](#pipeline-stages-01-19)
+4. [API Implementation](#api-implementation)
+5. [Core Models & Utilities](#core-models--utilities)
+6. [Configuration & Constants](#configuration--constants)
+7. [Usage Examples](#usage-examples)
+8. [Error Handling & Debugging](#error-handling--debugging)
+
+---
+
+## Overview
+
+DocGenie is a sophisticated 19-stage pipeline for generating synthetic document datasets with ground truth annotations. It supports multiple document understanding tasks:
+
+- **Document Question Answering (QA)**
+- **Key Information Extraction (KIE)**
+- **Document Layout Analysis (DLA)**
+- **Document Classification (CLS)**
+
+### Key Features
+
+- **LLM-Powered Generation**: Uses Claude/Gemini/Open-source models to generate diverse document content
+- **Realistic Handwriting**: Diffusion model-based handwriting synthesis with author-specific styles
+- **Visual Element Integration**: Stamps, logos, barcodes, charts, and photos
+- **Multi-Task Support**: Task-specific ground truth formatting and validation
+- **Quality Assurance**: Comprehensive validation, OCR verification, and error tracking
+- **Modular Design**: Each pipeline stage is independently executable with clear inputs/outputs
+
+### Technology Stack
+
+- **LLM APIs**: Claude (Anthropic), Gemini, DeepSeek, Qwen
+- **PDF Rendering**: Playwright (Chromium), PyMuPDF
+- **OCR**: Microsoft Azure OCR
+- **Handwriting**: Custom diffusion model
+- **Image Processing**: PIL, OpenCV
+- **API Framework**: FastAPI
+- **Data Processing**: Pandas, NumPy
+
+---
+
+## Pipeline Architecture
+
+### High-Level Flow
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                        DOCGENIE GENERATION PIPELINE                      │
+└─────────────────────────────────────────────────────────────────────────┘
+
+┌──────────────────────┐
+│  PHASE 1: SELECTION  │
+└──────────────────────┘
+    ↓
+[01] Select Seeds ────────────► seeds.csv, clusters.csv
+                                (Cluster-based diverse seed selection)
+
+┌──────────────────────┐
+│ PHASE 2: LLM GEN     │
+└──────────────────────┘
+    ↓
+[02] Prompt LLM ──────────────► batch_results/ (JSON)
+    │                           (Claude API batched calls)
+    ↓
+[03] Process Response ────────► raw_html/, raw_annotations/
+                                (Extract HTML & GT from responses)
+
+┌──────────────────────┐
+│ PHASE 3: RENDERING   │
+└──────────────────────┘
+    ↓
+[04] Render PDF Initial ──────► pdf_initial/, geometries/
+    │                           (HTML→PDF with geometry extraction)
+    ↓
+[05] Extract BBoxes ──────────► pdf_word_bboxes/, pdf_char_bboxes/
+    │                           (PyMuPDF text extraction)
+    ↓
+[06] Extract Layout ──────────► layout_element_definitions/
+                                (DLA/KIE-specific annotations)
+
+┌──────────────────────┐
+│ PHASE 4: EXTRACTION  │
+└──────────────────────┘
+    ↓
+[07] Extract Handwriting ─────► handwriting_definitions/
+    │                           (Identify handwriting regions)
+    ↓
+[08] Extract Visual Elements ─► visual_element_definitions/
+                                (Stamp/logo/barcode placeholders)
+
+┌──────────────────────┐
+│ PHASE 5: GENERATION  │
+└──────────────────────┘
+    ↓
+[09] Create Handwriting ──────► handwriting_images/
+    │                           (Diffusion model generation)
+    ↓
+[10] Create Visual Elements ──► visual_element_images/
+                                (Generate/select stamps, logos, etc.)
+
+┌──────────────────────┐
+│ PHASE 6: COMPOSITION │
+└──────────────────────┘
+    ↓
+[11] Render PDF (2nd Pass) ───► pdf_without_handwriting_placeholder/
+    │                           (Remove handwriting placeholders)
+    ↓
+[12] Insert Handwriting ──────► pdf_with_handwriting/
+    │                           (Overlay handwriting images)
+    ↓
+[13] Insert Visual Elements ──► pdf_final/
+    │                           (Overlay stamps, logos, etc.)
+    ↓
+[14] Render Image ────────────► images/
+                                (PDF→PNG conversion)
+
+┌──────────────────────┐
+│ PHASE 7: FINALIZATION│
+└──────────────────────┘
+    ↓
+[15] Perform OCR ─────────────► final_word_bboxes/, final_segment_bboxes/
+    │                           (Microsoft OCR)
+    ↓
+[16] Normalize BBoxes ────────► normalized_word_bboxes/, normalized_segment_bboxes/
+                                (Pixel→[0,1] coordinates)
+
+┌──────────────────────┐
+│ PHASE 8: VALIDATION  │
+└──────────────────────┘
+    ↓
+[17] GT Preparation ──────────► verified_gt/
+    │                           (Fuzzy matching, BIO tagging)
+    ↓
+[18] Analyze ─────────────────► dataset_log.json
+    │                           (Statistics, cost analysis)
+    ↓
+[19] Create Debug Data ───────► debug/ subdirectories
+                                (Visualizations for inspection)
+```
+
+### Data Flow Between Stages
+
+```
+Seed Images ──┐
+              ├──► [02] ──► HTML + GT ──► [04] ──► PDF + Geometries
+Prompt Params ┘                                         │
+                                                        ├──► [05] ──► BBoxes
+                                                        │               │
+                                                        │               ├──► [07] ──► HW Defs ──► [09] ──► HW Images ──┐
+                                                        │               │                                              │
+                                                        │               └──► [08] ──► VE Defs ──► [10] ──► VE Images ──┤
+                                                        │                                                              │
+                                                        └──► [11] ──► PDF (no HW) ──┬──► [12] ◄─────────────────────────┤
+                                                                                    │           (Insert HW)            │
+                                                                                    └──► [13] ◄────────────────────────┘
+                                                                                             (Insert VE)
+                                                                                                  ↓
+                                                                                            [14] ──► Image
+                                                                                                  ↓
+                                                                                            [15] ──► OCR BBoxes
+                                                                                                  ↓
+                                                                                            [16] ──► Normalized
+                                                                                                  ↓
+                                                                                            [17] ──► Verified GT
+```
+
+---
+
+## Pipeline Stages (01-19)
+
+### Stage 01: Select Seeds
+
+**File:** `pipeline_01_select_seeds.py`
+
+**Purpose:** Select diverse seed images from base dataset using clustering algorithms to ensure variety in the generated documents.
+
+**Key Functions:**
+- `main()`: Orchestrates seed selection process
+- `downscale_and_compress_seeds()`: Prepares seed images for efficient API transmission
+- `plot_class_distribution()`: Visualizes class balance in selected seeds
+- `visualize_cluster_histogram()`: Shows distribution across clusters
+
+**Process:**
+1. Load embeddings from base dataset
+2. Perform clustering (KMeans or other algorithms)
+3. Sample N seeds per cluster
+4. Downscale and compress images (JPEG, max dimension)
+5. Save seed manifest and cluster assignments
+
+**Inputs:**
+- `SynDatasetDefinition` configuration
+- Base dataset name (e.g., `docvqa`, `cord`, `publaynet`)
+- Clustering parameters from constants
+
+**Outputs:**
+```
+seeds.csv                    # Selected seed document IDs per prompt call
+clusters.csv                 # Cluster assignments for all documents
+seeds/ (directory)           # Preprocessed seed images (JPEG, compressed)
+```
+
+**Configuration Parameters:**
+- `EMBEDDING_MODEL`: Specifies which embedding model was used
+- `IMAGE_MAX_DIMENSION`: Max width/height for compression
+- `JPEG_QUALITY`: Compression quality (0-100)
+
+**Example Usage:**
+```python
+from docgenie.generation import pipeline_01_select_seeds
+
+pipeline_01_select_seeds.main(
+    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
+    base_dataset="docvqa"
+)
+```
+
+---
+
+### Stage 02: Prompt LLM
+
+**File:** `pipeline_02_prompt_llm.py`
+
+**Purpose:** Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.
+
+**Key Functions:**
+- `main()`: Main orchestrator for LLM prompting
+- `create_batched_messages()`: Constructs API-compatible message batches
+- `track_batch_completion()`: Polls API for batch status
+- Cost calculation utilities in `pipeline_01/cost.py`
+
+**Process:**
+1. Load seed images and encode as base64
+2. Build prompts from template with parameter injection
+3. Create batched API requests (Claude Batch API for cost efficiency)
+4. Submit batches and track completion
+5. Save results for processing in stage 03
+
+**Inputs:**
+- Prompt template from `data/prompt_templates/<template_name>/`
+- Seed images from stage 01
+- API credentials from environment variables
+- `SynDatasetDefinition` parameters
+
+**Outputs:**
+```
+prompt_batches/              # Batch metadata (batch IDs, status)
+message_results/             # JSON response files per batch
+logs/                        # Prompting logs and progress
+```
+
+**API Configuration:**
+- **Claude:** Uses Batch API with prompt caching for cost efficiency
+- **Batch Size:** Configurable via `BATCH_SIZE` constant
+- **Polling Interval:** Configurable wait time between status checks
+- **Model Selection:** Specified in `SynDatasetDefinition.llm_model`
+
+**Cost Tracking:**
+- Input/output token counts per request
+- Cached token usage (for Claude)
+- Total cost estimation per batch
+
+**Example Configuration:**
+```yaml
+# In syn_dataset_definition YAML
+llm_model: "claude-sonnet-4-20250514"
+prompt_template: "DocGenie"
+num_solutions: 1  # Documents per prompt
+language: "English"
+doc_type: "business and administrative"
+```
+
+---
+
+### Stage 03: Process Response
+
+**File:** `pipeline_03_process_response.py`
+
+**Purpose:** Extract and validate HTML documents and ground truth annotations from LLM responses.
+
+**Key Functions:**
+- `main()`: Main processor
+- `extract_html_from_message()`: Regex-based HTML extraction from markdown code blocks
+- `extract_gt_from_html()`: Parse JSON ground truth from `<script id="GT">` tags
+- `validate_and_save_gt()`: Task-specific validation and formatting
+
+**Process:**
+1. Load message results from stage 02
+2. Extract HTML content using regex patterns
+3. Parse and validate ground truth JSON
+4. Apply task-specific formatting (QA, KIE, DLA, CLS)
+5. Save raw HTML and annotations separately
+
+**Inputs:**
+- Message results from `message_results/` (stage 02)
+- `SynDatasetDefinition` task type and prompt format
+
+**Outputs:**
+```
+raw_html/                    # HTML files (one per document)
+raw_annotations/             # Ground truth JSON files
+  qa/                        # QA format: {"question": "answer", ...}
+  kie/                       # KIE format: {"entity": "value", ...}
+  dla/                       # DLA format: {"element_id": "label", ...}
+  cls/                       # CLS format: {"class": "category"}
+logs/message_processing/     # Processing logs per message
+```
+
+**Ground Truth Formats:**
+
+**QA Format:**
+```json
+{
+  "What is the invoice number?": "INV-12345",
+  "What is the total amount?": "$1,234.56"
+}
+```
+
+**KIE Format:**
+```json
+{
+  "company_name": "Acme Corp",
+  "invoice_date": "2024-01-15",
+  "total_amount": "1234.56"
+}
+```
+
+**DLA Format:**
+```json
+{
+  "layout_0": "title",
+  "layout_1": "text",
+  "layout_2": "table"
+}
+```
+
+**Validation Checks:**
+- Expected document count matches actual
+- Valid JSON structure in GT
+- Task-specific field presence
+- HTML completeness (valid tags)
+
+**Error Handling:**
+- Malformed HTML → Logged, skipped
+- Missing GT → Document flagged in logs
+- Invalid JSON → Parsing error logged
+
+---
+
+### Stage 04: Render PDF and Extract Geometries
+
+**File:** `pipeline_04_render_pdf_and_extract_geos.py`
+
+**Purpose:** Convert HTML to PDF with automatic content-based page sizing and extract element geometries (positions, dimensions) for downstream processing.
+
+**Key Functions:**
+- `main()`: Async batch rendering orchestrator
+- `render_pdf_with_playwright()`: Single PDF rendering using Playwright/Chromium
+- `preprocess_css()`: Remove conflicting CSS `@page` rules
+- `validate_pdf()`: Check page count and file integrity
+- `extract_geometries()`: Parse element positions from injected JavaScript
+
+**Process:**
+1. Load raw HTML from stage 03
+2. Inject geometry extraction JavaScript
+3. Preprocess CSS (remove fixed page sizes)
+4. Render with Playwright in headless Chromium
+5. Measure content dimensions dynamically
+6. Export PDF with calculated page size
+7. Extract and save element geometries
+
+**Inputs:**
+- Raw HTML files from `raw_html/` (stage 03)
+
+**Outputs:**
+```
+pdf_initial/                 # Initial PDFs
+geometries/                  # Element geometry JSON files
+  {doc_id}.json              # Contains positions, dimensions, CSS classes
+render_html/                 # HTML with geometry extraction scripts
+debug/pdf_with_geos/         # Debug PDFs with geometry overlays (optional)
+logs/rendering/              # Rendering logs and errors
+```
+
+**Geometry JSON Structure:**
+```json
+{
+  "page_width_mm": 210.0,
+  "page_height_mm": 297.0,
+  "elements": [
+    {
+      "id": "layout_0",
+      "type": "div",
+      "class": "title",
+      "rect": {
+        "x": 20.0,
+        "y": 30.0,
+        "width": 170.0,
+        "height": 15.0
+      },
+      "text": "Invoice",
+      "attributes": {
+        "data-label": "title",
+        "data-handwriting": null,
+        "data-visual-element": null
+      }
+    }
+  ]
+}
+```
+
+**Key Features:**
+- **Automatic Page Sizing:** No fixed dimensions; page size adapts to content
+- **Concurrent Rendering:** Semaphore-controlled parallel processing
+- **Geometry Extraction:** JavaScript injected to capture element positions
+- **CSS Coordinate Conversion:** 96 DPI (CSS) → 72 DPI (PDF)
+- **Retry Logic:** Up to 3 attempts with configurable timeout
+- **Debug Visualizations:** Optional overlay of geometries on PDFs
+
+**Configuration Constants:**
+- `CHROMIUM_CONCURRENCY`: 10 (parallel render limit)
+- `PER_PDF_RENDER_TIMEOUT`: 60 seconds
+- `PER_PDF_RENDER_MAX_RETRIES`: 3 attempts
+- `PDF_POINT_SCALING`: 72/96 for DPI conversion
+
+**Error Handling:**
+- Timeout → Retry with increased timeout
+- Multi-page PDFs → Flagged and skipped
+- Missing geometries → Error logged, document marked invalid
+
+---
+
+### Stage 05: Extract BBoxes from PDF
+
+**File:** `pipeline_05_extract_bboxes_from_pdf.py`
+
+**Purpose:** Extract word-level and character-level bounding boxes from PDFs using PyMuPDF for accurate text positioning.
+
+**Key Functions:**
+- `main()`: Main extraction orchestrator
+- `extract_bboxes_from_pdf()`: PyMuPDF-based extraction
+- `verify_char_to_word_mapping()`: Validate character-to-word relationships
+- `create_bbox_debug_pdf()`: Generate debug visualizations
+
+**Process:**
+1. Load PDFs from stage 04
+2. Extract words with PyMuPDF `get_text("words")`
+3. Extract characters with `get_text("rawdict")`
+4. Map characters to parent words
+5. Validate mappings (required for handwriting splitting)
+6. Save both word and character bboxes
+
+**Inputs:**
+- PDFs from `pdf_initial/` (stage 04)
+
+**Outputs:**
+```
+pdf_word_bboxes/             # Word-level bounding boxes
+  {doc_id}.json              # Format: [x0, y0, x1, y1, text]
+pdf_char_bboxes/             # Character-level bounding boxes (if mappable)
+  {doc_id}.json              # Format: [x0, y0, x1, y1, char, word_idx]
+debug/pdf_bboxes/            # Debug PDFs with bbox overlays (optional)
+logs/bbox_extraction/        # Extraction logs
+```
+
+**BBox JSON Format:**
+
+**Word BBoxes:**
+```json
+[
+  {"x0": 20.0, "y0": 30.0, "x1": 60.0, "y1": 45.0, "text": "Invoice"},
+  {"x0": 65.0, "y0": 30.0, "x1": 95.0, "y1": 45.0, "text": "Date"}
+]
+```
+
+**Char BBoxes (with word mapping):**
+```json
+[
+  {"x0": 20.0, "y0": 30.0, "x1": 25.0, "y1": 45.0, "char": "I", "word_idx": 0},
+  {"x0": 25.0, "y0": 30.0, "x1": 30.0, "y1": 45.0, "char": "n", "word_idx": 0}
+]
+```
+
+**Key Features:**
+- **Dual-Level Extraction:** Both word and character granularity
+- **Mapping Validation:** Ensures characters can be mapped to words (critical for handwriting)
+- **Debug Visualizations:** Color-coded bbox overlays on PDFs
+- **PDF Coordinate System:** Uses 72 DPI (PDF points)
+
+**Character-to-Word Mapping:**
+Required for handwriting splitting in stage 07. If mapping fails (e.g., due to ligatures or complex fonts), char bboxes are not saved, and handwriting will use word-level fallback.
+
+**Error Handling:**
+- Empty PDFs → Skipped with warning
+- Unmappable characters → Word-level fallback
+- Invalid bboxes (negative dimensions) → Filtered out
+
+---
+
+### Stage 06: Extract Layout Element Definitions and Annotation GT
+
+**File:** `pipeline_06_extract_layout_element_definitions_and_annotation_gt.py`
+
+**Purpose:** Extract layout element definitions for Document Layout Analysis (DLA) and annotation-based ground truth for Key Information Extraction (KIE).
+
+**Key Functions:**
+- `main()`: Task router
+- `handle_dla()`: Process DLA tasks
+- `handle_kie()`: Process KIE tasks
+- `parse_layout_elements()`: Extract layout elements from geometries
+- `parse_kie_fields()`: Extract KIE annotated fields from geometries
+
+**Process (DLA):**
+1. Load geometries from stage 04
+2. Filter elements with `data-label` attribute
+3. Validate labels against task-specific valid labels
+4. Extract bounding boxes and labels
+5. Save layout element definitions
+
+**Process (KIE):**
+1. Load geometries from stage 04
+2. Filter elements with `data-annotation` attribute
+3. Extract entity names and text values
+4. Validate against expected entity schema
+5. Save raw annotations for stage 17
+
+**Inputs:**
+- Geometries from `geometries/` (stage 04)
+- Valid labels from `SynDatasetDefinition.valid_labels`
+- Task type from `SynDatasetDefinition.task`
+
+**Outputs:**
+```
+layout_element_definitions/  # For DLA tasks
+  {doc_id}.json              # [{id, label, rect}, ...]
+raw_annotations/             # For KIE tasks
+  kie_annotations/
+    {doc_id}.json            # {entity_name: text_value, ...}
+```
+
+**Layout Element Definition Format (DLA):**
+```json
+[
+  {
+    "id": "layout_0",
+    "label": "title",
+    "rect": {"x": 20.0, "y": 30.0, "width": 170.0, "height": 15.0}
+  },
+  {
+    "id": "layout_1",
+    "label": "text",
+    "rect": {"x": 20.0, "y": 50.0, "width": 170.0, "height": 50.0}
+  }
+]
+```
+
+**KIE Annotation Format:**
+```json
+{
+  "company_name": "Acme Corporation",
+  "invoice_number": "INV-12345",
+  "invoice_date": "January 15, 2024",
+  "total_amount": "$1,234.56"
+}
+```
+
+**Validation Checks:**
+- **Missing labels:** Elements without `data-label` → Warning
+- **Invalid labels:** Labels not in valid_labels → Error
+- **Zero dimensions:** Width or height = 0 → Error
+- **Multiple labels (KIE only):** One element, multiple entities → Error
+- **Missing text:** KIE elements without text → Warning
+
+**Task-Specific Handling:**
+- **DLA:** Converts geometries to `LayoutBBox` format for stage 17
+- **KIE:** Extracts entity-value pairs for BIO tagging in stage 17
+- **QA/CLS:** No processing in this stage (uses raw GT from stage 03)
+
+---
+
+### Stage 07: Extract Handwriting
+
+**File:** `pipeline_07_extract_handwriting.py`
+
+**Purpose:** Identify text regions marked for handwriting generation, map them to character/word bounding boxes, and prepare definitions for the diffusion model.
+
+**Key Functions:**
+- `main()`: Main extraction orchestrator
+- `parse_handwriting_geometries()`: Extract elements with `data-handwriting` attribute
+- `map_handwriting_to_bboxes()`: Match handwriting text to OCR bboxes spatially
+- `split_long_words()`: Split words exceeding diffusion model's character limit
+- `extract_handwriting_style()`: Parse author ID from CSS class
+
+**Process:**
+1. Load geometries from stage 04
+2. Filter elements with `data-handwriting` attribute
+3. Load word/character bboxes from stage 05
+4. Spatially match handwriting regions to bboxes
+5. Split long words for diffusion model constraints
+6. Extract author ID for style consistency
+7. Save handwriting definitions with bbox mappings
+
+**Inputs:**
+- Geometries from `geometries/` (stage 04)
+- Word bboxes from `pdf_word_bboxes/` (stage 05)
+- Character bboxes from `pdf_char_bboxes/` (stage 05, if available)
+
+**Outputs:**
+```
+handwriting_definitions/
+  {doc_id}.json              # Handwriting region definitions
+logs/handwriting_extraction/ # Extraction logs and warnings
+```
+
+**Handwriting Definition Format:**
+```json
+[
+  {
+    "id": "hw0",
+    "text": "John Smith",
+    "author_id": "author1",
+    "bboxes": [
+      "20.0,30.0,60.0,45.0,John",
+      "65.0,30.0,95.0,45.0,Smith"
+    ],
+    "rect": {
+      "x": 20.0,
+      "y": 30.0,
+      "width": 75.0,
+      "height": 15.0
+    },
+    "is_signature": false
+  }
+]
+```
+
+**Key Features:**
+- **Author ID Extraction:** Parses CSS classes like `handwriting-author1` for style consistency
+- **Spatial Matching:** Uses bbox overlap/proximity to map handwriting regions to text
+- **Long Word Splitting:** Splits words > `MAX_HANDWRITING_CHARS` (default: 7) into multiple images
+- **Signature Detection:** Identifies signature fields for special handling
+- **Character-Level Fallback:** Uses word bboxes if char-level mapping unavailable
+
+**Configuration Constants:**
+- `MAX_HANDWRITING_CHARS`: 7 (max characters per diffusion generation)
+- `SPATIAL_MATCH_THRESHOLD`: Pixel tolerance for bbox matching
+
+**Mapping Logic:**
+1. Find all words whose bboxes overlap/intersect with handwriting rect
+2. Extract text from matched bboxes
+3. Compare with expected handwriting text
+4. If char bboxes available: split long words at char boundaries
+5. If char bboxes unavailable: split at word boundaries
+
+**Error Handling:**
+- No bbox matches → Warning, handwriting region skipped
+- Text mismatch → Logged, uses best match
+- Missing char bboxes → Word-level fallback (may affect quality)
+
+---
+
+### Stage 08: Extract Visual Element Definitions
+
+**File:** `pipeline_08_extract_visual_element_definitions.py`
+
+**Purpose:** Extract placeholders for visual elements (stamps, logos, barcodes, charts, photos) from geometries for generation in stage 10.
+
+**Key Functions:**
+- `main()`: Main extraction orchestrator
+- `parse_visual_element_geometries()`: Extract elements with `data-visual-element` attribute
+- `parse_css_dimension()`: Parse width/height from CSS
+- `parse_css_rotation()`: Extract rotation angle from CSS transform
+
+**Process:**
+1. Load geometries from stage 04
+2. Filter elements with `data-visual-element` attribute
+3. Parse element type (stamp, logo, barcode, etc.)
+4. Extract dimensions and rotation from CSS
+5. Parse content (e.g., stamp text, barcode value)
+6. Validate element types and dimensions
+7. Save visual element definitions
+
+**Inputs:**
+- Geometries from `geometries/` (stage 04)
+
+**Outputs:**
+```
+visual_element_definitions/
+  {doc_id}.json              # Visual element definitions
+logs/visual_element_extraction/ # Extraction logs
+```
+
+**Visual Element Definition Format:**
+```json
+[
+  {
+    "id": "ve0",
+    "type": "stamp",
+    "type_unmapped": "stamp",
+    "content": "CONFIDENTIAL",
+    "rect": {
+      "x": 150.0,
+      "y": 250.0,
+      "width": 60.0,
+      "height": 30.0
+    },
+    "rotation": -15.0
+  },
+  {
+    "id": "ve1",
+    "type": "barcode",
+    "type_unmapped": "barcode",
+    "content": "1234567890",
+    "rect": {
+      "x": 20.0,
+      "y": 270.0,
+      "width": 100.0,
+      "height": 20.0
+    },
+    "rotation": 0.0
+  }
+]
+```
+
+**Supported Visual Element Types:**
+- `stamp`: Text-based stamps (e.g., "APPROVED", "CONFIDENTIAL")
+- `logo`: Company/brand logos (selected from prefabs)
+- `barcode`: Code128 barcodes
+- `chart`: Charts and graphs (selected from prefabs)
+- `photo`: Photographic images (selected from prefabs)
+
+**Type Mapping:**
+Maps LLM-generated type names to standard types using `VISUAL_ELEMENT_TYPE_MAPPING` constant.
+
+**Key Features:**
+- **Content Extraction:** Parses text content for stamps, values for barcodes
+- **Rotation Support:** Extracts CSS `rotate()` transform angles
+- **Dimension Parsing:** Handles CSS units (px, mm, %, etc.)
+- **Type Validation:** Warns about unknown types, uses fallback mapping
+
+**Validation Checks:**
+- Unknown type → Logged, mapped to closest known type
+- Zero dimensions → Error, element skipped
+- Invalid rotation angle → Defaults to 0°
+- Missing content (for stamps/barcodes) → Warning
+
+**CSS Parsing Examples:**
+```css
+/* Rotation extraction */
+transform: rotate(-15deg);        → -15.0
+transform: rotate(0.5turn);       → 180.0
+
+/* Dimension extraction */
+width: 60mm;                      → 60.0 (mm)
+height: 30px;                     → Converted to mm
+```
+
+---
+
+### Stage 09: Create Handwriting Images
+
+**File:** `pipeline_09_create_handwriting_images.py`
+
+**Purpose:** Generate realistic handwriting images using a diffusion model, with per-author style consistency.
+
+**Key Functions:**
+- `main()`: Main generation orchestrator
+- `generate_handwriting_diffusion()`: Call diffusion model API/local model
+- `add_handwriting_blur()`: Optional post-processing for realism
+
+**Process:**
+1. Load handwriting definitions from stage 07
+2. Group by author ID for style consistency
+3. For each text segment:
+   - Generate handwriting image with diffusion model
+   - Apply optional blur for realism
+   - Save image with metadata
+4. Track author-to-writer style mapping
+5. Save generation logs
+
+**Inputs:**
+- Handwriting definitions from `handwriting_definitions/` (stage 07)
+- Diffusion model checkpoint (local or API)
+
+**Outputs:**
+```
+handwriting_images/
+  {doc_id}/
+    hw0_0.png               # First bbox of handwriting region hw0
+    hw0_1.png               # Second bbox (if multi-word)
+    hw1_0.png
+logs/handwriting_generation/ # Generation logs with author mappings
+```
+
+**Handwriting Image Specifications:**
+- **Format:** PNG with transparency
+- **Height:** 40 pixels (configurable via `HANDWRITING_HEIGHT_PX`)
+- **Padding:** 0 pixels horizontal (configurable via `HANDWRITING_PADDING_PX`)
+- **Background:** Transparent
+- **Color:** Black text on transparent
+
+**Diffusion Model Integration:**
+Located in `handwriting_diffusion/` module:
+- `generate_handwriting_diffusion_raw.py`: Core generation logic
+- `text_encoder.py`: Text encoding for model input
+- `tokenizer.py`: Character tokenization
+
+**Key Features:**
+- **Author-Style Consistency:** Same author ID → consistent handwriting style
+- **Writer Style Mapping:** Maps author IDs to writer style IDs (e.g., "author1" → writer_42)
+- **Batch Processing:** Generates multiple images efficiently
+- **Optional Blur:** Post-processing for more realistic appearance
+- **Quality Control:** Validates generated images (non-empty, correct dimensions)
+
+**Configuration Constants:**
+- `HANDWRITING_HEIGHT_PX`: 40 pixels
+- `HANDWRITING_PADDING_PX`: 0 pixels
+- `DIFFUSION_NUM_INFERENCE_STEPS`: 50 (generation quality/speed tradeoff)
+- `HANDWRITING_BLUR_ENABLED`: Optional blur toggle
+- `HANDWRITING_STYLES`: List of available writer styles
+
+**Author-to-Writer Mapping Example:**
+```json
+{
+  "author1": "writer_42",
+  "author2": "writer_17",
+  "author3": "writer_89"
+}
+```
+
+**Error Handling:**
+- Generation failure → Retry up to 3 times
+- Empty image → Logged, document flagged
+- Model timeout → Skipped, logged
+
+---
+
+### Stage 10: Create Visual Elements
+
+**File:** `pipeline_10_create_visual_elements.py`
+
+**Purpose:** Generate or select visual element images (stamps, logos, barcodes, charts, photos) based on definitions from stage 08.
+
+**Key Functions:**
+- `main()`: Main generation orchestrator
+- `route_visual_element_generation()`: Router for element type
+- `generate_stamp()`: Create stamp image with text
+- `select_logo()`: Random selection from logo prefabs
+- `generate_barcode()`: Create Code128 barcode
+- `select_photo()`: Random selection from photo prefabs
+- `select_chart()`: Random selection from chart prefabs
+
+**Process:**
+1. Load visual element definitions from stage 08
+2. For each element:
+   - Route to type-specific generator
+   - Generate or select image
+   - Resize to target dimensions
+   - Save with transparency (if applicable)
+3. Cache prefab directories for performance
+4. Save generation logs
+
+**Inputs:**
+- Visual element definitions from `visual_element_definitions/` (stage 08)
+- Prefab directories in `data/visual_element_prefabs/`:
+  - `logos/`
+  - `photos/`
+  - `charts/`
+
+**Outputs:**
+```
+visual_element_images/
+  {doc_id}/
+    ve0.png                 # Stamp image
+    ve1.png                 # Logo image
+    ve2.png                 # Barcode image
+logs/visual_element_generation/ # Generation logs
+```
+
+**Type-Specific Generation:**
+
+**Stamps:**
+- Font: Configurable (default: Arial Bold)
+- Color: Configurable (default: red/blue)
+- Background: Transparent
+- Border: Optional rounded rectangle
+- Rotation: Applied during insertion (stage 13)
+
+**Logos:**
+- Source: Random selection from `data/visual_element_prefabs/logos/`
+- Format: PNG with transparency preserved
+- Caching: Directory contents cached after first scan
+
+**Barcodes:**
+- Type: Code128 (supports alphanumeric)
+- Library: `python-barcode`
+- Background: White
+- Content validation: Numeric or alphanumeric
+
+**Photos:**
+- Source: Random selection from `data/visual_element_prefabs/photos/`
+- Format: JPEG or PNG
+- Aspect ratio: Preserved during resize
+
+**Charts/Figures:**
+- Source: Random selection from `data/visual_element_prefabs/charts/`
+- Format: PNG with transparency
+- Types: Bar charts, line graphs, pie charts, etc.
+
+**Key Features:**
+- **Type-Specific Logic:** Each element type has dedicated generation function
+- **Prefab Caching:** Directory scans cached for performance
+- **Transparent Backgrounds:** Stamps and some logos support transparency
+- **Content Validation:** Barcodes validate numeric content
+- **Aspect Ratio Preservation:** Images scaled without distortion
+
+**Configuration Constants:**
+- `STAMP_FONT_SIZE`: Calculated from target dimensions
+- `STAMP_BORDER_WIDTH`: 2 pixels
+- `BARCODE_DPI`: 300 for high quality
+- `PREFAB_CACHE_SIZE`: In-memory cache limit
+
+**Error Handling:**
+- Missing prefab directory → Error, element skipped
+- Empty prefab directory → Warning, fallback placeholder
+- Invalid barcode content → Logged, uses fallback text
+- Image generation failure → Placeholder created
+
+---
+
+### Stage 11: Render PDF Second Pass
+
+**File:** `pipeline_11_render_pdf_second_pass.py`
+
+**Purpose:** Re-render PDF without handwriting placeholders (replaced with blank spaces) to prepare for handwriting image insertion in stage 12.
+
+**Key Functions:**
+- `main()`: Main rendering orchestrator
+- `render_pdf_playwright_async()`: Async Playwright rendering
+- `remove_handwriting_placeholders_from_html()`: Strip handwriting elements
+
+**Process:**
+1. Load HTML from `render_html/` (stage 04)
+2. Remove all elements with `data-handwriting` attribute
+3. Re-render PDF using same dimensions from stage 04
+4. Validate PDF output
+5. Save PDFs without handwriting placeholders
+
+**Inputs:**
+- HTML from `render_html/` (stage 04)
+- Page dimensions from stage 04 logs
+
+**Outputs:**
+```
+pdf_without_handwriting_placeholder/
+  {doc_id}.pdf              # PDFs with handwriting regions blank
+logs/rendering_second_pass/ # Rendering logs
+```
+
+**Key Differences from Stage 04:**
+- **Uses Pre-calculated Dimensions:** No content measurement needed
+- **Handwriting Elements Removed:** Not just hidden (visibility: hidden), but removed from DOM
+- **No Geometry Extraction:** Geometries already saved in stage 04
+- **Faster Rendering:** No JavaScript injection for measurement
+
+**HTML Preprocessing:**
+```html
+<!-- Before (Stage 04) -->
+<div data-handwriting="author1" class="handwriting">John Smith</div>
+
+<!-- After (Stage 11) -->
+<!-- Element completely removed -->
+```
+
+**Why This Stage Exists:**
+Handwriting placeholders in HTML use system fonts, which don't match the diffusion-generated handwriting. Removing them creates blank spaces where handwriting images will be inserted in stage 12.
+
+**Rendering Configuration:**
+- Same timeout and retry logic as stage 04
+- Reuses Playwright browser context for efficiency
+- No debug output (geometries not extracted)
+
+**Error Handling:**
+- Multi-page PDF → Flagged and skipped
+- Rendering timeout → Retry with increased timeout
+- HTML parsing error → Logged, document marked invalid
+
+---
+
+### Stage 12: Insert Handwriting Images
+
+**File:** `pipeline_12_insert_handwriting_images.py`
+
+**Purpose:** Overlay generated handwriting images onto PDF pages using PyMuPDF, with precise positioning and natural variation.
+
+**Key Functions:**
+- `main()`: Main insertion orchestrator
+- `insert_handwriting_into_pdf()`: Per-document insertion
+- `scale_image_with_aspect_ratio()`: Resize images while preserving aspect
+- `group_bboxes_by_line()`: Group multi-word handwriting by line
+
+**Process:**
+1. Load PDFs from stage 11
+2. Load handwriting images from stage 09
+3. Load handwriting definitions (bbox mappings) from stage 07
+4. For each handwriting region:
+   - Group bboxes by line/block
+   - Scale images to fit bboxes
+   - Apply random offsets for natural variation
+   - Insert images at calculated positions
+5. Save PDFs with handwriting
+
+**Inputs:**
+- PDFs from `pdf_without_handwriting_placeholder/` (stage 11)
+- Handwriting images from `handwriting_images/` (stage 09)
+- Handwriting definitions from `handwriting_definitions/` (stage 07)
+
+**Outputs:**
+```
+pdf_with_handwriting/
+  {doc_id}.pdf              # PDFs with handwriting inserted
+logs/handwriting_insertion/ # Insertion logs
+debug/handwriting_insertion/ # Debug PDFs with bbox overlays (optional)
+```
+
+**Insertion Logic:**
+
+**1. Image Scaling:**
+- High-res scaling: 3x upsampling before insertion
+- Aspect ratio preserved
+- Left-aligned within bbox (respects layout rect)
+
+**2. Positioning:**
+- **X coordinate:** Left edge of bbox + random offset
+- **Y coordinate:** Top edge of bbox + random offset
+- **Multi-word lines:** Consistent Y offset for line
+
+**3. Random Offsets:**
+```python
+x_offset = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X)
+y_offset = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)
+```
+
+**4. Block/Line Grouping:**
+For multi-word handwriting (e.g., "John Smith"):
+- Group bboxes by Y coordinate (same line)
+- Apply consistent Y offset to entire line
+- Individual X offsets per word for natural spacing
+
+**Key Features:**
+- **High-Resolution Insertion:** 3x scaling for quality
+- **Natural Variation:** Random offsets simulate handwriting imperfection
+- **Line Consistency:** Multi-word lines maintain baseline
+- **Aspect Ratio Preservation:** Images not distorted
+- **Transparency Support:** PNG alpha channel preserved
+
+**Configuration Constants:**
+- `HANDWRITING_IMAGE_UPSCALE_FACTOR`: 3x
+- `MAX_HANDWRITING_RAND_X`: ±2 pixels
+- `MAX_HANDWRITING_RAND_Y`: ±1 pixel
+- `HANDWRITING_LINE_Y_CONSISTENCY`: Same Y offset per line
+
+**Coordinate System:**
+- PDF uses 72 DPI (points)
+- Bboxes from stage 07 are in PDF coordinates
+- No conversion needed
+
+**Error Handling:**
+- Missing handwriting image → Warning, bbox skipped
+- Image too large for bbox → Scaled down with warning
+- PyMuPDF insertion failure → Logged, document flagged
+
+---
+
+### Stage 13: Insert Visual Elements
+
+**File:** `pipeline_13_insert_visual_elements.py`
+
+**Purpose:** Overlay visual element images (stamps, logos, barcodes, etc.) onto PDF pages with precise positioning and rotation.
+
+**Key Functions:**
+- `main()`: Main insertion orchestrator
+- `insert_visual_elements_into_pdf()`: Per-document insertion
+- `scale_image_with_aspect_ratio()`: Resize images (same as stage 12)
+
+**Process:**
+1. Load PDFs from stage 12
+2. Load visual element images from stage 10
+3. Load visual element definitions from stage 08
+4. For each visual element:
+   - Scale image to fit bbox
+   - Calculate centered position
+   - Apply rotation (if specified)
+   - Insert image at calculated position
+5. Save final PDFs
+6. If no visual elements: copy PDF from stage 12
+
+**Inputs:**
+- PDFs from `pdf_with_handwriting/` (stage 12)
+- Visual element images from `visual_element_images/` (stage 10)
+- Visual element definitions from `visual_element_definitions/` (stage 08)
+
+**Outputs:**
+```
+pdf_final/
+  {doc_id}.pdf              # Final PDFs with all elements
+logs/visual_element_insertion/ # Insertion logs
+debug/visual_element_insertion/ # Debug PDFs (optional)
+```
+
+**Insertion Logic:**
+
+**1. Image Scaling:**
+- High-res scaling: 3x upsampling
+- Aspect ratio preserved
+- Centered within bbox (not left-aligned like handwriting)
+
+**2. Positioning:**
+- **X coordinate:** Center of bbox - half image width
+- **Y coordinate:** Center of bbox - half image height
+
+**3. Rotation:**
+- Applied via PyMuPDF transformation matrix
+- Rotation around image center
+- Angle from visual element definition
+
+**Key Differences from Stage 12 (Handwriting):**
+- **Centered placement:** Visual elements centered in bbox
+- **No random offsets:** Precise placement for logos/stamps
+- **Rotation support:** Stamps often rotated for "APPROVED" effect
+- **Fallback:** Copies PDF if no visual elements (ensures output exists)
+
+**Rotation Transformation:**
+```python
+# PyMuPDF rotation matrix
+rotation_matrix = fitz.Matrix(rotation_angle)
+image_rect = image_rect * rotation_matrix
+```
+
+**Key Features:**
+- **High-Resolution Insertion:** 3x scaling for quality
+- **Centered Alignment:** Visual elements centered in bboxes
+- **Rotation Support:** Arbitrary angles for stamps
+- **Transparency Preservation:** PNG alpha channel maintained
+- **Fallback Handling:** Copies PDF if no visual elements
+
+**Configuration Constants:**
+- `VISUAL_ELEMENT_UPSCALE_FACTOR`: 3x
+- `ROTATION_PRECISION`: Angle precision in degrees
+
+**Coordinate System:**
+- Same as stage 12 (PDF 72 DPI)
+- Rotation applied after positioning
+
+**Error Handling:**
+- Missing visual element image → Warning, element skipped
+- Image too large for bbox → Scaled down with warning
+- PyMuPDF insertion failure → Logged, document flagged
+- No visual elements → PDF copied from stage 12
+
+---
+
+### Stage 14: Render Image
+
+**File:** `pipeline_14_render_image.py`
+
+**Purpose:** Convert final PDFs to high-quality PNG images for OCR and dataset distribution.
+
+**Key Functions:**
+- `main()`: Main conversion orchestrator
+- `convert_pdf_to_image()`: PDF to PNG conversion
+
+**Process:**
+1. Load PDFs from stage 13
+2. Convert each PDF to PNG using custom PDF-to-image module
+3. Validate image dimensions
+4. Save images
+
+**Inputs:**
+- PDFs from `pdf_final/` (stage 13)
+
+**Outputs:**
+```
+images/
+  {doc_id}.png              # Final document images
+logs/image_rendering/       # Conversion logs
+```
+
+**Image Specifications:**
+- **Format:** PNG
+- **DPI:** Configurable (default: 200 DPI for quality OCR)
+- **Color Mode:** RGB (24-bit)
+- **Compression:** PNG lossless
+
+**PDF-to-Image Module:**
+Located in custom module (not standard library):
+- Handles PDF rendering at specified DPI
+- Single-page conversion only (multi-page PDFs skipped)
+- Uses PDF coordinate system: 72 DPI internally
+
+**Key Features:**
+- **High DPI:** 200+ DPI for accurate OCR
+- **Single Page Only:** Multi-page PDFs flagged as errors
+- **Lossless Compression:** PNG preserves all details
+- **Size Validation:** Checks image dimensions match PDF
+
+**Configuration Constants:**
+- `IMAGE_DPI`: 200 (OCR quality vs file size tradeoff)
+- `IMAGE_MAX_DIMENSION`: Optional max width/height
+
+**Coordinate System Conversion:**
+```
+PDF: 210mm × 297mm @ 72 DPI → 595 × 842 points
+PNG: 210mm × 297mm @ 200 DPI → 1654 × 2339 pixels
+```
+
+**Why This Stage:**
+- OCR performs better on high-DPI images
+- Images are final output format for datasets
+- PNG preserves quality better than JPEG for text
+
+**Error Handling:**
+- Multi-page PDF → Flagged and skipped
+- Conversion failure → Logged, document marked invalid
+- Empty image → Error, document flagged
+- Dimension mismatch → Warning logged
+
+---
+
+### Stage 15: Perform OCR
+
+**File:** `pipeline_15_perform_ocr.py`
+
+**Purpose:** Perform Optical Character Recognition on final images to obtain accurate word and line-level bounding boxes, essential for documents with handwriting or visual elements.
+
+**Key Functions:**
+- `main()`: Main OCR orchestrator
+- `call_microsoft_ocr()`: Microsoft Azure OCR API call
+- `convert_ocr_to_bbox_format()`: Transform OCR results to internal format
+- `aggregate_words_to_segments()`: Group words into lines
+
+**Process:**
+1. Determine which documents need OCR:
+   - Has handwriting → Requires OCR
+   - Has visual elements → Requires OCR
+   - Neither → Copy PDF bboxes from stage 05
+2. For documents requiring OCR:
+   - Call Microsoft OCR service
+   - Parse word-level bboxes
+   - Aggregate into line-level segments
+   - Convert coordinates to PDF space
+3. Save final bboxes
+
+**Inputs:**
+- Images from `images/` (stage 14)
+- Handwriting definitions from `handwriting_definitions/` (stage 07)
+- Visual element definitions from `visual_element_definitions/` (stage 08)
+- PDF bboxes from `pdf_word_bboxes/` (stage 05, for non-OCR documents)
+
+**Outputs:**
+```
+final_word_bboxes/
+  {doc_id}.json             # Word-level bounding boxes
+final_segment_bboxes/
+  {doc_id}.json             # Line-level bounding boxes
+ocr_results_cache/
+  {doc_id}.json             # Raw OCR API responses (cached)
+logs/ocr/                   # OCR logs and errors
+```
+
+**OCR Decision Logic:**
+```python
+requires_ocr = (
+    has_handwriting(doc_id) or 
+    has_visual_elements(doc_id)
+)
+
+if requires_ocr:
+    perform_microsoft_ocr()
+else:
+    copy_pdf_bboxes()  # Reuse stage 05 results
+```
+
+**Why Handwriting/Visual Elements Require OCR:**
+- Handwriting images inserted in stage 12 → Not in PDF text layer
+- Visual elements inserted in stage 13 → Not in PDF text layer
+- PDF text extraction (stage 05) misses these elements
+- OCR captures all visible text on rendered image
+
+**Microsoft OCR API:**
+- **Service:** Azure Computer Vision (Read API)
+- **Input:** PNG image (from stage 14)
+- **Output:** Word polygons, text, confidence scores
+- **Caching:** Results cached to avoid re-processing
+
+**OCR Response Format:**
+```json
+{
+  "readResults": [{
+    "page": 1,
+    "words": [
+      {
+        "boundingBox": [20, 30, 60, 30, 60, 45, 20, 45],
+        "text": "Invoice",
+        "confidence": 0.98
+      }
+    ],
+    "lines": [
+      {
+        "boundingBox": [20, 30, 95, 30, 95, 45, 20, 45],
+        "text": "Invoice Date",
+        "words": [...]
+      }
+    ]
+  }]
+}
+```
+
+**Coordinate Conversion:**
+OCR returns pixel coordinates; converted to PDF points:
+```python
+pdf_x = ocr_x * (pdf_width / image_width)
+pdf_y = ocr_y * (pdf_height / image_height)
+```
+
+**Segment Aggregation:**
+Groups words into lines based on Y coordinate proximity.
+
+**Key Features:**
+- **Selective OCR:** Only runs OCR when necessary (cost/time savings)
+- **Result Caching:** Avoids redundant API calls
+- **Dual-Level Output:** Word and line (segment) bboxes
+- **Coordinate Conversion:** OCR pixels → PDF points
+- **Fallback:** Copies PDF bboxes for documents without handwriting/VEs
+
+**Configuration:**
+- OCR API key from environment variables
+- Timeout: 30 seconds per request
+- Retry: Up to 3 attempts
+
+**Error Handling:**
+- OCR API failure → Retry, fallback to PDF bboxes if all retries fail
+- Empty OCR result → Warning, uses PDF bboxes
+- Coordinate conversion error → Logged, bbox skipped
+
+---
+
+### Stage 16: Normalize BBoxes
+
+**File:** `pipeline_16_normalize_bboxes.py`
+
+**Purpose:** Convert bounding box coordinates from PDF points (absolute pixels) to normalized [0, 1] coordinates for model training and evaluation.
+
+**Key Functions:**
+- `main()`: Main normalization orchestrator
+- `normalize_word_and_segment_bboxes()`: Normalize word/segment bboxes
+- `normalize_layout_bboxes()`: Normalize layout element bboxes (DLA only)
+- `normalize_coordinates()`: Core coordinate transformation
+
+**Process:**
+1. Load final bboxes from stage 15
+2. Load image dimensions (for normalization denominators)
+3. For each bbox:
+   - Transform: `normalized_x = pixel_x / image_width`
+   - Transform: `normalized_y = pixel_y / image_height`
+   - Preserve text content
+4. Save normalized bboxes
+5. For DLA tasks: also normalize layout element bboxes
+
+**Inputs:**
+- Word bboxes from `final_word_bboxes/` (stage 15)
+- Segment bboxes from `final_segment_bboxes/` (stage 15)
+- Layout element definitions from `layout_element_definitions/` (stage 06, for DLA)
+- Image dimensions from stage 14 logs
+
+**Outputs:**
+```
+normalized_word_bboxes/
+  {doc_id}.json             # Normalized word bboxes
+normalized_segment_bboxes/
+  {doc_id}.json             # Normalized segment bboxes
+normalized_gt/              # For DLA tasks only
+  {doc_id}.json             # Normalized layout elements
+logs/normalization/         # Normalization logs
+```
+
+**Normalization Formula:**
+```python
+normalized_bbox = {
+    "x0": bbox["x0"] / image_width,
+    "y0": bbox["y0"] / image_height,
+    "x1": bbox["x1"] / image_width,
+    "y1": bbox["y1"] / image_height,
+    "text": bbox["text"]  # Preserved
+}
+```
+
+**Normalized BBox Format:**
+```json
+[
+  {
+    "x0": 0.095,    # Was: 20 pixels out of 210mm @ 200dpi
+    "y0": 0.128,    # Was: 30 pixels
+    "x1": 0.286,    # Was: 60 pixels
+    "y1": 0.192,    # Was: 45 pixels
+    "text": "Invoice"
+  }
+]
+```
+
+**DLA-Specific Normalization:**
+For Document Layout Analysis tasks, layout element bboxes are also normalized:
+```json
+[
+  {
+    "label": "title",
+    "bbox": [0.095, 0.128, 0.905, 0.192]  # [x0, y0, x1, y1]
+  },
+  {
+    "label": "text",
+    "bbox": [0.095, 0.213, 0.905, 0.534]
+  }
+]
+```
+
+**Why Normalization:**
+- **Model Training:** Most models expect [0, 1] coordinates
+- **Resolution Independence:** Works across different image sizes
+- **Standard Format:** Matches common dataset formats (e.g., LayoutLM)
+
+**Coordinate System Mapping:**
+```
+PDF Points (72 DPI):
+  595 × 842 points (A4)
+  ↓ (stage 14 conversion @ 200 DPI)
+Image Pixels (200 DPI):
+  1654 × 2339 pixels
+  ↓ (stage 16 normalization)
+Normalized [0, 1]:
+  0.0-1.0 × 0.0-1.0
+```
+
+**Key Features:**
+- **Preserves Text:** Text content unchanged during normalization
+- **Task-Specific:** DLA tasks get additional layout bbox normalization
+- **Validation:** Checks for out-of-bounds coordinates (clamps to [0, 1])
+- **Precision:** Full float precision maintained
+
+**Error Handling:**
+- Out-of-bounds coordinates → Clamped to [0, 1] with warning
+- Missing image dimensions → Error, document skipped
+- Zero image dimensions → Error, document marked invalid
+
+---
+
+### Stage 17: GT Preparation & Verification
+
+**File:** `pipeline_17_gt_preparation_verification.py`
+
+**Purpose:** Validate and prepare final ground truth annotations with fuzzy text matching, task-specific formatting, and comprehensive validation.
+
+**Key Functions:**
+- `main()`: Main verification orchestrator
+- `route_task()`: Route to task-specific handler
+- `handle_qa()`: QA ground truth processing
+- `handle_kie()`: KIE ground truth with BIO tagging
+- `handle_dla()`: DLA ground truth processing
+- `fuzzy_match_text()`: Levenshtein-based text matching
+
+**Process:**
+1. Load raw GT from stage 03/06
+2. Load final word bboxes from stage 15
+3. Route to task-specific handler
+4. Perform fuzzy text matching (GT text → OCR text)
+5. Map GT annotations to bbox indices
+6. Apply task-specific formatting
+7. Validate and save verified GT
+
+**Inputs:**
+- Raw GT from `raw_annotations/` (stage 03/06)
+- Word bboxes from `final_word_bboxes/` (stage 15)
+- Layout elements from `layout_element_definitions/` (stage 06, for DLA)
+- Visual elements from `visual_element_definitions/` (stage 08, for DLA)
+
+**Outputs:**
+```
+verified_gt/
+  qa/
+    {doc_id}.json           # QA format
+  kie/
+    {doc_id}.json           # KIE format with BIO tagging
+  dla/
+    {doc_id}.json           # DLA format with normalized bboxes
+  cls/
+    {doc_id}.json           # Classification format
+logs/gt_verification/       # Verification logs with match statistics
+```
+
+---
+
+#### **QA (Question Answering) Format**
+
+**Process:**
+1. Load QA pairs: `{"question": "answer", ...}`
+2. For each answer:
+   - Find words in bboxes matching answer text (fuzzy)
+   - Record bbox indices of matching words
+3. Save verified GT with bbox mappings
+
+**Output Format:**
+```json
+{
+  "questions": [
+    {
+      "question": "What is the invoice number?",
+      "answer": "INV-12345",
+      "answer_bbox_indices": [15, 16]  # Word indices in final_word_bboxes
+    },
+    {
+      "question": "What is the total amount?",
+      "answer": "$1,234.56",
+      "answer_bbox_indices": [42]
+    }
+  ]
+}
+```
+
+**Fuzzy Matching:**
+Uses Levenshtein distance with 0.85 similarity cutoff:
+```python
+similarity = fuzz.ratio(gt_answer, ocr_text) / 100.0
+if similarity >= 0.85:
+    match_found = True
+```
+
+---
+
+#### **KIE (Key Information Extraction) Format**
+
+**Process:**
+1. Load entity annotations: `{entity_name: text_value, ...}`
+2. For each entity:
+   - Find words matching entity value (fuzzy)
+   - Generate BIO tags for all words
+3. Save verified GT with BIO tagging
+
+**Output Format:**
+```json
+{
+  "entities": [
+    {
+      "entity": "company_name",
+      "value": "Acme Corporation",
+      "bbox_indices": [5, 6]
+    },
+    {
+      "entity": "invoice_date",
+      "value": "January 15, 2024",
+      "bbox_indices": [20, 21, 22]
+    }
+  ],
+  "word_labels": [
+    "O", "O", "O", "O", "O",           # Words 0-4: Outside entities
+    "B-company_name", "I-company_name", # Words 5-6: Company name
+    "O", "O", ...,                      # Words 7-19: Outside
+    "B-invoice_date", "I-invoice_date", "I-invoice_date"  # Words 20-22
+  ]
+}
+```
+
+**BIO Tagging:**
+- `B-entity`: Beginning of entity
+- `I-entity`: Inside entity (continuation)
+- `O`: Outside any entity
+
+---
+
+#### **DLA (Document Layout Analysis) Format**
+
+**Process:**
+1. Load layout element definitions from stage 06
+2. Load visual element definitions from stage 08
+3. Validate labels (must be in `valid_labels`)
+4. Check spatial constraints (no containment, minimal overlap)
+5. Merge visual elements into layout annotations
+6. Normalize bboxes to [0, 1]
+7. Save verified GT
+
+**Output Format:**
+```json
+{
+  "layout_elements": [
+    {
+      "id": "layout_0",
+      "label": "title",
+      "bbox": [0.095, 0.128, 0.905, 0.192]  # Normalized [x0, y0, x1, y1]
+    },
+    {
+      "id": "layout_1",
+      "label": "text",
+      "bbox": [0.095, 0.213, 0.905, 0.534]
+    },
+    {
+      "id": "ve0",
+      "label": "figure",  # Visual element mapped to layout label
+      "bbox": [0.714, 0.895, 0.905, 0.980]
+    }
+  ]
+}
+```
+
+**Visual Element Merging:**
+Visual elements from stage 08 are converted to layout labels:
+- `stamp` → `figure` (or custom mapping)
+- `logo` → `figure`
+- `chart` → `figure`
+- `barcode` → `figure`
+- `photo` → `figure`
+
+**Spatial Validation:**
+- **No containment:** One bbox fully inside another → Error
+- **Minimal overlap:** Overlap area < 5% of smaller bbox → Warning
+- **Valid labels:** All labels must be in `SynDatasetDefinition.valid_labels`
+
+---
+
+#### **CLS (Classification) Format**
+
+**Process:**
+1. Load classification label from raw GT
+2. Validate label against expected classes
+3. Save verified GT
+
+**Output Format:**
+```json
+{
+  "document_class": "invoice",
+  "confidence": 1.0
+}
+```
+
+---
+
+**Fuzzy Matching Details:**
+
+Uses Levenshtein distance (via `fuzz` library) to handle OCR discrepancies:
+```python
+# Example: GT "INV-12345" vs OCR "INV-I2345" (OCR error)
+similarity = fuzz.ratio("INV-12345", "INV-I2345") / 100.0
+# similarity = 0.89 (above 0.85 threshold) → Match!
+```
+
+**Match Statistics Logged:**
+- Total GT annotations
+- Successfully matched annotations
+- Failed matches (similarity < 0.85)
+- Average similarity score
+
+**Key Features:**
+- **Fuzzy Matching:** Handles OCR errors gracefully
+- **Task-Specific Formatting:** QA, KIE, DLA, CLS all handled differently
+- **BIO Tagging:** Automatic generation for KIE
+- **Visual Element Integration:** DLA merges visual elements as layout annotations
+- **Spatial Validation:** Detects overlapping/contained layout elements
+- **Similarity Tracking:** Logs match quality for analysis
+
+**Error Handling:**
+- Match failure (similarity < 0.85) → Logged, annotation skipped
+- Invalid labels (DLA) → Error, document marked invalid
+- Spatial violations (DLA) → Warning, elements flagged
+- Missing bboxes → Error, document marked invalid
+
+---
+
+### Stage 18: Analyze
+
+**File:** `pipeline_18_analyze.py`
+
+**Purpose:** Generate comprehensive statistics, cost analysis, and error categorization for the entire dataset generation process.
+
+**Key Functions:**
+- `main()`: Main analysis orchestrator
+- `calculate_api_costs()`: Compute LLM API costs from token usage
+
+**Process:**
+1. Load all document logs from stages 01-17
+2. Categorize documents: valid vs. invalid
+3. Calculate error distributions
+4. Compute API usage and costs
+5. Generate statistics (handwriting, visual elements, annotations)
+6. Save comprehensive dataset log
+
+**Inputs:**
+- All document logs from previous stages
+- Batch results from stage 02 (for cost calculation)
+- Message processing logs from stage 03
+- Prompt usage statistics
+
+**Outputs:**
+```
+dataset_log.json            # Comprehensive dataset statistics
+logs/analysis/              # Analysis logs
+```
+
+**Dataset Log Structure:**
+```json
+{
+  "metadata": {
+    "syndatadef_name": "docvqa_alpha=1.0",
+    "task": "qa",
+    "total_documents_requested": 1000,
+    "generation_date": "2026-02-07"
+  },
+  
+  "prompting": {
+    "total_prompts": 100,
+    "total_batches": 10,
+    "llm_model": "claude-sonnet-4-20250514"
+  },
+  
+  "total_cost_summary": {
+    "total_cost_usd": 123.45,
+    "input_tokens": 1500000,
+    "output_tokens": 800000,
+    "cached_tokens": 500000,
+    "cost_per_document": 0.12
+  },
+  
+  "valid_samples_stats": {
+    "total_valid": 847,
+    "total_invalid": 153,
+    "validity_rate": 0.847,
+    "avg_handwriting_regions_per_doc": 2.3,
+    "avg_visual_elements_per_doc": 1.5,
+    "avg_annotations_per_doc": 8.7,
+    "documents_with_handwriting": 654,
+    "documents_with_visual_elements": 512
+  },
+  
+  "valid_samples": [
+    {
+      "doc_id": "doc_0001",
+      "seed_image": "docvqa_train_12345",
+      "has_handwriting": true,
+      "has_visual_elements": true,
+      "num_annotations": 10,
+      "num_words": 247,
+      "image_size": [1654, 2339]
+    }
+  ],
+  
+  "valid_samples_by_category": {
+    "invoice": 234,
+    "receipt": 198,
+    "form": 415
+  },
+  
+  "errors": {
+    "multipage_pdf": 12,
+    "missing_ocr_result": 5,
+    "failed_gt_verification": 38,
+    "rendering_timeout": 8,
+    "llm_parsing_error": 23,
+    "bbox_extraction_failed": 4,
+    "handwriting_generation_failed": 7,
+    "visual_element_generation_failed": 3,
+    "other": 53
+  }
+}
+```
+
+**Error Categories:**
+
+| Error Category | Description |
+|---------------|-------------|
+| `multipage_pdf` | PDF rendered with multiple pages (invalid) |
+| `missing_ocr_result` | OCR failed or returned empty result |
+| `failed_gt_verification` | GT matching/validation failed in stage 17 |
+| `rendering_timeout` | PDF rendering exceeded timeout |
+| `llm_parsing_error` | Failed to extract HTML/GT from LLM response |
+| `bbox_extraction_failed` | PyMuPDF failed to extract bboxes |
+| `handwriting_generation_failed` | Diffusion model failed |
+| `visual_element_generation_failed` | VE generation/selection failed |
+| `other` | Miscellaneous errors |
+
+**Cost Calculation:**
+
+**Claude API Pricing (example):**
+```python
+costs = {
+    "input": input_tokens * 0.003 / 1000,      # $3 per 1M tokens
+    "output": output_tokens * 0.015 / 1000,    # $15 per 1M tokens
+    "cached": cached_tokens * 0.0003 / 1000    # $0.30 per 1M tokens (prompt caching)
+}
+total_cost = costs["input"] + costs["output"] + costs["cached"]
+```
+
+**Statistics Computed:**
+- **Validity Rate:** Percentage of documents passing all stages
+- **Handwriting Stats:** Documents with handwriting, avg regions per doc
+- **Visual Element Stats:** Documents with VEs, avg elements per doc
+- **Annotation Stats:** Avg QA pairs/KIE entities/DLA elements per doc
+- **Token Usage:** Input/output/cached token totals
+- **Cost Metrics:** Total cost, cost per document, cost per valid document
+
+**Key Features:**
+- **Comprehensive Error Tracking:** All error categories logged
+- **Cost Transparency:** Token-level cost breakdown
+- **Quality Metrics:** Validity rate, avg annotations, etc.
+- **Category Breakdown:** Valid samples grouped by document type
+- **Per-Document Tracking:** Each valid document's metadata saved
+
+**Usage:**
+This log is essential for:
+- Understanding dataset quality
+- Optimizing pipeline (identify bottlenecks)
+- Cost estimation for future runs
+- Debugging (error distribution analysis)
+
+---
+
+### Stage 19: Create Debug Data
+
+**File:** `pipeline_19_create_debug_data.py`
+
+**Purpose:** Generate comprehensive debug visualizations for manual inspection and quality assurance.
+
+**Key Functions:**
+- `main()`: Main debug generator
+- `visualize_visual_element_bboxes()`: Overlay VE bboxes on PDFs
+- `visualize_final_bboxes_on_images()`: Overlay OCR bboxes on images
+- `visualize_pdf_bboxes()`: Overlay PDF bboxes
+
+**Process:**
+1. Load all generated documents
+2. Create debug subdirectories
+3. For each document:
+   - Generate PDF bbox overlays
+   - Generate VE bbox overlays
+   - Generate handwriting insertion region overlays
+   - Generate final OCR bbox overlays on images
+4. Copy raw HTML with debug.js script for browser inspection
+
+**Inputs:**
+- All intermediate and final outputs from stages 01-18
+
+**Outputs:**
+```
+debug/
+  pdf_bboxes/               # PDF word bboxes overlaid (stage 05)
+    {doc_id}.pdf
+  visual_element_bboxes/    # VE bboxes overlaid (stage 08)
+    {doc_id}.pdf
+  handwriting_insertion/    # Handwriting regions overlaid (stage 12)
+    {doc_id}.pdf
+  final_bboxes_on_images/   # OCR bboxes on final images (stage 15)
+    {doc_id}.png
+  html_with_debug/          # Raw HTML + debug.js
+    {doc_id}.html
+    debug.js                # Browser-based inspection script
+```
+
+**Debug Visualizations:**
+
+**1. PDF BBoxes (Stage 05):**
+- Red rectangles: Word bounding boxes from PyMuPDF
+- Annotated with word text
+- Purpose: Verify PDF text extraction quality
+
+**2. Visual Element BBoxes (Stage 08):**
+- Blue rectangles: Visual element placeholder regions
+- Annotated with element type (stamp, logo, etc.)
+- Purpose: Verify VE extraction and positioning
+
+**3. Handwriting Insertion Regions (Stage 12):**
+- Green rectangles: Handwriting bbox regions
+- Annotated with handwriting text
+- Purpose: Verify handwriting placement accuracy
+
+**4. Final BBoxes on Images (Stage 15):**
+- Orange rectangles: OCR word bounding boxes
+- Overlaid on final rendered images
+- Purpose: Verify OCR accuracy and coverage
+
+**Debug JavaScript (debug.js):**
+```javascript
+// Browser-based inspection tool
+// Features:
+// - Highlight elements on hover
+// - Show geometry data in console
+// - Toggle element visibility
+// - Measure element dimensions
+```
+
+**Key Features:**
+- **Color-Coded Overlays:** Different colors for different stages
+- **Text Annotations:** Bboxes labeled with content
+- **Multi-Format:** Both PDF and PNG visualizations
+- **Browser Inspection:** HTML with interactive debug script
+- **Selective Generation:** Only enabled with `DEBUG_MODE=true` flag
+
+**Configuration:**
+- `DEBUG_MODE`: Enable/disable debug output (default: false)
+- `DEBUG_BBOX_LINE_WIDTH`: Line thickness for overlays (default: 2)
+- `DEBUG_BBOX_OPACITY`: Overlay transparency (default: 0.5)
+
+**When to Use:**
+- Visual quality assurance
+- Debugging bbox extraction issues
+- Verifying handwriting/VE insertion
+- Identifying OCR problems
+- Manual inspection of edge cases
+
+**Performance Note:**
+Debug generation adds ~20% processing time; disabled by default in production.
+
+---
+
+## API Implementation
+
+The DocGenie API provides a FastAPI-based REST service for synchronous document generation, integrating pipeline stages 01-06.
+
+### Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                         API ARCHITECTURE                         │
+└─────────────────────────────────────────────────────────────────┘
+
+Client Request
+     │
+     ↓
+┌─────────────────────┐
+│   FastAPI Server    │
+│   (main.py)         │
+└─────────────────────┘
+     │
+     ├──► Validate Request (schemas.py)
+     │
+     ├──► Download Seed Images (utils.py)
+     │
+     ├──► Build Prompt (utils.py)
+     │
+     ├──► Call Claude API (Synchronous)
+     │    └──► Claude Sonnet 4.5
+     │
+     ├──► Extract HTML & GT (pipeline_03 functions)
+     │
+     ├──► Render PDF (pipeline_04 functions)
+     │    └──► Playwright/Chromium
+     │
+     ├──► Extract BBoxes (pipeline_05 functions)
+     │    └──► PyMuPDF
+     │
+     └──► Return Response
+          └──► JSON with base64-encoded PDFs
+```
+
+---
+
+### Endpoints
+
+#### **GET /**
+Health check endpoint.
+
+**Response:**
+```json
+{
+  "message": "DocGenie API is running",
+  "version": "1.0",
+  "status": "healthy"
+}
+```
+
+---
+
+#### **GET /health**
+Detailed health status.
+
+**Response:**
+```json
+{
+  "status": "healthy",
+  "api_version": "1.0",
+  "playwright_available": true,
+  "llm_model": "claude-sonnet-4-5-20250929"
+}
+```
+
+---
+
+#### **POST /generate**
+Generate documents with ground truth annotations.
+
+**Request Body** (`schemas.GenerateDocumentRequest`):
+```json
+{
+  "seed_images": [
+    "https://example.com/seed1.jpg",
+    "https://example.com/seed2.jpg"
+  ],
+  "prompt_params": {
+    "language": "English",
+    "doc_type": "business and administrative",
+    "gt_type": "Multiple questions and their answers",
+    "gt_format": "{\"question\": \"answer\", ...}",
+    "num_solutions": 2
+  }
+}
+```
+
+**Request Validation:**
+- `seed_images`: 1-10 URLs (HTTPS only)
+- `num_solutions`: 1-5 documents
+- All prompt parameters required
+
+**Response** (`schemas.GenerateDocumentResponse`):
+```json
+{
+  "success": true,
+  "message": "Successfully generated 2 documents",
+  "documents": [
+    {
+      "document_id": "doc_20260207_001",
+      "html": "<html>...</html>",
+      "css": "body { ... }",
+      "ground_truth": {
+        "What is the invoice number?": "INV-12345"
+      },
+      "pdf_base64": "JVBERi0xLjQK...",
+      "bboxes": [
+        {
+          "x0": 20.0,
+          "y0": 30.0,
+          "x1": 60.0,
+          "y1": 45.0,
+          "text": "Invoice"
+        }
+      ],
+      "page_width_mm": 210.0,
+      "page_height_mm": 297.0
+    }
+  ],
+  "total_documents": 2
+}
+```
+
+---
+
+#### **POST /generate-files**
+Generate documents and return as downloadable files.
+
+**Request:** Same as `/generate`
+
+**Response:** 
+- **Content-Type:** `application/zip`
+- **File:** ZIP archive containing:
+  - `doc_001.pdf`
+  - `doc_001_gt.json`
+  - `doc_001_bboxes.json`
+  - `doc_002.pdf`
+  - ...
+
+**File Structure:**
+```
+generated_documents.zip
+├── doc_001.pdf
+├── doc_001_gt.json
+├── doc_001_bboxes.json
+├── doc_001_metadata.json
+├── doc_002.pdf
+├── ...
+```
+
+---
+
+### Request/Response Schemas
+
+Defined in `api/schemas.py`:
+
+#### **PromptParameters**
+```python
+class PromptParameters(BaseModel):
+    language: str = "English"
+    doc_type: str = "business and administrative"
+    gt_type: str = "Multiple questions and their answers"
+    gt_format: str = '{"question": "answer", ...}'
+    num_solutions: int = Field(default=1, ge=1, le=5)
+```
+
+#### **GenerateDocumentRequest**
+```python
+class GenerateDocumentRequest(BaseModel):
+    seed_images: List[HttpUrl] = Field(..., min_items=1, max_items=10)
+    prompt_params: PromptParameters
+```
+
+#### **BoundingBox**
+```python
+class BoundingBox(BaseModel):
+    x0: float
+    y0: float
+    x1: float
+    y1: float
+    text: str
+```
+
+#### **DocumentResult**
+```python
+class DocumentResult(BaseModel):
+    document_id: str
+    html: str
+    css: str
+    ground_truth: Optional[dict]
+    pdf_base64: str
+    bboxes: List[BoundingBox]
+    page_width_mm: float
+    page_height_mm: float
+```
+
+#### **GenerateDocumentResponse**
+```python
+class GenerateDocumentResponse(BaseModel):
+    success: bool
+    message: str
+    documents: List[DocumentResult]
+    total_documents: int
+```
+
+---
+
+### API Pipeline Flow
+
+Detailed integration with pipeline stages:
+
+```python
+# Simplified API flow (from api/main.py)
+
+@app.post("/generate", response_model=GenerateDocumentResponse)
+async def generate_documents(request: GenerateDocumentRequest):
+    # 1. Download and encode seed images
+    seed_images_base64 = await download_and_encode_images(request.seed_images)
+    
+    # 2. Build prompt from template
+    prompt = build_prompt_from_template(
+        seed_images=seed_images_base64,
+        params=request.prompt_params
+    )
+    
+    # 3. Call Claude API (synchronous, not batched)
+    llm_response = await call_claude_api(
+        prompt=prompt,
+        model="claude-sonnet-4-5-20250929"
+    )
+    
+    # 4. Extract HTML and GT (from pipeline_03)
+    documents_html = extract_html_from_message(llm_response)
+    documents_gt = extract_gt_from_html(documents_html)
+    
+    # 5. Validate HTML (from utils.py)
+    validate_html_structure(documents_html)
+    
+    # 6. Render PDFs (from pipeline_04)
+    pdfs = await render_pdfs_with_playwright(documents_html)
+    
+    # 7. Validate PDFs (from utils.py)
+    validate_pdf_pages(pdfs)
+    
+    # 8. Extract bboxes (from pipeline_05)
+    bboxes = extract_bboxes_from_pdfs(pdfs)
+    
+    # 9. Validate bboxes (from utils.py)
+    validate_bbox_completeness(bboxes)
+    
+    # 10. Encode PDFs to base64
+    pdfs_base64 = encode_pdfs_to_base64(pdfs)
+    
+    # 11. Build response
+    return GenerateDocumentResponse(
+        success=True,
+        documents=[...],
+        total_documents=len(documents_html)
+    )
+```
+
+---
+
+### Integration with Pipeline Functions
+
+**Reused from Pipeline:**
+- `extract_html_from_message()` from `pipeline_03`
+- `extract_gt_from_html()` from `pipeline_03`
+- `render_pdf_with_playwright()` from `pipeline_04`
+- `extract_bboxes_from_pdf()` from `pipeline_05`
+- Various utilities from `docgenie.generation.utils`
+
+**API-Specific Functions (api/utils.py):**
+- `download_seed_images()`: Fetch images from URLs
+- `encode_images_to_base64()`: Convert images for API transmission
+- `build_prompt_from_template()`: Template-based prompt construction
+- `call_claude_api_sync()`: Synchronous Claude API call (non-batched)
+- `encode_pdf_to_base64()`: PDF encoding for response
+- `validate_html_structure()`: HTML validation
+- `validate_pdf_pages()`: PDF page count/size validation
+- `validate_bbox_completeness()`: Ensure bboxes extracted
+
+---
+
+### Configuration
+
+#### **Environment Variables (.env)**
+```bash
+ANTHROPIC_API_KEY=sk-ant-...            # Required for Claude API
+LLM_MODEL=claude-sonnet-4-5-20250929   # Default model
+API_PORT=8000                           # Server port
+DEBUG_MODE=false                        # Enable debug logging
+```
+
+#### **Prompt Templates**
+Located in `data/prompt_templates/`:
+
+**Template Structure:**
+```
+data/prompt_templates/<template_name>/
+├── system_prompt.txt                   # System message
+├── user_prompt_template.txt            # User message template
+└── example_output.html                 # Example for few-shot
+```
+
+**Placeholder Substitution:**
+```python
+# In user_prompt_template.txt
+"""
+Please generate {num_solutions} {doc_type} documents in {language}.
+
+Ground truth format: {gt_format}
+Ground truth type: {gt_type}
+"""
+
+# Substituted with request.prompt_params
+```
+
+#### **CORS Configuration**
+```python
+# In api/main.py
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],        # Open for development
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+```
+
+---
+
+### Authentication & Security
+
+**Current State:**
+- **No built-in authentication:** API is open (for development)
+- **API Key Management:** Claude API key in `.env`, not passed in requests
+- **CORS:** Open (`allow_origins=["*"]`) for development
+
+**Production Recommendations:**
+- Add API key authentication (e.g., Bearer tokens)
+- Restrict CORS origins to known frontends
+- Rate limiting (e.g., Redis-based)
+- Input sanitization for HTML injection prevention
+- HTTPS only (terminate SSL at reverse proxy)
+
+---
+
+### Performance Considerations
+
+**Async Rendering:**
+- Playwright rendering uses async/await
+- Concurrent requests supported by FastAPI
+- Semaphore control prevents resource exhaustion
+
+**Image Size Limits:**
+- Seed images compressed before API transmission
+- Max dimension: 1024px (configurable)
+- JPEG quality: 85% (configurable)
+
+**Timeouts:**
+- PDF render timeout: 60 seconds per document
+- Claude API timeout: 120 seconds
+- Total request timeout: 300 seconds (5 minutes)
+
+**Retry Logic:**
+- PDF rendering: Up to 3 attempts
+- Claude API: Up to 2 retries on network errors
+- No retry on validation failures
+
+**Concurrency:**
+- FastAPI default: Multiple workers (configurable)
+- Playwright: Semaphore-controlled (10 concurrent renders)
+
+---
+
+### Limitations
+
+**Current Limitations:**
+1. **Single-Page Only:** Multi-page PDFs flagged as errors
+2. **Seed Image Limit:** Maximum 10 seed images per request
+3. **Document Limit:** Maximum 5 document variations (`num_solutions`)
+4. **No Handwriting/Visual Elements:** Stages 07-13 not integrated (API stops at stage 06)
+5. **Synchronous LLM:** No batching (higher cost per document)
+6. **No Dataset Export:** No `/export-dataset` endpoint for full pipeline runs
+
+**Known Issues:**
+- Large documents (>10 pages worth of content) may timeout
+- Complex CSS (animations, 3D transforms) may not render correctly
+- Some Unicode characters may not display in PDFs
+
+---
+
+### Future Integration Plan
+
+From `api/PIPELINE_INTEGRATION.md`:
+
+#### **Stage 3: Handwriting & Visual Elements (Stages 07-11)**
+
+**New Request Parameters:**
+```python
+class PromptParameters(BaseModel):
+    # ... existing ...
+    enable_handwriting: bool = False
+    handwriting_ratio: float = Field(default=0.2, ge=0.0, le=1.0)
+    enable_visual_elements: bool = False
+    visual_element_types: List[str] = ["stamp", "logo"]
+```
+
+**New Response Fields:**
+```python
+class DocumentResult(BaseModel):
+    # ... existing ...
+    handwriting_regions: Optional[List[HandwritingRegion]]
+    visual_elements: Optional[List[VisualElement]]
+```
+
+**Impact:**
+- Longer processing time (diffusion model: ~5s per handwriting region)
+- Larger response size (additional images)
+
+---
+
+#### **Stage 4: Image Finalization & OCR (Stages 12-15)**
+
+**New Response Fields:**
+```python
+class DocumentResult(BaseModel):
+    # ... existing ...
+    image_base64: str                    # Final rendered image (PNG)
+    ocr_text: str                        # Full OCR text
+    ocr_confidence: float                # Average OCR confidence
+```
+
+**Impact:**
+- OCR API costs (~$1.50 per 1000 images)
+- Additional 2-3 seconds per document (OCR latency)
+
+---
+
+#### **Stage 5: Dataset Packaging (Stages 16-19)**
+
+**New Endpoint:**
+```python
+@app.post("/export-dataset")
+async def export_dataset(request: ExportDatasetRequest):
+    """
+    Run full pipeline (stages 01-19) and return packaged dataset.
+    """
+    # Run pipeline with syndatadef
+    # Return ZIP with images, GTs, statistics
+```
+
+**Request:**
+```python
+class ExportDatasetRequest(BaseModel):
+    syndatadef_config: dict              # Full SynDatasetDefinition
+    output_format: str = "huggingface"   # "huggingface", "coco", "custom"
+```
+
+**Response:**
+- ZIP archive with full dataset
+- Includes `dataset_log.json` from stage 18
+
+---
+
+### Example Usage
+
+#### **Python Client (api/example_usage.py)**
+
+```python
+import requests
+import base64
+from pathlib import Path
+
+# API endpoint
+API_URL = "http://localhost:8000/generate"
+
+# Prepare request
+request_data = {
+    "seed_images": [
+        "https://example.com/invoice_seed.jpg",
+        "https://example.com/receipt_seed.jpg"
+    ],
+    "prompt_params": {
+        "language": "English",
+        "doc_type": "invoices and receipts",
+        "gt_type": "Multiple questions and their answers",
+        "gt_format": '{"question": "answer", ...}',
+        "num_solutions": 3
+    }
+}
+
+# Call API
+response = requests.post(API_URL, json=request_data)
+result = response.json()
+
+# Process results
+if result["success"]:
+    for doc in result["documents"]:
+        doc_id = doc["document_id"]
+        
+        # Save PDF
+        pdf_data = base64.b64decode(doc["pdf_base64"])
+        Path(f"{doc_id}.pdf").write_bytes(pdf_data)
+        
+        # Save GT
+        Path(f"{doc_id}_gt.json").write_text(
+            json.dumps(doc["ground_truth"], indent=2)
+        )
+        
+        # Save HTML
+        Path(f"{doc_id}.html").write_text(doc["html"])
+        
+        print(f"Saved: {doc_id}")
+        print(f"  - BBoxes: {len(doc['bboxes'])}")
+        print(f"  - GT Annotations: {len(doc['ground_truth'])}")
+```
+
+#### **cURL Example**
+
+```bash
+curl -X POST http://localhost:8000/generate \
+  -H "Content-Type: application/json" \
+  -d '{
+    "seed_images": [
+      "https://example.com/seed.jpg"
+    ],
+    "prompt_params": {
+      "language": "English",
+      "doc_type": "business documents",
+      "gt_type": "Questions and answers",
+      "gt_format": "{\"question\": \"answer\"}",
+      "num_solutions": 1
+    }
+  }'
+```
+
+#### **JavaScript/TypeScript Client**
+
+```typescript
+const response = await fetch('http://localhost:8000/generate', {
+  method: 'POST',
+  headers: { 'Content-Type': 'application/json' },
+  body: JSON.stringify({
+    seed_images: ['https://example.com/seed.jpg'],
+    prompt_params: {
+      language: 'English',
+      doc_type: 'invoices',
+      gt_type: 'Questions and answers',
+      gt_format: '{"question": "answer"}',
+      num_solutions: 2
+    }
+  })
+});
+
+const result = await response.json();
+
+// Decode PDF
+const pdfBlob = new Blob(
+  [Uint8Array.from(atob(result.documents[0].pdf_base64), c => c.charCodeAt(0))],
+  { type: 'application/pdf' }
+);
+
+// Download PDF
+const url = URL.createObjectURL(pdfBlob);
+const a = document.createElement('a');
+a.href = url;
+a.download = `${result.documents[0].document_id}.pdf`;
+a.click();
+```
+
+---
+
+### Testing
+
+**Test File:** `api/test_api.py`
+
+```python
+import pytest
+from fastapi.testclient import TestClient
+from api.main import app
+
+client = TestClient(app)
+
+def test_health_check():
+    response = client.get("/health")
+    assert response.status_code == 200
+    assert response.json()["status"] == "healthy"
+
+def test_generate_documents():
+    request_data = {
+        "seed_images": ["https://example.com/seed.jpg"],
+        "prompt_params": {
+            "language": "English",
+            "doc_type": "invoices",
+            "gt_type": "Questions",
+            "gt_format": "{\"q\": \"a\"}",
+            "num_solutions": 1
+        }
+    }
+    response = client.post("/generate", json=request_data)
+    assert response.status_code == 200
+    result = response.json()
+    assert result["success"] is True
+    assert len(result["documents"]) > 0
+
+def test_invalid_request():
+    request_data = {
+        "seed_images": [],  # Invalid: empty
+        "prompt_params": {"num_solutions": 10}  # Invalid: > 5
+    }
+    response = client.post("/generate", json=request_data)
+    assert response.status_code == 422  # Validation error
+```
+
+**Run Tests:**
+```bash
+cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie
+pytest api/test_api.py -v
+```
+
+---
+
+### Deployment
+
+**Start Server:**
+```bash
+cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie/api
+chmod +x start.sh
+./start.sh
+```
+
+**start.sh Contents:**
+```bash
+#!/bin/bash
+export $(cat .env | xargs)
+uvicorn main:app --host 0.0.0.0 --port 8000 --reload
+```
+
+**Production Deployment:**
+```bash
+# With Gunicorn (multi-worker)
+gunicorn main:app \
+  --workers 4 \
+  --worker-class uvicorn.workers.UvicornWorker \
+  --bind 0.0.0.0:8000 \
+  --timeout 300
+```
+
+**Docker Deployment:**
+```dockerfile
+FROM python:3.10-slim
+
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Install Playwright browsers
+RUN playwright install chromium
+RUN playwright install-deps
+
+COPY . .
+
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+
+---
+
+## Core Models & Utilities
+
+### SynDatasetDefinition Model
+
+**File:** `docgenie/generation/models/_syndatadef.py`
+
+**Purpose:** Configuration object for synthetic dataset generation, loaded from YAML files.
+
+**Key Attributes:**
+```python
+@dataclass
+class SynDatasetDefinition:
+    # Dataset metadata
+    name: str                           # "docvqa_alpha=1.0"
+    task: TaskType                      # TaskType.QA, .KIE, .DLA, .CLS
+    base_dataset: str                   # "docvqa", "cord", etc.
+    
+    # LLM configuration
+    llm_model: str                      # "claude-sonnet-4-20250514"
+    prompt_template: str                # "DocGenie"
+    num_solutions: int                  # Documents per prompt
+    
+    # Prompting parameters
+    language: str                       # "English", "German", etc.
+    doc_type: str                       # "business and administrative"
+    gt_type: str                        # Task-specific GT description
+    gt_format: str                      # Expected GT format
+    
+    # Dataset parameters
+    num_samples: int                    # Total documents to generate
+    alpha: float                        # Clustering diversity parameter
+    
+    # Seed selection
+    seeds_per_cluster: int              # Seeds sampled per cluster
+    clustering_method: str              # "kmeans", "agglomerative"
+    
+    # Task-specific
+    valid_labels: Optional[List[str]]   # For DLA: ["title", "text", ...]
+    
+    # Output paths
+    output_dir: Path                    # Root output directory
+```
+
+**Key Methods:**
+```python
+def get_file_structure(self) -> FileStructure:
+    """Returns FileStructure manager for output directories."""
+
+def build_prompt(self, seed_images: List[str]) -> str:
+    """Builds prompt from template with parameter substitution."""
+
+def iter_document_logs(self) -> Iterator[DocumentLog]:
+    """Iterates over all document logs."""
+
+def update_document_status(self, doc_id: str, status: Status):
+    """Updates document status in log."""
+
+@classmethod
+def from_yaml(cls, yaml_path: Path) -> "SynDatasetDefinition":
+    """Load configuration from YAML file."""
+```
+
+**Example YAML:**
+```yaml
+name: "docvqa_alpha=1.0"
+task: "qa"
+base_dataset: "docvqa"
+
+llm_model: "claude-sonnet-4-20250514"
+prompt_template: "DocGenie"
+num_solutions: 1
+
+language: "English"
+doc_type: "business and administrative documents"
+gt_type: "Multiple questions and their answers"
+gt_format: '{"question": "answer", ...}'
+
+num_samples: 1000
+alpha: 1.0
+seeds_per_cluster: 10
+clustering_method: "kmeans"
+
+output_dir: "data/datasets/synthesized_datasets/docvqa_alpha=1.0"
+```
+
+---
+
+### PipelineParameters Model
+
+**File:** `docgenie/generation/models/_pipeline.py`
+
+**Purpose:** Runtime parameters for pipeline execution.
+
+**Attributes:**
+```python
+@dataclass
+class PipelineParameters:
+    # Execution control
+    start_stage: int = 1                # First stage to execute
+    end_stage: int = 19                 # Last stage to execute
+    skip_existing: bool = True          # Skip documents with existing outputs
+    
+    # Parallelization
+    max_workers: int = 10               # Concurrent processing
+    chromium_concurrency: int = 10      # Parallel PDF renders
+    
+    # Debug mode
+    debug_mode: bool = False            # Enable debug visualizations
+    
+    # Retry configuration
+    max_retries: int = 3                # Retry attempts
+    retry_delay: float = 2.0            # Seconds between retries
+    
+    # Timeouts
+    pdf_render_timeout: int = 60        # Seconds
+    ocr_timeout: int = 30               # Seconds
+```
+
+---
+
+### FileStructure Model
+
+**File:** `docgenie/generation/models/_file.py`
+
+**Purpose:** Manages directory structure for generated data.
+
+**Key Properties:**
+```python
+@dataclass
+class FileStructure:
+    root: Path                          # Root output directory
+    
+    # Directory properties (all return Path)
+    seeds_directory: Path
+    prompt_batches_directory: Path
+    message_results_directory: Path
+    raw_html_directory: Path
+    raw_annotations_directory: Path
+    geometries_directory: Path
+    pdf_initial_directory: Path
+    render_html_directory: Path
+    pdf_word_bboxes_directory: Path
+    pdf_char_bboxes_directory: Path
+    layout_element_definitions_directory: Path
+    handwriting_definitions_directory: Path
+    visual_element_definitions_directory: Path
+    handwriting_images_directory: Path
+    visual_element_images_directory: Path
+    pdf_without_handwriting_placeholder_directory: Path
+    pdf_with_handwriting_directory: Path
+    pdf_final_directory: Path
+    images_directory: Path
+    final_word_bboxes_directory: Path
+    final_segment_bboxes_directory: Path
+    normalized_word_bboxes_directory: Path
+    normalized_segment_bboxes_directory: Path
+    verified_gt_directory: Path
+    
+    # Debug subdirectories
+    debug_directory: Path
+    debug_pdf_bboxes_directory: Path
+    debug_visual_element_bboxes_directory: Path
+    debug_handwriting_insertion_directory: Path
+    debug_final_bboxes_on_images_directory: Path
+    debug_html_with_debug_directory: Path
+```
+
+**Key Methods:**
+```python
+def create_all_directories(self):
+    """Create all output directories."""
+
+def get_document_path(self, doc_id: str, stage: str) -> Path:
+    """Get path for specific document and stage."""
+```
+
+---
+
+### DocumentLog Model
+
+**File:** `docgenie/generation/models/_log.py`
+
+**Purpose:** Document-level metadata and status tracking.
+
+**Attributes:**
+```python
+@dataclass
+class DocumentLog:
+    doc_id: str                         # Unique document ID
+    seed_image_id: str                  # Source seed image
+    prompt_call_id: str                 # Prompt batch/call ID
+    
+    status: Status                      # VALID, INVALID, PROCESSING
+    
+    # Stage completion flags
+    has_raw_html: bool = False
+    has_raw_gt: bool = False
+    has_pdf_initial: bool = False
+    has_geometries: bool = False
+    has_bboxes: bool = False
+    has_handwriting: bool = False
+    has_visual_elements: bool = False
+    has_final_image: bool = False
+    has_ocr_result: bool = False
+    has_verified_gt: bool = False
+    
+    # Statistics
+    num_words: int = 0
+    num_annotations: int = 0
+    num_handwriting_regions: int = 0
+    num_visual_elements: int = 0
+    
+    # Error tracking
+    error_stage: Optional[str] = None
+    error_message: Optional[str] = None
+    error_category: Optional[str] = None
+    
+    # Timestamps
+    created_at: datetime
+    updated_at: datetime
+```
+
+**Key Methods:**
+```python
+def mark_stage_complete(self, stage: str):
+    """Mark pipeline stage as complete."""
+
+def mark_error(self, stage: str, error_msg: str, category: str):
+    """Record error and mark document as invalid."""
+
+def is_valid(self) -> bool:
+    """Check if document passed all stages."""
+```
+
+---
+
+### BBox Model
+
+**File:** `docgenie/generation/models/_bbox.py`
+
+**Purpose:** Bounding box representation with text content.
+
+**Attributes:**
+```python
+@dataclass
+class BBox:
+    rect: Rect                          # {x0, y0, x1, y1}
+    text: str                           # Text content
+    metadata: Optional[dict] = None     # Additional data
+```
+
+**Key Methods:**
+```python
+@property
+def width(self) -> float:
+    return self.rect["x1"] - self.rect["x0"]
+
+@property
+def height(self) -> float:
+    return self.rect["y1"] - self.rect["y0"]
+
+def normalize(self, image_width: float, image_height: float) -> "BBox":
+    """Convert to normalized [0, 1] coordinates."""
+
+def unnormalize(self, image_width: float, image_height: float) -> "BBox":
+    """Convert from normalized to pixel coordinates."""
+
+def to_dict(self) -> dict:
+    """Serialize to dictionary."""
+
+@classmethod
+def from_dict(cls, data: dict) -> "BBox":
+    """Deserialize from dictionary."""
+```
+
+---
+
+### LayoutBBox Model
+
+**File:** `docgenie/generation/models/_bbox.py`
+
+**Purpose:** Layout element bounding box (for DLA tasks).
+
+**Attributes:**
+```python
+@dataclass
+class LayoutBBox:
+    label: str                          # "title", "text", "table", etc.
+    rect: Rect                          # {x0, y0, x1, y1}
+    metadata: Optional[dict] = None
+```
+
+**Key Methods:**
+```python
+def normalize(self, image_width: float, image_height: float) -> "LayoutBBox":
+    """Normalize coordinates."""
+
+def contains(self, other: "LayoutBBox") -> bool:
+    """Check if this bbox fully contains another."""
+
+def overlaps(self, other: "LayoutBBox") -> bool:
+    """Check if this bbox overlaps with another."""
+
+def overlap_area(self, other: "LayoutBBox") -> float:
+    """Calculate overlap area with another bbox."""
+```
+
+---
+
+### Utility Modules
+
+#### **BBox Utilities (utils/bboxes.py)**
+
+```python
+def load_bboxes_from_file(file_path: Path) -> List[BBox]:
+    """Load bboxes from JSON file."""
+
+def save_bboxes_to_file(bboxes: List[BBox], file_path: Path):
+    """Save bboxes to JSON file."""
+
+def visualize_bboxes_on_pdf(
+    pdf_path: Path,
+    bboxes: List[BBox],
+    output_path: Path,
+    color: str = "red"
+):
+    """Draw bbox overlays on PDF."""
+
+def visualize_bboxes_on_image(
+    image_path: Path,
+    bboxes: List[BBox],
+    output_path: Path,
+    color: str = "orange"
+):
+    """Draw bbox overlays on image."""
+
+def check_bbox_containment(bbox1: BBox, bbox2: BBox) -> bool:
+    """Check if bbox1 contains bbox2."""
+```
+
+---
+
+#### **Geometry Utilities (utils/geos.py)**
+
+```python
+def filter_layout_elements(geometries: dict) -> List[dict]:
+    """Extract elements with data-label attribute."""
+
+def filter_handwriting_elements(geometries: dict) -> List[dict]:
+    """Extract elements with data-handwriting attribute."""
+
+def filter_visual_elements(geometries: dict) -> List[dict]:
+    """Extract elements with data-visual-element attribute."""
+
+def filter_by_css_class(geometries: dict, class_name: str) -> List[dict]:
+    """Extract elements with specific CSS class."""
+```
+
+---
+
+#### **Serialization Utilities (utils/serialization.py)**
+
+```python
+def encode_image_to_base64(image_path: Path) -> str:
+    """Encode image file to base64 string."""
+
+def decode_base64_to_image(base64_str: str, output_path: Path):
+    """Decode base64 string and save as image."""
+
+def serialize_dataclass(obj: Any) -> dict:
+    """Serialize dataclass to dictionary."""
+
+def deserialize_dataclass(data: dict, cls: Type[T]) -> T:
+    """Deserialize dictionary to dataclass."""
+```
+
+---
+
+## Configuration & Constants
+
+### Key Constants (generation/constants.py)
+
+```python
+# Document processing
+HTML_PARSER = "html.parser"           # BeautifulSoup parser
+PDF_POINT_SCALING = 72 / 96           # CSS DPI to PDF DPI
+
+# Bounding boxes
+BBOX_OVERLAP_THRESHOLD = 0.05         # 5% overlap tolerance
+SPATIAL_MATCH_THRESHOLD = 10.0        # Pixel tolerance for matching
+
+# PDF rendering
+CHROMIUM_CONCURRENCY = 10             # Parallel renders
+PER_PDF_RENDER_TIMEOUT = 60           # Seconds
+PER_PDF_RENDER_MAX_RETRIES = 3        # Retry attempts
+
+# Handwriting
+MAX_HANDWRITING_CHARS = 7             # Max chars per diffusion generation
+HANDWRITING_HEIGHT_PX = 40            # Image height
+HANDWRITING_PADDING_PX = 0            # Horizontal padding
+DIFFUSION_NUM_INFERENCE_STEPS = 50    # Generation quality
+HANDWRITING_IMAGE_UPSCALE_FACTOR = 3  # Insertion scaling
+MAX_HANDWRITING_RAND_X = 2            # Random X offset (pixels)
+MAX_HANDWRITING_RAND_Y = 1            # Random Y offset (pixels)
+
+# Visual elements
+VISUAL_ELEMENT_UPSCALE_FACTOR = 3     # Insertion scaling
+STAMP_BORDER_WIDTH = 2                # Stamp border thickness
+BARCODE_DPI = 300                     # Barcode image quality
+
+# OCR
+IMAGE_DPI = 200                       # Final image DPI
+OCR_CONFIDENCE_THRESHOLD = 0.8        # Min confidence
+
+# Ground truth
+FUZZY_MATCH_THRESHOLD = 0.85          # Levenshtein similarity cutoff
+
+# Handwriting styles
+HANDWRITING_STYLES = [
+    "writer_0", "writer_1", "writer_2", ..., "writer_99"
+]
+
+# Visual element types
+VISUAL_ELEMENT_TYPES = [
+    "stamp", "logo", "barcode", "chart", "photo"
+]
+
+# Visual element type mapping (LLM output → standard)
+VISUAL_ELEMENT_TYPE_MAPPING = {
+    "stamp": "stamp",
+    "company_stamp": "stamp",
+    "approval_stamp": "stamp",
+    "logo": "logo",
+    "company_logo": "logo",
+    "brand_logo": "logo",
+    "barcode": "barcode",
+    "code128": "barcode",
+    "chart": "chart",
+    "graph": "chart",
+    "figure": "chart",
+    "photo": "photo",
+    "image": "photo",
+    "picture": "photo"
+}
+```
+
+---
+
+### Environment Variables
+
+**Required:**
+```bash
+ANTHROPIC_API_KEY=sk-ant-...          # Claude API key
+MICROSOFT_AZURE_OCR_KEY=...           # Azure OCR key
+MICROSOFT_AZURE_OCR_ENDPOINT=...      # Azure OCR endpoint
+```
+
+**Optional:**
+```bash
+LLM_MODEL=claude-sonnet-4-20250514    # Override model
+DEBUG_MODE=false                      # Enable debug output
+LOG_LEVEL=INFO                        # Logging verbosity
+CHROMIUM_CONCURRENCY=10               # Override concurrency
+```
+
+---
+
+## Error Handling & Debugging
+
+### Error Categories
+
+**Comprehensive error tracking in stage 18:**
+
+| Category | Description | Resolution |
+|----------|-------------|------------|
+| `multipage_pdf` | PDF rendered with >1 page | Check HTML content size, CSS page breaks |
+| `missing_ocr_result` | OCR API failed or empty | Check Azure credentials, retry |
+| `failed_gt_verification` | GT text not found in OCR | Review fuzzy match threshold, inspect HTML |
+| `rendering_timeout` | PDF render exceeded timeout | Increase timeout, simplify HTML |
+| `llm_parsing_error` | Failed to extract HTML/GT | Review LLM response format, update regex |
+| `bbox_extraction_failed` | PyMuPDF extraction error | Check PDF validity, inspect fonts |
+| `handwriting_generation_failed` | Diffusion model error | Check model checkpoint, GPU availability |
+| `visual_element_generation_failed` | VE creation error | Check prefab directories, image validity |
+
+---
+
+### Debug Visualizations (Stage 19)
+
+**Generated debug outputs:**
+
+1. **PDF BBoxes:** Verify PyMuPDF extraction quality
+2. **Visual Element BBoxes:** Verify VE extraction and positioning
+3. **Handwriting Insertion:** Verify handwriting placement
+4. **Final BBoxes on Images:** Verify OCR accuracy
+
+**Enable debug mode:**
+```python
+# In pipeline execution
+pipeline_params = PipelineParameters(
+    debug_mode=True,
+    # ... other params
+)
+```
+
+---
+
+### Logging
+
+**Log locations:**
+```
+data/datasets/synthesized_datasets/<dataset_name>/logs/
+├── pipeline_01/
+│   └── seed_selection.log
+├── pipeline_02/
+│   └── prompting.log
+├── pipeline_03/
+│   └── message_processing/
+│       ├── batch_001.log
+│       └── batch_002.log
+├── ...
+└── pipeline_19/
+    └── debug_generation.log
+```
+
+**Log format:**
+```
+[2026-02-07 14:32:15] [INFO] [pipeline_04] Starting PDF rendering for doc_0001
+[2026-02-07 14:32:18] [INFO] [pipeline_04] Successfully rendered doc_0001 (3.2s)
+[2026-02-07 14:32:18] [ERROR] [pipeline_04] Rendering failed for doc_0002: Timeout
+```
+
+---
+
+### Common Issues & Solutions
+
+**Issue: Multi-page PDFs**
+- **Cause:** HTML content exceeds page size
+- **Solution:** Reduce content, check for long tables, use CSS `overflow: hidden`
+
+**Issue: Handwriting not appearing**
+- **Cause:** Character bboxes not available (stage 05)
+- **Solution:** Use simpler fonts in HTML, ensure PyMuPDF can extract chars
+
+**Issue: OCR missing text**
+- **Cause:** Low image DPI, poor contrast
+- **Solution:** Increase `IMAGE_DPI` (stage 14), adjust HTML styling
+
+**Issue: GT verification failures**
+- **Cause:** OCR discrepancies, fuzzy match threshold too high
+- **Solution:** Lower `FUZZY_MATCH_THRESHOLD`, improve HTML text rendering
+
+**Issue: Claude API timeout**
+- **Cause:** Large seed images, complex prompts
+- **Solution:** Compress seed images, simplify prompt template
+
+---
+
+## Usage Examples
+
+### Full Pipeline Execution
+
+```python
+from docgenie.generation import pipeline_01_select_seeds
+from docgenie.generation.models import SynDatasetDefinition, PipelineParameters
+
+# Load configuration
+syndatadef = SynDatasetDefinition.from_yaml(
+    "data/syn_dataset_definitions/docvqa_alpha=1.0.yaml"
+)
+
+# Configure pipeline
+params = PipelineParameters(
+    start_stage=1,
+    end_stage=19,
+    skip_existing=True,
+    debug_mode=False,
+    max_workers=10
+)
+
+# Execute pipeline stages
+for stage in range(params.start_stage, params.end_stage + 1):
+    print(f"Executing stage {stage:02d}...")
+    
+    # Import and run stage module
+    stage_module = importlib.import_module(
+        f"docgenie.generation.pipeline_{stage:02d}_*"
+    )
+    stage_module.main(syndatadef, params)
+    
+    print(f"Stage {stage:02d} complete.")
+
+print("Pipeline execution complete!")
+```
+
+---
+
+### API Usage
+
+**See API Implementation section for detailed examples.**
+
+---
+
+### Custom Dataset Definition
+
+```yaml
+# data/syn_dataset_definitions/custom_invoices.yaml
+
+name: "custom_invoices_v1"
+task: "kie"
+base_dataset: "cord"
+
+llm_model: "claude-sonnet-4-20250514"
+prompt_template: "ClaudeRefined12"
+num_solutions: 1
+
+language: "English"
+doc_type: "invoices and receipts"
+gt_type: "Key-value pairs (entity extraction)"
+gt_format: '{"entity_name": "entity_value", ...}'
+
+num_samples: 500
+alpha: 0.75
+seeds_per_cluster: 5
+clustering_method: "kmeans"
+
+# KIE-specific
+valid_labels: null  # Not used for KIE
+
+output_dir: "data/datasets/synthesized_datasets/custom_invoices_v1"
+```
+
+---
+
+## Conclusion
+
+This documentation provides a comprehensive overview of the DocGenie generation pipeline and API. For additional support or questions, refer to:
+
+- **Pipeline Integration Guide:** `api/PIPELINE_INTEGRATION.md`
+- **API README:** `api/README.md`
+- **Source Code:** `docgenie/generation/` and `api/`
+
+**Key Takeaways:**
+
+1. **19-Stage Pipeline:** Modular design from seed selection to GT verification
+2. **Multi-Task Support:** QA, KIE, DLA, CLS with task-specific handling
+3. **Realistic Documents:** LLM-generated content, diffusion handwriting, visual elements
+4. **Quality Assurance:** Comprehensive validation, OCR verification, error tracking
+5. **API Integration:** FastAPI service for synchronous document generation (stages 01-06)
+6. **Extensibility:** Modular code, clear interfaces, easy to extend
+
+**Pipeline Strengths:**
+
+- Task-agnostic core with task-specific adapters
+- Extensive logging and error tracking
+- Parallel processing where applicable
+- Debug visualizations at every stage
+- Integration with state-of-the-art models
+
+**Future Directions:**
+
+- Full API integration (stages 07-19)
+- Additional document types (forms, legal, medical)
+- Multi-language support expansion
+- Enhanced visual element generation (charts, diagrams)
+- Real-time generation optimization