Image Preprocessing for Historical Document OCR

This document describes the preprocessing options available for improving OCR quality on historical documents: deskewing, thresholding, and morphological operations.

Overview

The preprocessing pipeline offers several options to enhance image quality before OCR processing:

  1. Deskewing: Automatically detects and corrects document skew using multiple detection algorithms
  2. Thresholding: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
  3. Morphological Operations: Cleans up binary images by removing noise or filling in gaps
  4. Document-Type Specific Settings: Customized preprocessing configurations for different document types

Configuration

Preprocessing options are set in config.py and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.

Deskewing

"deskew": {
    "enabled": True/False,              # Whether to apply deskewing
    "angle_threshold": 0.1,             # Minimum angle (degrees) to trigger deskewing
    "max_angle": 45.0,                  # Maximum correction angle
    "use_hough": True/False,            # Use Hough transform in addition to minAreaRect
    "consensus_method": "average",      # How to combine angle estimations
    "fallback": {"enabled": True/False} # Fall back to original if deskewing fails
}

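Deskewing estimates the skew angle with up to two methods:

  • minAreaRect: Finds contours in the binary image and measures the orientation of their minimum-area bounding rectangles
  • Hough Transform: Detects straight lines in the image and measures their angles (optional, controlled by use_hough)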
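
As a rough illustration of the line-based estimate, the sketch below turns already-detected line segments into a skew angle. It assumes segments arrive as (x1, y1, x2, y2) tuples, the form a probabilistic Hough transform typically returns. The function names and the folding of angles into [-45, 45) are illustrative assumptions, not the project's implementation.

```python
import math


def segment_angle(x1, y1, x2, y2):
    """Angle of one segment in degrees, folded into [-45, 45).

    Folding makes near-horizontal and near-vertical lines (text baselines,
    table rules) report the same small deviation. The sign depends on the
    image coordinate convention (y usually grows downward).
    """
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    return (angle + 45.0) % 90.0 - 45.0


def estimate_skew(segments):
    """Hypothetical helper: mean folded angle of all segments, 0.0 if none."""
    angles = [segment_angle(*s) for s in segments]
    return sum(angles) / len(angles) if angles else 0.0
```

A segment drawn right-to-left or top-to-bottom folds to the same value as its left-to-right counterpart, so the direction in which a detector reports endpoints does not matter here.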

The consensus_method can be:

  • "average": Mean of all detected angles (smooths out small disagreements, but is sensitive to outliers)
  • "median": Median of all angles (robust to outliers)
  • "min": Minimum absolute angle (most conservative)
  • "max": Maximum absolute angle (most aggressive)
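
The four consensus options can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, and the handling of angles outside the configured range (discarding them rather than clamping) is an assumption.

```python
import statistics


def combine_angles(angles, method="average"):
    """Combine several skew estimates (degrees) into one value."""
    if not angles:
        return 0.0
    if method == "average":
        return statistics.fmean(angles)
    if method == "median":
        return statistics.median(angles)
    if method == "min":
        return min(angles, key=abs)  # smallest absolute angle, sign kept
    if method == "max":
        return max(angles, key=abs)  # largest absolute angle, sign kept
    raise ValueError(f"unknown consensus_method: {method!r}")


def correction_angle(angles, method="average", angle_threshold=0.1, max_angle=45.0):
    """Return the angle to correct by, or 0.0 when no rotation should happen."""
    angle = combine_angles(angles, method)
    if abs(angle) < angle_threshold:
        return 0.0  # skew too small to be worth correcting
    if abs(angle) > max_angle:
        return 0.0  # assumption: an implausibly large estimate is discarded
    return angle
```

With estimates of 1.0, 1.2 and 9.0 degrees, "median" returns 1.2 while "average" is pulled toward the outlier, which is the trade-off the option list describes.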

Thresholding

"thresholding": {
    "method": "adaptive",               # "none", "otsu", or "adaptive"
    "adaptive_block_size": 11,          # Block size for adaptive thresholding (must be odd)
    "adaptive_constant": 2,             # Constant subtracted from mean
    "otsu_gaussian_blur": 1,            # Blur kernel size for Otsu pre-processing
    "preblur": {
        "enabled": True/False,          # Whether to apply pre-blur
        "method": "gaussian",           # "gaussian" or "median"
        "kernel_size": 3                # Blur kernel size (must be odd)
    },
    "fallback": {"enabled": True/False} # Fall back to grayscale if thresholding fails
}
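
Because several of these values carry constraints (odd kernel sizes, a fixed set of method names), a small validation pass can catch mistakes before any image is processed. The sketch below is an assumption about how such a check might look; check_thresholding is a hypothetical helper, and the defaults it falls back to are taken from the example values above.

```python
def check_thresholding(cfg):
    """Return a list of human-readable problems found in a thresholding config."""
    problems = []

    method = cfg.get("method", "none")
    if method not in ("none", "otsu", "adaptive"):
        problems.append(f"unknown method: {method!r}")

    if method == "adaptive":
        block = cfg.get("adaptive_block_size", 11)
        # Assumption: the block needs a centre pixel and at least one
        # neighbour on each side, so it must be odd and at least 3.
        if block % 2 == 0 or block < 3:
            problems.append(f"adaptive_block_size must be odd and >= 3, got {block}")

    preblur = cfg.get("preblur", {})
    if preblur.get("enabled", False):
        if preblur.get("method", "gaussian") not in ("gaussian", "median"):
            problems.append(f"unknown preblur method: {preblur.get('method')!r}")
        kernel = preblur.get("kernel_size", 3)
        if kernel % 2 == 0 or kernel < 1:
            problems.append(f"preblur kernel_size must be odd and positive, got {kernel}")

    return problems
```

An empty list means the configuration passed every check in this sketch.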

Thresholding methods:

  • Otsu: Automatically determines a single global threshold from the image histogram (best for evenly lit, high-contrast documents)
  • Adaptive: Calculates a separate threshold for each local region (better for uneven lighting, which is common in historical documents)
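
To make the difference concrete, here is a dependency-free sketch of both ideas on plain Python lists. It is illustrative only: a real pipeline would normally call an image library, the function names are hypothetical, and the border handling (the window is clipped at the image edge) is a simplifying assumption.

```python
def otsu_threshold(pixels):
    """Global threshold (0-255) that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray levels in 0..255.
    Pixels <= threshold form the dark class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    weighted_total = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count = dark_sum = 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, variance is undefined
        diff = dark_sum / dark_count - (weighted_total - dark_sum) / light_count
        score = dark_count * light_count * diff * diff
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def adaptive_mean_threshold(image, block_size=11, constant=2):
    """Binarize a 2D list of gray levels against each pixel's local mean.

    A pixel becomes 255 when it is brighter than (local mean - constant),
    otherwise 0. The window is clipped at the image border.
    """
    r = block_size // 2
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            local_mean = sum(window) / len(window)
            row.append(255 if image[y][x] > local_mean - constant else 0)
        out.append(row)
    return out
```

Consider the single row [100, 110, 60, 130, 140, 150, 100, 170, 180]: the background brightens from left to right, and the two ink pixels are 60 and the second 100. No global threshold can separate the background pixel 100 from the ink pixel 100, whereas adaptive_mean_threshold with block_size=3 and constant=10 marks exactly the two ink pixels as 0.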

Morphological Operations

"morphology": {
    "enabled": True/False,              # Whether to apply morphological operations
    "operation": "close",               # "open", "close", "both"
    "kernel_size": 1,                   # Size of the structuring element
    "kernel_shape": "rect"              # "rect", "ellipse", "cross"
}

Morphological operations:

  • Open: Erosion followed by dilation - removes small noise and disconnects thin connections
  • Close: Dilation followed by erosion - fills small holes and connects broken elements
  • Both: Applies opening followed by closing
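
The definitions above translate directly into code. The following sketch implements erosion and dilation with a square ("rect") structuring element on a 0/1 image stored as nested lists. It is a teaching aid under stated assumptions rather than the project's implementation: the function names are hypothetical, size is assumed to be odd, and neighbours outside the image are ignored.

```python
def _slide(img, size, combine):
    """Apply `combine` (min or max) over a size x size window at every pixel."""
    r = size // 2
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            row.append(combine(window))
        out.append(row)
    return out


def erode(img, size):
    """A pixel stays 1 only if every pixel under the window is 1."""
    return _slide(img, size, min)


def dilate(img, size):
    """A pixel becomes 1 if any pixel under the window is 1."""
    return _slide(img, size, max)


def morph_open(img, size):
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(img, size), size)


def morph_close(img, size):
    """Dilation followed by erosion: fills holes smaller than the window."""
    return erode(dilate(img, size), size)
```

With size=3, opening removes an isolated pixel and closing fills a one-pixel hole inside a solid shape. In this sketch a size of 1 leaves the image unchanged, so it is worth checking how the project maps kernel_size to the actual structuring element.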

Document Type Configurations

The system includes optimized settings for different document types:

"document_types": {
    "standard": {
        # Default settings - will use the global settings
    },
    "newspaper": {
        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
        "thresholding": {
            "method": "adaptive", 
            "adaptive_block_size": 15,
            "adaptive_constant": 3,
            "preblur": {"method": "gaussian", "kernel_size": 3}
        },
        "morphology": {"operation": "close", "kernel_size": 1}
    },
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {
            "method": "adaptive", 
            "adaptive_block_size": 31, 
            "adaptive_constant": 5,
            "preblur": {"method": "median", "kernel_size": 3}
        },
        "morphology": {"operation": "open", "kernel_size": 1}
    },
    "book": {
        "deskew": {"enabled": True},
        "thresholding": {
            "method": "otsu",
            "preblur": {"method": "gaussian", "kernel_size": 5}
        },
        "morphology": {"operation": "both", "kernel_size": 1}
    }
}
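
One plausible way to apply such per-type settings is to merge them recursively over the global defaults, so that keys a document type omits keep their global values. The source does not say whether the project merges this way or replaces whole sections, so treat the sketch below, including the names merge_settings and settings_for, as an assumption.

```python
import copy


def merge_settings(base, overrides):
    """Return a new dict: `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def settings_for(document_type, global_settings, document_types):
    """Effective settings for one document type.

    Assumption: an unknown type behaves like "standard" (global settings only).
    """
    overrides = document_types.get(document_type, {})
    return merge_settings(global_settings, overrides)
```

Under this scheme the "newspaper" entry would override angle_threshold and max_angle while still inheriting use_hough and consensus_method from the global deskew section.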

Performance and Logging

"performance": {
    "parallel": {
        "enabled": True/False,          # Whether to use parallel processing
        "max_workers": 4                # Maximum number of worker threads
    },
    "timeout_ms": 10000                 # Timeout for preprocessing (in milliseconds)
}

"logging": {
    "enabled": True/False,              # Whether to log preprocessing metrics
    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
    "output_path": "logs/preprocessing_metrics.json"
}
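
The interaction between max_workers, timeout_ms, and the logged metrics can be sketched with the standard library alone. Everything named here (preprocess_pages, binary_nonzero_pct as a function, the fallback to the unprocessed page on timeout) is a hypothetical illustration, not the project's API.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def binary_nonzero_pct(binary_image):
    """Percentage of non-zero pixels in a 2D list of pixel values."""
    pixels = [p for row in binary_image for p in row]
    return 100.0 * sum(1 for p in pixels if p) / len(pixels) if pixels else 0.0


def preprocess_pages(pages, preprocess, max_workers=4, timeout_ms=10000):
    """Run `preprocess` over pages in worker threads; return (results, metrics)."""
    results = []
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(preprocess, page) for page in pages]
        for page, future in zip(pages, futures):
            try:
                # The timeout bounds how long we wait for this result;
                # it does not stop the worker thread itself.
                result = future.result(timeout=timeout_ms / 1000)
            except FutureTimeout:
                result = page  # assumption: fall back to the unprocessed page
            results.append(result)
    metrics = {
        "binary_nonzero_pct": [binary_nonzero_pct(r) for r in results],
        "processing_time": time.perf_counter() - start,  # seconds, whole batch
    }
    return results, metrics
```

The returned metrics dictionary contains only numbers and lists, so json.dumps(metrics) produces a record that could be written to a file such as the configured output_path.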

Usage with OCR Processing

To apply a document-type profile, pass its name in the preprocessing options:

preprocessing_options = {
    "document_type": "newspaper",  # Use newspaper-optimized settings
    "grayscale": True,             # Legacy option: apply grayscale conversion
    "denoise": True,               # Legacy option: apply denoising
    "contrast": 10,                # Legacy option: adjust contrast (0-100)
    "rotation": 0                  # Legacy option: manual rotation (degrees)
}

# Apply preprocessing and OCR
result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)

Visual Examples

Original Document

[A historical newspaper or document image would be shown here]

After Deskewing

[The same document, with skew corrected]

After Thresholding

[The document converted to binary with clear text]

After Morphological Operations

[The binary image with small noise removed and/or gaps filled]

Troubleshooting

Poor Deskewing Results

  • Symptom: Document skew is not correctly detected or corrected
  • Solution: Try adjusting angle_threshold or max_angle, or disable the Hough transform (use_hough: False) for handwritten documents

Thresholding Issues

  • Symptom: Text is lost or background noise is excessive after thresholding
  • Solution: Try changing the thresholding method or adjusting adaptive_block_size and adaptive_constant

Performance Concerns

  • Symptom: Processing is too slow for large documents
  • Solution: Enable parallel processing, reduce image size, or disable some preprocessing steps for faster results