# Image Preprocessing for Historical Document OCR
This document outlines the enhanced preprocessing capabilities for improving OCR quality on historical documents, including deskewing, thresholding, and morphological operations.
## Overview
The preprocessing pipeline offers several options to enhance image quality before OCR processing:
1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types
## Configuration
Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.
### Deskewing
```python
"deskew": {
    "enabled": True,                # True or False: whether to apply deskewing
    "angle_threshold": 0.1,         # Minimum angle (degrees) to trigger deskewing
    "max_angle": 45.0,              # Maximum correction angle (degrees)
    "use_hough": True,              # True or False: use Hough transform in addition to minAreaRect
    "consensus_method": "average",  # How to combine angle estimates
    "fallback": {"enabled": True}   # True or False: fall back to original if deskewing fails
}
```
Deskewing uses two methods:
- **minAreaRect**: Finds contours in the binary image and calculates their orientation
- **Hough Transform**: Detects lines in the image and their angles
The `consensus_method` can be:
- `"average"`: Average of all detected angles (most stable)
- `"median"`: Median of all angles (robust to outliers)
- `"min"`: Minimum absolute angle (most conservative)
- `"max"`: Maximum absolute angle (most aggressive)
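The consensus options above can be illustrated with a small, self-contained sketch. It is not the project's implementation: the function names `combine_angles` and `correction_angle` are hypothetical, and the choice to skip correction (rather than clamp) when the angle exceeds `max_angle` is an assumption made for this example.

```python
import statistics


def combine_angles(angles, method="average"):
    """Combine per-detector skew estimates (in degrees) into a single angle."""
    if not angles:
        return 0.0
    if method == "average":
        return statistics.fmean(angles)
    if method == "median":
        return statistics.median(angles)
    if method == "min":
        # Estimate with the smallest magnitude (most conservative)
        return min(angles, key=abs)
    if method == "max":
        # Estimate with the largest magnitude (most aggressive)
        return max(angles, key=abs)
    raise ValueError(f"unknown consensus_method: {method!r}")


def correction_angle(angles, method="average", angle_threshold=0.1, max_angle=45.0):
    """Return the angle to correct by, or 0.0 when no rotation should be applied.

    Assumption: angles above max_angle are treated as unreliable and skipped.
    """
    angle = combine_angles(angles, method)
    if abs(angle) < angle_threshold or abs(angle) > max_angle:
        return 0.0
    return angle
```

With estimates such as `[1.0, 2.0, -6.0]`, the outlier pulls `"average"` well away from the other two values, while `"median"` ignores it. This is the trade-off to weigh when choosing a method for noisy scans.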
### Thresholding
```python
"thresholding": {
    "method": "adaptive",           # "none", "otsu", or "adaptive"
    "adaptive_block_size": 11,      # Block size for adaptive thresholding (must be odd)
    "adaptive_constant": 2,         # Constant subtracted from the local mean
    "otsu_gaussian_blur": 1,        # Blur kernel size for Otsu pre-processing
    "preblur": {
        "enabled": True,            # True or False: whether to apply pre-blur
        "method": "gaussian",       # "gaussian" or "median"
        "kernel_size": 3            # Blur kernel size (must be odd)
    },
    "fallback": {"enabled": True}   # True or False: fall back to grayscale if thresholding fails
}
```
Thresholding methods:
- **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
- **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
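To make the difference between the two methods concrete, here is a minimal pure-Python sketch of both ideas. It is for illustration only and is not the pipeline's code: the function names are hypothetical, images are plain nested lists of 8-bit gray values, and the adaptive variant uses a simple local mean with the window clipped at the image border.

```python
def otsu_threshold(pixels):
    """Return a global threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg, weight_bg = 0.0, 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def adaptive_mean_threshold(img, block_size=11, constant=2):
    """Binarize each pixel against the mean of its block_size x block_size window."""
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be odd and at least 3")
    h, w = len(img), len(img[0])
    r = block_size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Window is clipped at the border (a simplification for this sketch)
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            local_mean = sum(window) / len(window)
            row.append(255 if img[y][x] > local_mean - constant else 0)
        out.append(row)
    return out
```

A single global threshold has to serve the whole page, so when one half of a scan is much darker than the other, ink on the bright half and paper on the dark half can land on the wrong side of it. The adaptive variant compares each pixel only with its neighbourhood, which is why it tends to cope better with stains and uneven lighting.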
### Morphological Operations
```python
"morphology": {
    "enabled": True,         # True or False: whether to apply morphological operations
    "operation": "close",    # "open", "close", or "both"
    "kernel_size": 1,        # Size of the structuring element
    "kernel_shape": "rect"   # "rect", "ellipse", or "cross"
}
```
Morphological operations:
- **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
- **Close**: Dilation followed by erosion - fills small holes and connects broken elements
- **Both**: Applies opening followed by closing
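The definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch under simplifying assumptions, not the pipeline's implementation: images are nested lists of 0/1 values, only a square (`"rect"`) structuring element is modelled, the window is clipped at the image border, and the function names are hypothetical.

```python
def _apply(img, k, pick):
    """Slide a k x k window over img and reduce each window with pick (min or max)."""
    h, w = len(img), len(img[0])
    r = k // 2
    return [[pick(img[j][i]
                  for j in range(max(0, y - r), min(h, y + r + 1))
                  for i in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]


def erode(img, k=3):
    """A pixel stays set only if every pixel in its window is set."""
    return _apply(img, k, min)


def dilate(img, k=3):
    """A pixel becomes set if any pixel in its window is set."""
    return _apply(img, k, max)


def opening(img, k=3):
    """Erosion followed by dilation: removes specks smaller than the kernel."""
    return dilate(erode(img, k), k)


def closing(img, k=3):
    """Dilation followed by erosion: fills holes smaller than the kernel."""
    return erode(dilate(img, k), k)


def open_then_close(img, k=3):
    """The "both" option: opening followed by closing."""
    return closing(opening(img, k), k)
```

In this sketch a kernel size of 1 leaves the image unchanged, because each window then contains only the pixel itself. Larger kernels remove more noise but can also erase thin strokes, so increase the size gradually.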
### Document Type Configurations
The system includes optimized settings for different document types:
```python
"document_types": {
    "standard": {
        # Default settings - will use the global settings
    },
    "newspaper": {
        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 15,
            "adaptive_constant": 3,
            "preblur": {"method": "gaussian", "kernel_size": 3}
        },
        "morphology": {"operation": "close", "kernel_size": 1}
    },
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 31,
            "adaptive_constant": 5,
            "preblur": {"method": "median", "kernel_size": 3}
        },
        "morphology": {"operation": "open", "kernel_size": 1}
    },
    "book": {
        "deskew": {"enabled": True},
        "thresholding": {
            "method": "otsu",
            "preblur": {"method": "gaussian", "kernel_size": 5}
        },
        "morphology": {"operation": "both", "kernel_size": 1}
    }
}
```
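Since each document type lists only the keys that differ from the global settings, one plausible way to resolve the effective configuration is a recursive merge of the overrides onto the defaults. The sketch below assumes that behaviour; `deep_merge` and `resolve_settings` are hypothetical helper names, not functions confirmed to exist in `config.py`.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override applied key by key, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_settings(global_settings, document_types, document_type="standard"):
    """Overlay the overrides for document_type onto the global settings.

    Assumption: unknown document types fall back to the global settings.
    """
    overrides = document_types.get(document_type, {})
    return deep_merge(global_settings, overrides)
```

Under this assumption, a `"handwritten"` entry that sets only `angle_threshold` and `use_hough` would still inherit `max_angle` and `consensus_method` from the global `deskew` block, and `"standard"` would return the global settings unchanged.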
## Performance and Logging
```python
"performance": {
    "parallel": {
        "enabled": True,     # True or False: whether to use parallel processing
        "max_workers": 4     # Maximum number of worker threads
    },
    "timeout_ms": 10000      # Timeout for preprocessing (in milliseconds)
},
"logging": {
    "enabled": True,         # True or False: whether to log preprocessing metrics
    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
    "output_path": "logs/preprocessing_metrics.json"
}
```
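As a rough illustration of how `max_workers`, `timeout_ms`, and the logged metrics could fit together, the sketch below runs a stand-in preprocessing function over several pages with a thread pool. Everything here is an assumption for the sake of the example: `preprocess_page` and `preprocess_pages` are hypothetical names, the metric record layout is invented, and the real pipeline may apply its timeout differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def preprocess_page(page):
    """Stand-in for real preprocessing: returns metrics for one binary page."""
    start = time.perf_counter()
    flat = [p for row in page for p in row]
    # Share of non-zero pixels in the binarized page, as a percentage
    nonzero_pct = 100.0 * sum(1 for p in flat if p) / len(flat)
    return {
        "binary_nonzero_pct": nonzero_pct,
        "processing_time": time.perf_counter() - start,
    }


def preprocess_pages(pages, max_workers=4, timeout_ms=10000):
    """Process pages in worker threads, waiting at most timeout_ms for each result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(preprocess_page, page) for page in pages]
        # result() raises concurrent.futures.TimeoutError if a page takes too long
        return [f.result(timeout=timeout_ms / 1000) for f in futures]
```

Note that a timeout on `result()` only stops the caller from waiting; it does not interrupt the worker thread. A hard limit on runaway jobs would need a different mechanism, such as separate processes.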
## Usage with OCR Processing
When processing documents, simply specify the document type:
```python
preprocessing_options = {
    "document_type": "newspaper",  # Use newspaper-optimized settings
    "grayscale": True,             # Legacy option: apply grayscale conversion
    "denoise": True,               # Legacy option: apply denoising
    "contrast": 10,                # Legacy option: adjust contrast (0-100)
    "rotation": 0                  # Legacy option: manual rotation (degrees)
}

# Apply preprocessing and OCR
result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
```
## Visual Examples
### Original Document
*[A historical newspaper or document image would be shown here]*
### After Deskewing
*[The same document, with skew corrected]*
### After Thresholding
*[The document converted to binary with clear text]*
### After Morphological Operations
*[The binary image with small noise removed and/or gaps filled]*
## Troubleshooting
### Poor Deskewing Results
- **Symptom**: Document skew is not correctly detected or corrected
- **Solution**: Adjust `angle_threshold` or `max_angle`. For handwritten documents, try setting `use_hough` to `False` so that only minAreaRect is used
### Thresholding Issues
- **Symptom**: Text is lost or background noise is excessive after thresholding
- **Solution**: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`
### Performance Concerns
- **Symptom**: Processing is too slow for large documents
- **Solution**: Enable parallel processing (`performance.parallel.enabled`), reduce the image size, or disable preprocessing steps you do not need