# Image Preprocessing for Historical Document OCR
This document outlines the enhanced preprocessing capabilities for improving OCR quality on historical documents, including deskewing, thresholding, and morphological operations.
## Overview
The preprocessing pipeline offers several options to enhance image quality before OCR processing:
1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types
## Configuration
Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.
### Deskewing
```python
"deskew": {
    "enabled": True,                 # True or False: whether to apply deskewing
    "angle_threshold": 0.1,          # Minimum angle (degrees) to trigger deskewing
    "max_angle": 45.0,               # Maximum correction angle (degrees)
    "use_hough": True,               # True or False: use Hough transform in addition to minAreaRect
    "consensus_method": "average",   # How to combine angle estimates
    "fallback": {"enabled": True}    # True or False: fall back to the original if deskewing fails
}
```
Deskewing can draw on two angle-detection methods (the Hough transform is used only when `use_hough` is enabled):
- **minAreaRect**: Finds contours in the binary image and calculates their orientation
- **Hough Transform**: Detects lines in the image and their angles
The `consensus_method` can be:
- `"average"`: Average of all detected angles (most stable)
- `"median"`: Median of all angles (robust to outliers)
- `"min"`: Minimum absolute angle (most conservative)
- `"max"`: Maximum absolute angle (most aggressive)
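
The consensus options above can be illustrated with a minimal sketch. The function names `combine_angles` and `angle_to_apply` are hypothetical and not taken from the pipeline. The sketch also assumes that an angle below `angle_threshold` or above `max_angle` is simply left uncorrected, which the configuration comments suggest but do not state outright.

```python
import statistics


def combine_angles(angles, consensus_method="average"):
    """Combine per-method skew estimates (in degrees) into a single angle."""
    if not angles:
        return 0.0
    if consensus_method == "average":
        return statistics.fmean(angles)
    if consensus_method == "median":
        return statistics.median(angles)
    if consensus_method == "min":
        # Estimate with the smallest magnitude, sign preserved
        return min(angles, key=abs)
    if consensus_method == "max":
        # Estimate with the largest magnitude, sign preserved
        return max(angles, key=abs)
    raise ValueError(f"unknown consensus_method: {consensus_method!r}")


def angle_to_apply(angles, consensus_method="average",
                   angle_threshold=0.1, max_angle=45.0):
    """Return the correction angle, or 0.0 when deskewing should be skipped."""
    angle = combine_angles(angles, consensus_method)
    # Assumption: out-of-range angles are skipped rather than clamped
    if abs(angle) < angle_threshold or abs(angle) > max_angle:
        return 0.0
    return angle
```

With estimates of `-3.0` and `1.0` degrees, for example, `"min"` selects `1.0` and `"max"` selects `-3.0`, because both compare absolute values while keeping the sign.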
### Thresholding
```python
"thresholding": {
    "method": "adaptive",            # "none", "otsu", or "adaptive"
    "adaptive_block_size": 11,       # Block size for adaptive thresholding (must be odd)
    "adaptive_constant": 2,          # Constant subtracted from mean
    "otsu_gaussian_blur": 1,         # Blur kernel size for Otsu pre-processing
    "preblur": {
        "enabled": True,             # True or False: whether to apply pre-blur
        "method": "gaussian",        # "gaussian" or "median"
        "kernel_size": 3             # Blur kernel size (must be odd)
    },
    "fallback": {"enabled": True}    # True or False: fall back to grayscale if thresholding fails
}
```
Thresholding methods:
- **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
- **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
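
To make the difference between the two methods concrete, here is a dependency-free sketch of both ideas on plain Python lists. It is not the pipeline's implementation, which presumably relies on an image library. The function names are hypothetical, and the adaptive variant assumes a plain (unweighted) local mean with the window clipped at the image border.

```python
def otsu_threshold(pixels):
    """Return the global threshold t that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray levels in 0..255;
    pixels with a value greater than t count as foreground.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def adaptive_mean_threshold(image, block_size=11, constant=2):
    """Binarize `image` (a list of rows of gray levels) against local means.

    A pixel becomes 255 when it is brighter than the mean of its
    block_size x block_size neighborhood minus `constant`, otherwise 0.
    """
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")
    height, width = len(image), len(image[0])
    half = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Assumption: the window is clipped at the image border
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            window = [image[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            row.append(255 if image[y][x] > local_mean - constant else 0)
        result.append(row)
    return result
```

Otsu picks one threshold for the whole page, so it works best when ink and paper form two well-separated brightness groups. The adaptive variant compares each pixel only with its neighborhood, which is why it tolerates stains and uneven lighting. The `constant` keeps flat background regions from flipping to black when a pixel merely equals its local mean.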
### Morphological Operations
```python
"morphology": {
    "enabled": True,           # True or False: whether to apply morphological operations
    "operation": "close",      # "open", "close", or "both"
    "kernel_size": 1,          # Size of the structuring element
    "kernel_shape": "rect"     # "rect", "ellipse", or "cross"
}
```
Morphological operations:
- **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
- **Close**: Dilation followed by erosion - fills small holes and connects broken elements
- **Both**: Applies opening followed by closing
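
The three operations can be sketched on a small binary grid without any image library. This is an illustration of the definitions above, not the pipeline's code. It assumes a square (`"rect"`) structuring element and ignores pixels outside the image, and all function names are hypothetical.

```python
def _apply(binary, kernel_size, pick):
    """Slide a square window over `binary` and reduce each window with `pick`."""
    half = kernel_size // 2
    height, width = len(binary), len(binary[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                binary[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def erode(binary, kernel_size=3):
    # A pixel stays set only if every pixel under the kernel is set
    return _apply(binary, kernel_size, min)


def dilate(binary, kernel_size=3):
    # A pixel becomes set if any pixel under the kernel is set
    return _apply(binary, kernel_size, max)


def opening(binary, kernel_size=3):
    return dilate(erode(binary, kernel_size), kernel_size)


def closing(binary, kernel_size=3):
    return erode(dilate(binary, kernel_size), kernel_size)


def both(binary, kernel_size=3):
    # Opening followed by closing, matching the "both" option
    return closing(opening(binary, kernel_size), kernel_size)
```

In this sketch a `kernel_size` of 1 covers only the pixel itself, so every operation returns the image unchanged. How the pipeline maps its `kernel_size` setting to an actual structuring element is not specified in this document.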
### Document Type Configurations
The system includes optimized settings for different document types:
```python
"document_types": {
    "standard": {
        # Default settings - will use the global settings
    },
    "newspaper": {
        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 15,
            "adaptive_constant": 3,
            "preblur": {"method": "gaussian", "kernel_size": 3}
        },
        "morphology": {"operation": "close", "kernel_size": 1}
    },
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 31,
            "adaptive_constant": 5,
            "preblur": {"method": "median", "kernel_size": 3}
        },
        "morphology": {"operation": "open", "kernel_size": 1}
    },
    "book": {
        "deskew": {"enabled": True},
        "thresholding": {
            "method": "otsu",
            "preblur": {"method": "gaussian", "kernel_size": 5}
        },
        "morphology": {"operation": "both", "kernel_size": 1}
    }
}
```
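
Since each document type lists only the keys it changes, the effective settings are presumably the global settings with the type's overrides layered on top. The following sketch shows one way such a merge could work. `deep_merge`, `resolve_settings`, and the sample global values are assumptions made for illustration, not the contents of `config.py`.

```python
import copy


def deep_merge(base, overrides):
    """Return a copy of `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_settings(global_settings, document_types, document_type="standard"):
    """Resolve effective settings; unknown types fall back to the globals."""
    return deep_merge(global_settings, document_types.get(document_type, {}))


# Illustrative values only (hypothetical global defaults)
GLOBAL_SETTINGS = {
    "deskew": {"enabled": True, "angle_threshold": 0.1,
               "max_angle": 45.0, "use_hough": True},
    "morphology": {"enabled": False, "operation": "close", "kernel_size": 1},
}
DOCUMENT_TYPES = {
    "standard": {},
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
    },
}

settings = resolve_settings(GLOBAL_SETTINGS, DOCUMENT_TYPES, "handwritten")
```

Here the `handwritten` result keeps the global `max_angle` of `45.0` but takes `angle_threshold` and `use_hough` from the override, while `standard` resolves to an unmodified copy of the global settings.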
## Performance and Logging
```python
"performance": {
    "parallel": {
        "enabled": True,       # True or False: whether to use parallel processing
        "max_workers": 4       # Maximum number of worker threads
    },
    "timeout_ms": 10000        # Timeout for preprocessing (in milliseconds)
},
"logging": {
    "enabled": True,           # True or False: whether to log preprocessing metrics
    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
    "output_path": "logs/preprocessing_metrics.json"
}
```
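
As a rough illustration of how `max_workers` and `timeout_ms` might interact, the sketch below runs a preprocessing callable in a thread pool and falls back to the original image when a result is not ready in time. `preprocess_all` and its parameters are hypothetical, and the pipeline's actual scheduling and fallback behavior may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def preprocess_all(images, preprocess, max_workers=4, timeout_ms=10000):
    """Run `preprocess` on each image in worker threads.

    Returns a list of (result, elapsed_seconds) pairs. If a result is not
    ready within the timeout, the original image is returned with None
    as the elapsed time.
    """
    def timed(image):
        start = time.perf_counter()
        result = preprocess(image)
        return result, time.perf_counter() - start

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(timed, image) for image in images]
        for image, future in zip(images, futures):
            try:
                # The timeout counts from this call, not from task start,
                # so the per-image budget is only approximate
                results.append(future.result(timeout=timeout_ms / 1000))
            except FutureTimeout:
                results.append((image, None))
    # Leaving the `with` block waits for still-running threads to finish;
    # a timed-out thread is not interrupted
    return results
```

The elapsed time collected per image corresponds to the kind of value the `processing_time` metric would record. Because Python threads cannot be killed from outside, a hard time limit would need separate processes or a cooperative cancellation check inside the preprocessing code.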
## Usage with OCR Processing
When processing documents, simply specify the document type:
```python
preprocessing_options = {
    "document_type": "newspaper",   # Use newspaper-optimized settings
    "grayscale": True,              # Legacy option: apply grayscale conversion
    "denoise": True,                # Legacy option: apply denoising
    "contrast": 10,                 # Legacy option: adjust contrast (0-100)
    "rotation": 0                   # Legacy option: manual rotation (degrees)
}
# Apply preprocessing and OCR
result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
```
## Visual Examples
### Original Document
*[A historical newspaper or document image would be shown here]*
### After Deskewing
*[The same document, with skew corrected]*
### After Thresholding
*[The document converted to binary with clear text]*
### After Morphological Operations
*[The binary image with small noise removed and/or gaps filled]*
## Troubleshooting
### Poor Deskewing Results
- **Symptom**: Document skew is not correctly detected or corrected
- **Solution**: Try adjusting `angle_threshold` or `max_angle`, or set `use_hough` to `False` for handwritten documents
### Thresholding Issues
- **Symptom**: Text is lost or background noise is excessive after thresholding
- **Solution**: Try switching `method` between `"otsu"` and `"adaptive"`, or adjusting `adaptive_block_size` and `adaptive_constant`
### Performance Concerns
- **Symptom**: Processing is too slow for large documents
- **Solution**: Enable parallel processing (the `parallel` setting under `performance`), reduce image size, or disable some preprocessing steps for faster results