	# Image Preprocessing for Historical Document OCR

	This document describes the preprocessing options for improving OCR quality on historical documents: deskewing, thresholding, and morphological operations.

	## Overview

	The preprocessing pipeline offers several options to enhance image quality before OCR processing:

	1. Deskewing: Automatically detects and corrects document skew using multiple detection algorithms
	2. Thresholding: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
	3. Morphological Operations: Cleans up binary images by removing noise or filling in gaps
	4. Document-Type Specific Settings: Customized preprocessing configurations for different document types

	## Configuration

	Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.

	### Deskewing

	```python
	"deskew": {
	    "enabled": True,  # True or False: whether to apply deskewing
	    "angle_threshold": 0.1,  # Minimum angle (degrees) to trigger deskewing
	    "max_angle": 45.0,  # Maximum correction angle (degrees)
	    "use_hough": True,  # True or False: use Hough transform in addition to minAreaRect
	    "consensus_method": "average",  # How to combine angle estimates
	    "fallback": {"enabled": True}  # True or False: fall back to the original image if deskewing fails
	}
	```

	Deskewing estimates the skew angle with up to two methods (the Hough transform is used only when `use_hough` is enabled):
	- minAreaRect: Finds contours in the binary image and calculates their orientation
	- Hough Transform: Detects lines in the image and their angles

	The `consensus_method` can be:
	- `"average"`: Average of all detected angles (most stable)
	- `"median"`: Median of all angles (robust to outliers)
	- `"min"`: Minimum absolute angle (most conservative)
	- `"max"`: Maximum absolute angle (most aggressive)
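The consensus step can be illustrated with a minimal sketch. The function names `combine_angles` and `correction_angle` are hypothetical and not taken from the project. The sketch also assumes that an angle above `max_angle` is skipped rather than clamped, and that `"min"` and `"max"` keep the sign of the selected angle; the actual implementation may behave differently.

```python
import statistics


def combine_angles(angles, method="average"):
    """Combine per-method skew estimates (in degrees) into one angle."""
    if not angles:
        return 0.0
    if method == "average":
        return statistics.fmean(angles)
    if method == "median":
        return statistics.median(angles)
    if method == "min":
        # Smallest magnitude, sign preserved (assumption).
        return min(angles, key=abs)
    if method == "max":
        # Largest magnitude, sign preserved (assumption).
        return max(angles, key=abs)
    raise ValueError(f"unknown consensus_method: {method!r}")


def correction_angle(angles, method="average", angle_threshold=0.1, max_angle=45.0):
    """Return the angle to correct, or 0.0 when no rotation should be applied."""
    angle = combine_angles(angles, method)
    # Assumption: angles outside [angle_threshold, max_angle] are left uncorrected.
    if abs(angle) < angle_threshold or abs(angle) > max_angle:
        return 0.0
    return angle
```

For example, `correction_angle([1.0, 3.0])` averages the two estimates, while `correction_angle([0.05])` falls below the default `angle_threshold` and results in no rotation.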

	### Thresholding

	```python
	"thresholding": {
	    "method": "adaptive",  # "none", "otsu", or "adaptive"
	    "adaptive_block_size": 11,  # Block size for adaptive thresholding (must be odd)
	    "adaptive_constant": 2,  # Constant subtracted from mean
	    "otsu_gaussian_blur": 1,  # Blur kernel size for Otsu pre-processing
	    "preblur": {
	        "enabled": True,  # True or False: whether to apply pre-blur
	        "method": "gaussian",  # "gaussian" or "median"
	        "kernel_size": 3  # Blur kernel size (must be odd)
	    },
	    "fallback": {"enabled": True}  # True or False: fall back to grayscale if thresholding fails
	}
	```

	Thresholding methods:
	- Otsu: Automatically determines optimal global threshold (best for high-contrast documents)
	- Adaptive: Calculates thresholds for different regions (better for uneven lighting, historical documents)
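To make the difference between the two methods concrete, here is a pure-Python sketch that operates on a grayscale image stored as a list of rows of 0-255 integers. It is an illustration only: the function names are hypothetical, the window is simply clipped at the image border (a simplification), and the project's own code may rely on an image library instead.

```python
def otsu_threshold(image):
    """Return the global threshold that maximizes between-class variance."""
    hist = [0] * 256
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue  # no pixels at or below t yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, threshold):
    """Pixels strictly above the threshold become 255, all others 0."""
    return [[255 if v > threshold else 0 for v in row] for row in image]


def adaptive_mean_threshold(image, block_size=11, constant=2):
    """Threshold each pixel against its neighbourhood mean minus a constant."""
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")
    height, width = len(image), len(image[0])
    half = block_size // 2
    output = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Neighbourhood clipped to the image bounds (simplification).
            window = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_threshold = sum(window) / len(window) - constant
            out_row.append(255 if image[y][x] > local_threshold else 0)
        output.append(out_row)
    return output
```

Otsu picks one threshold for the whole page, so a shadow or stain shifts the result everywhere. The adaptive variant recomputes the threshold per pixel, which is why larger `adaptive_block_size` values smooth over local variation and larger `adaptive_constant` values push more pixels to white.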

	### Morphological Operations

	```python
	"morphology": {
	    "enabled": True,  # True or False: whether to apply morphological operations
	    "operation": "close",  # "open", "close", or "both"
	    "kernel_size": 1,  # Size of the structuring element
	    "kernel_shape": "rect"  # "rect", "ellipse", or "cross"
	}
	```

	Morphological operations:
	- Open: Erosion followed by dilation; removes small noise and disconnects thin connections
	- Close: Dilation followed by erosion; fills small holes and connects broken elements
	- Both: Applies opening followed by closing
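The three operations can be sketched in pure Python for a binary image (rows of 0 or 255) and a square (`"rect"`) structuring element. The helper names are hypothetical, the sketch assumes an odd `size`, and it clips the window at the image border; it is meant to show the order of erosion and dilation, not to mirror the project's code.

```python
def _slide(image, size, reduce_window):
    """Apply reduce_window (min or max) to each size x size neighbourhood."""
    height, width = len(image), len(image[0])
    half = size // 2
    return [
        [
            reduce_window(
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def erode(image, size):
    """A pixel stays white only if its whole neighbourhood is white."""
    return _slide(image, size, min)


def dilate(image, size):
    """A pixel becomes white if any pixel in its neighbourhood is white."""
    return _slide(image, size, max)


def morph(image, operation="close", size=3):
    """Apply "open", "close", or "both" (opening followed by closing)."""
    if operation == "open":
        return dilate(erode(image, size), size)
    if operation == "close":
        return erode(dilate(image, size), size)
    if operation == "both":
        return morph(morph(image, "open", size), "close", size)
    raise ValueError(f"unknown operation: {operation!r}")
```

In this sketch a `size` of 1 covers a single pixel, so every operation returns the image unchanged; the way the project interprets `kernel_size` may differ.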

	### Document Type Configurations

	The system includes optimized settings for different document types:

	```python
	"document_types": {
	    "standard": {
	        # Default settings - will use the global settings
	    },
	    "newspaper": {
	        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
	        "thresholding": {
	            "method": "adaptive",
	            "adaptive_block_size": 15,
	            "adaptive_constant": 3,
	            "preblur": {"method": "gaussian", "kernel_size": 3}
	        },
	        "morphology": {"operation": "close", "kernel_size": 1}
	    },
	    "handwritten": {
	        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
	        "thresholding": {
	            "method": "adaptive",
	            "adaptive_block_size": 31,
	            "adaptive_constant": 5,
	            "preblur": {"method": "median", "kernel_size": 3}
	        },
	        "morphology": {"operation": "open", "kernel_size": 1}
	    },
	    "book": {
	        "deskew": {"enabled": True},
	        "thresholding": {
	            "method": "otsu",
	            "preblur": {"method": "gaussian", "kernel_size": 5}
	        },
	        "morphology": {"operation": "both", "kernel_size": 1}
	    }
	}
	```
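Because the per-type entries list only some keys, they presumably override the global settings rather than replace them. How the project performs this merge is not documented here, so the following is a sketch under that assumption: a recursive merge in which nested dictionaries are combined key by key. The names `deep_merge` and `settings_for` and the sample values are illustrative only.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override applied, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def settings_for(document_type, global_settings, document_types):
    """Resolve the effective settings for one document type."""
    # Unknown types fall back to the global settings (assumption).
    overrides = document_types.get(document_type, {})
    return deep_merge(global_settings, overrides)


# Illustrative values only; the real defaults live in config.py.
GLOBAL_SETTINGS = {
    "deskew": {"enabled": True, "angle_threshold": 0.1, "max_angle": 45.0, "use_hough": True},
    "thresholding": {"method": "adaptive", "preblur": {"method": "gaussian", "kernel_size": 3}},
}
DOCUMENT_TYPES = {
    "standard": {},
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {"preblur": {"method": "median", "kernel_size": 3}},
    },
}

handwritten = settings_for("handwritten", GLOBAL_SETTINGS, DOCUMENT_TYPES)
```

With this approach, `handwritten["deskew"]["max_angle"]` keeps the global value because the `"handwritten"` entry does not override it, while `use_hough` and the pre-blur method take the type-specific values.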

	## Performance and Logging

	```python
	"performance": {
	    "parallel": {
	        "enabled": True,  # True or False: whether to use parallel processing
	        "max_workers": 4  # Maximum number of worker threads
	    },
	    "timeout_ms": 10000  # Timeout for preprocessing (in milliseconds)
	},

	"logging": {
	    "enabled": True,  # True or False: whether to log preprocessing metrics
	    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
	    "output_path": "logs/preprocessing_metrics.json"
	}
	```
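As a rough illustration of how these settings could fit together, the sketch below runs a preprocessing callable over several pages in a thread pool, waits at most `timeout_ms` for each result, and records two of the metrics named above. Everything here is an assumption made for illustration: the function names, the choice to report `processing_time` in seconds, and the fallback to the unprocessed page on timeout are not taken from the project.

```python
import json
import os
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def binary_nonzero_pct(binary):
    """Percentage of non-zero pixels in a binary image (list of rows)."""
    total = sum(len(row) for row in binary)
    nonzero = sum(1 for row in binary for value in row if value)
    return 100.0 * nonzero / total if total else 0.0


def preprocess_pages(pages, preprocess, max_workers=4, timeout_ms=10000):
    """Preprocess pages in parallel and collect per-page metrics."""

    def timed(page):
        start = time.perf_counter()
        binary = preprocess(page)
        elapsed = time.perf_counter() - start  # seconds (assumption)
        return binary, {
            "binary_nonzero_pct": binary_nonzero_pct(binary),
            "processing_time": elapsed,
        }

    results, metrics = [], []
    # Note: leaving the "with" block waits for all workers, including slow ones.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(timed, page) for page in pages]
        for page, future in zip(pages, futures):
            try:
                # The timeout bounds the wait; it does not stop the worker thread.
                binary, page_metrics = future.result(timeout=timeout_ms / 1000)
            except FutureTimeout:
                binary, page_metrics = page, {"timed_out": True}
            results.append(binary)
            metrics.append(page_metrics)
    return results, metrics


def write_metrics(metrics, output_path="logs/preprocessing_metrics.json"):
    """Write the collected metrics as JSON, creating the directory if needed."""
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(metrics, handle, indent=2)
```

Threads cannot be interrupted from outside, so a timeout in this sketch only stops the caller from waiting on that result; a design that must reclaim resources from a stuck page would need worker processes instead.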

	## Usage with OCR Processing

	When processing documents, simply specify the document type:

	```python
	preprocessing_options = {
	    "document_type": "newspaper",  # Use newspaper-optimized settings
	    "grayscale": True,  # Legacy option: apply grayscale conversion
	    "denoise": True,  # Legacy option: apply denoising
	    "contrast": 10,  # Legacy option: adjust contrast (0-100)
	    "rotation": 0  # Legacy option: manual rotation (degrees)
	}

	# Apply preprocessing and OCR
	result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
	```

	## Visual Examples

	### Original Document
	[A historical newspaper or document image would be shown here]

	### After Deskewing
	[The same document, with skew corrected]

	### After Thresholding
	[The document converted to binary with clear text]

	### After Morphological Operations
	[The binary image with small noise removed and/or gaps filled]

	## Troubleshooting

	### Poor Deskewing Results
	- Symptom: Document skew is not correctly detected or corrected
	- Solution: Try adjusting `angle_threshold` or `max_angle`, or set `use_hough` to `False` for handwritten documents

	### Thresholding Issues
	- Symptom: Text is lost or background noise is excessive after thresholding
	- Solution: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`

	### Performance Concerns
	- Symptom: Processing is too slow for large documents
	- Solution: Enable parallel processing, reduce image size, or disable some preprocessing steps for faster results