# Image Preprocessing for Historical Document OCR

This document describes the preprocessing steps available for improving OCR quality on historical documents: deskewing, thresholding, and morphological operations.

## Overview

The preprocessing pipeline offers several options to enhance image quality before OCR processing:

1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods, with an optional pre-blur
3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types

## Configuration

Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.
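A minimal sketch of how an environment override might be read. The helper `env_override` and the variable name `PREPROCESS_DESKEW_ENABLED` are hypothetical; the real names are whatever `config.py` defines.

```python
import os


def env_override(name, default, environ=os.environ):
    """Return `name` from the environment, cast to the type of `default`.

    Hypothetical helper for illustration; not taken from config.py.
    """
    raw = environ.get(name)
    if raw is None:
        return default
    # bool must be checked first: bool("false") would otherwise be True.
    if isinstance(default, bool):
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return type(default)(raw)


# Example with an explicit mapping standing in for the process environment.
fake_env = {"PREPROCESS_DESKEW_ENABLED": "false"}
deskew_enabled = env_override("PREPROCESS_DESKEW_ENABLED", True, fake_env)
```

Casting to the type of the default keeps numeric settings such as `angle_threshold` as floats rather than strings.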

### Deskewing

```python
"deskew": {
    "enabled": True/False,              # Whether to apply deskewing
    "angle_threshold": 0.1,             # Minimum angle (degrees) to trigger deskewing
    "max_angle": 45.0,                  # Maximum correction angle
    "use_hough": True/False,            # Use Hough transform in addition to minAreaRect
    "consensus_method": "average",      # How to combine angle estimations
    "fallback": {"enabled": True/False} # Fall back to original if deskewing fails
}
```

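Deskewing estimates the skew angle with up to two methods:
- **minAreaRect**: Finds contours in the binary image and fits minimum-area bounding rectangles to calculate their orientation
- **Hough Transform**: Detects lines in the image and measures their angles (used only when `use_hough` is enabled)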

The `consensus_method` can be:
- `"average"`: Average of all detected angles (most stable)
- `"median"`: Median of all angles (robust to outliers)
- `"min"`: Minimum absolute angle (most conservative)
- `"max"`: Maximum absolute angle (most aggressive)
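The consensus options above can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's actual code: the function names are hypothetical, and the gate assumes that angles below `angle_threshold` or above `max_angle` are left uncorrected.

```python
import statistics


def combine_angles(angles, method="average"):
    """Combine per-method skew estimates (degrees) into one angle."""
    if not angles:
        return 0.0
    if method == "average":
        return statistics.fmean(angles)
    if method == "median":
        return statistics.median(angles)
    if method == "min":
        return min(angles, key=abs)  # smallest magnitude, sign preserved
    if method == "max":
        return max(angles, key=abs)  # largest magnitude, sign preserved
    raise ValueError(f"unknown consensus_method: {method!r}")


def should_deskew(angle, angle_threshold=0.1, max_angle=45.0):
    """Assumed gate: rotate only when the angle lies within the configured range."""
    return angle_threshold <= abs(angle) <= max_angle
```

For example, `combine_angles([-0.5, 2.0, 3.0], "min")` picks `-0.5`, the estimate that would rotate the page the least.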

### Thresholding

```python
"thresholding": {
    "method": "adaptive",               # "none", "otsu", or "adaptive"
    "adaptive_block_size": 11,          # Block size for adaptive thresholding (must be odd)
    "adaptive_constant": 2,             # Constant subtracted from mean
    "otsu_gaussian_blur": 1,            # Blur kernel size for Otsu pre-processing
    "preblur": {
        "enabled": True/False,          # Whether to apply pre-blur
        "method": "gaussian",           # "gaussian" or "median"
        "kernel_size": 3                # Blur kernel size (must be odd)
    },
    "fallback": {"enabled": True/False} # Fall back to grayscale if thresholding fails
}
```

Thresholding methods:
- **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
- **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
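To make the difference between the two methods concrete, here is a dependency-free sketch of both on plain Python lists. It is a teaching aid, not the pipeline's implementation: a real pipeline would normally call an image library, and border handling here (clipping the window at the edges) is a simplification.

```python
def otsu_threshold(pixels):
    """Return the global threshold (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0 = sum0 = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]          # pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0        # pixels with value > t
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def adaptive_threshold(image, block_size=11, constant=2):
    """Mean adaptive threshold: compare each pixel with its local mean minus `constant`."""
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be odd and >= 3")
    h, w, r = len(image), len(image[0]), block_size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            local = sum(window) / len(window) - constant
            out[y][x] = 255 if image[y][x] > local else 0
    return out


# Unevenly lit row: ink is 40 on the dim side and 120 on the bright side,
# while the dim background is 100, so no single global threshold separates both.
row = [[100, 100, 40, 100, 100, 200, 200, 120, 200, 200]]
binary = adaptive_threshold(row, block_size=3, constant=2)
```

In this sketch the adaptive method marks both ink pixels as black because each is compared only with its neighbors, which is why it tends to suit documents with uneven lighting.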

### Morphological Operations

```python
"morphology": {
    "enabled": True/False,              # Whether to apply morphological operations
    "operation": "close",               # "open", "close", "both"
    "kernel_size": 1,                   # Size of the structuring element
    "kernel_shape": "rect"              # "rect", "ellipse", "cross"
}
```

Morphological operations:
- **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
- **Close**: Dilation followed by erosion - fills small holes and connects broken elements
- **Both**: Applies opening followed by closing
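The three operations can be illustrated on a small binary grid with a square ("rect") structuring element. This is a pure-Python sketch for intuition only; the function names are hypothetical, and the pipeline itself may interpret `kernel_size` differently.

```python
def _filter(image, size, pick):
    """Apply `pick` (min or max) over a size x size window, clipped at the borders."""
    h, w, r = len(image), len(image[0]), size // 2
    return [[pick(image[j][i]
                  for j in range(max(0, y - r), min(h, y + r + 1))
                  for i in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]


def erode(image, size=3):
    return _filter(image, size, min)


def dilate(image, size=3):
    return _filter(image, size, max)


def opening(image, size=3):
    """Erosion then dilation: removes specks smaller than the kernel."""
    return dilate(erode(image, size), size)


def closing(image, size=3):
    """Dilation then erosion: fills holes smaller than the kernel."""
    return erode(dilate(image, size), size)


def both(image, size=3):
    """Opening followed by closing."""
    return closing(opening(image, size), size)


# A single isolated foreground pixel (noise) on an empty 5x5 page.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 1
cleaned = opening(speck)  # the speck disappears
```

Note that in this sketch a size of 1 reduces the window to the pixel itself, so `opening(speck, 1)` returns the input unchanged.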

### Document Type Configurations

The system includes optimized settings for different document types:

```python
"document_types": {
    "standard": {
        # Default settings - will use the global settings
    },
    "newspaper": {
        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
        "thresholding": {
            "method": "adaptive", 
            "adaptive_block_size": 15,
            "adaptive_constant": 3,
            "preblur": {"method": "gaussian", "kernel_size": 3}
        },
        "morphology": {"operation": "close", "kernel_size": 1}
    },
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {
            "method": "adaptive", 
            "adaptive_block_size": 31, 
            "adaptive_constant": 5,
            "preblur": {"method": "median", "kernel_size": 3}
        },
        "morphology": {"operation": "open", "kernel_size": 1}
    },
    "book": {
        "deskew": {"enabled": True},
        "thresholding": {
            "method": "otsu",
            "preblur": {"method": "gaussian", "kernel_size": 5}
        },
        "morphology": {"operation": "both", "kernel_size": 1}
    }
}
```
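One plausible way for per-type overrides to combine with the global settings is a recursive merge, sketched below. The helpers `deep_merge` and `settings_for` are hypothetical, and the merge behavior is an assumption; the actual resolution logic lives in the pipeline code.

```python
import copy


def deep_merge(base, override):
    """Return a copy of `base` with `override` applied key by key, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def settings_for(document_type, global_settings, document_types):
    """Resolve effective settings; unknown types fall back to the globals (assumed)."""
    return deep_merge(global_settings, document_types.get(document_type, {}))


global_settings = {
    "deskew": {"enabled": False, "angle_threshold": 0.1, "max_angle": 45.0, "use_hough": True},
}
document_types = {
    "standard": {},
    "handwritten": {"deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False}},
}
effective = settings_for("handwritten", global_settings, document_types)
```

Under this assumption, the `handwritten` profile keeps the global `max_angle` of 45.0 because it does not override that key.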

## Performance and Logging

```python
"performance": {
    "parallel": {
        "enabled": True/False,          # Whether to use parallel processing
        "max_workers": 4                # Maximum number of worker threads
    },
    "timeout_ms": 10000                 # Timeout for preprocessing (in milliseconds)
},

"logging": {
    "enabled": True/False,              # Whether to log preprocessing metrics
    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
    "output_path": "logs/preprocessing_metrics.json"
}
```
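The sketch below shows one way `timeout_ms` and the logged metrics could fit together. It rests on several assumptions: the names `run_step` and `binary_nonzero_pct` are hypothetical, `processing_time` is taken to be in milliseconds, and a timed-out step falls back to the unprocessed input.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def binary_nonzero_pct(binary):
    """Percentage of non-zero pixels in a 2D list (one reading of the metric name)."""
    flat = [p for row in binary for p in row]
    return 100.0 * sum(1 for p in flat if p) / len(flat) if flat else 0.0


def run_step(step, image, timeout_ms=10000):
    """Run step(image) in a worker thread and return (result, metrics).

    On timeout the original image is returned. The worker thread cannot be
    killed, so it keeps running in the background until it finishes.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    future = pool.submit(step, image)
    try:
        result, timed_out = future.result(timeout=timeout_ms / 1000), False
    except FutureTimeout:
        result, timed_out = image, True
    finally:
        pool.shutdown(wait=False)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, {"processing_time": elapsed_ms, "timed_out": timed_out}
```

The returned metrics dictionary could then be serialized with `json.dump` to the configured `output_path`.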

## Usage with OCR Processing

To apply a document-type profile, set `document_type` in the preprocessing options. The legacy options can still be passed alongside it:

```python
preprocessing_options = {
    "document_type": "newspaper",  # Use newspaper-optimized settings
    "grayscale": True,             # Legacy option: apply grayscale conversion
    "denoise": True,               # Legacy option: apply denoising
    "contrast": 10,                # Legacy option: adjust contrast (0-100)
    "rotation": 0                  # Legacy option: manual rotation (degrees)
}

# Apply preprocessing and OCR
result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
```

## Visual Examples

### Original Document
*[A historical newspaper or document image would be shown here]*

### After Deskewing
*[The same document, with skew corrected]*

### After Thresholding
*[The document converted to binary with clear text]*

### After Morphological Operations
*[The binary image with small noise removed and/or gaps filled]*

## Troubleshooting

### Poor Deskewing Results
- **Symptom**: Document skew is not correctly detected or corrected
- **Solution**: Try adjusting `angle_threshold` or `max_angle`, or set `use_hough` to `False` for handwritten documents

### Thresholding Issues
- **Symptom**: Text is lost or background noise is excessive after thresholding
- **Solution**: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`

### Performance Concerns
- **Symptom**: Processing is too slow for large documents
- **Solution**: Enable parallel processing (`performance.parallel.enabled`), reduce the image size, or disable preprocessing steps you do not need