Configuration Refactoring

Overview

This document outlines the changes made to centralize configuration parameters and reduce technical debt in the OCR processing system.

Key Changes

Centralized Configuration

All previously hard-coded parameters have been moved to config.py and organized by functional category:

  • PDF_SETTINGS: Parameters for PDF processing
  • SEGMENTATION_SETTINGS: Image segmentation configuration
  • CACHE_SETTINGS: Cache TTL and capacity settings
  • TEXT_REPAIR_SETTINGS: Duplication detection and repair thresholds
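
As a rough illustration of this layout, config.py might group its values into one dictionary per category. The four dictionary names come from the list above; every key and value inside them is a hypothetical placeholder, not the project's actual settings.

```python
# Sketch of a category-based config.py layout.
# NOTE: all keys and values below are illustrative assumptions.

# Parameters for PDF processing
PDF_SETTINGS = {
    "default_dpi": 150,   # assumed rendering resolution
    "max_pages": 20,      # assumed cap on pages processed per document
}

# Image segmentation configuration
SEGMENTATION_SETTINGS = {
    "min_region_size": 50,  # assumed minimum region size in pixels
}

# Cache TTL and capacity settings
CACHE_SETTINGS = {
    "ttl_seconds": 3600,  # assumed time-to-live for cached results
    "max_entries": 20,    # assumed maximum number of cached items
}

# Duplication detection and repair thresholds
TEXT_REPAIR_SETTINGS = {
    "duplication_threshold": 0.8,  # assumed similarity cutoff (0 to 1)
}
```

Grouping by category keeps related values together, so a caller imports only the dictionary it needs.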

Environment Variable Support

All configuration parameters can now be overridden via environment variables:

# Example: Override PDF DPI
export PDF_DEFAULT_DPI=200

# Example: Increase cache size
export CACHE_MAX_ENTRIES=50
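
Environment variables always arrive as strings, so config.py has to convert them and fall back to a default when a variable is unset or malformed. The following is a minimal sketch of one way to do that. The variable names PDF_DEFAULT_DPI and CACHE_MAX_ENTRIES come from the examples above; the helper env_int, the dictionary keys, and the default values are assumptions made for illustration.

```python
import os


def env_int(name: str, default: int) -> int:
    """Return environment variable `name` as an int, or `default`.

    Hypothetical helper: falls back to `default` when the variable is
    unset, empty, or not a valid integer.
    """
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        return default


# Assumed defaults; only the environment variable names are from the docs.
PDF_SETTINGS = {"default_dpi": env_int("PDF_DEFAULT_DPI", 150)}
CACHE_SETTINGS = {"max_entries": env_int("CACHE_MAX_ENTRIES", 20)}
```

Because the values are read when the module is first imported, the variables must be exported before the application starts.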

Import Strategy

To prevent circular imports, modules import configuration inside the functions that use it rather than at module level:

def process_image():
    from config import SEGMENTATION_SETTINGS
    # Function implementation using settings
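
A slightly fuller, runnable version of the same pattern is sketched below. So that it runs on its own, it registers a stand-in config module in sys.modules. The setting key min_region_size and the region_sizes parameter are hypothetical and exist only to give the function something to do.

```python
import sys
import types

# Stand-in for the real config.py so this sketch is self-contained.
# The key "min_region_size" is a hypothetical example setting.
_config = types.ModuleType("config")
_config.SEGMENTATION_SETTINGS = {"min_region_size": 50}
sys.modules["config"] = _config


def process_image(region_sizes):
    """Keep only regions at least as large as the configured minimum."""
    # Deferred import: resolved when the function is called, not when
    # this module is loaded, so a config <-> module import cycle cannot
    # fail during startup.
    from config import SEGMENTATION_SETTINGS

    min_size = SEGMENTATION_SETTINGS["min_region_size"]
    return [size for size in region_sizes if size >= min_size]
```

After the first call the import is a dictionary lookup in sys.modules, so the per-call overhead is small.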

Benefits

  • Maintainability: Settings are centralized and documented
  • Flexibility: Configuration can be adjusted without code changes
  • Consistency: Standardized approach to configuration across modules
  • Traceability: Clear overview of all configurable parameters

Future Improvements

  • Add configuration schema validation
  • Support configuration profiles (dev/test/prod)
  • Add detailed documentation for each parameter
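
Schema validation could start as a simple type and range check run at startup. The sketch below shows one possible shape for it. The function validate_settings, the schema format, and the keys in CACHE_SCHEMA are all assumptions, not part of the current code.

```python
def validate_settings(settings: dict, schema: dict) -> list:
    """Return a list of problems found; an empty list means valid.

    `schema` maps each required key to (expected numeric type, minimum).
    Hypothetical helper covering numeric settings only.
    """
    problems = []
    for key, (expected_type, minimum) in schema.items():
        if key not in settings:
            problems.append(f"missing key: {key}")
            continue
        value = settings[key]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif value < minimum:
            problems.append(f"{key}: {value} is below the minimum {minimum}")
    return problems


# Assumed schema for the cache category; the keys are illustrative.
CACHE_SCHEMA = {"ttl_seconds": (int, 1), "max_entries": (int, 1)}
```

Returning a list of problems instead of raising on the first one lets a startup check report every misconfigured value at once.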