Technical Context: HOCR Processing Tool

1. Core Language

  • Python: The project is primarily written in Python, as indicated by the .py files. The specific version should be confirmed (e.g., via requirements.txt or environment setup).

2. Key Libraries & Frameworks

  • OCR Engine: Likely Tesseract OCR, potentially accessed via the pytesseract wrapper (common practice). This needs confirmation by inspecting ocr_processing.py or dependencies.
  • Image Processing: OpenCV (cv2) and/or Pillow (PIL) are highly probable for tasks in preprocessing.py and image_segmentation.py.
  • PDF Handling: pdf2image (which often relies on Poppler) is a common choice for converting PDF pages to images, relevant for pdf_ocr.py. Other PDF libraries like PyMuPDF or PyPDF2 might also be used.
  • Web Framework/UI: Based on app.py and the ui/ directory, Flask or Streamlit are potential candidates for the user interface or API layer.
  • Configuration: Probably handled through standard Python mechanisms (e.g., .ini files read with configparser, .json files, or a custom Python module such as config.py).
  • Dependency Management: Likely uses pip with a requirements.txt file (observed in the file listing). Virtual environments (like venv or conda) are standard practice.
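Most of the guesses above can be checked against requirements.txt. The following is a minimal sketch, assuming the file uses plain `name==version` style lines; the `CANDIDATES` table and both function names are hypothetical and only mirror the libraries named in this section.

```python
import re
from typing import Dict, Iterable, Set

# Candidate packages named in this section, keyed by normalized PyPI name.
# This table is an assumption drawn from the guesses above, not from the code.
CANDIDATES: Dict[str, str] = {
    "pytesseract": "OCR engine wrapper",
    "opencv-python": "image processing (cv2)",
    "pillow": "image processing (PIL)",
    "pdf2image": "PDF to image conversion",
    "pymupdf": "PDF handling",
    "pypdf2": "PDF handling",
    "flask": "web framework",
    "streamlit": "web UI",
}


def declared_packages(lines: Iterable[str]) -> Set[str]:
    """Extract normalized package names from requirements-style lines."""
    names = set()
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # Normalize: lowercase, collapse runs of '-', '_', '.' into '-'
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def confirm_candidates(lines: Iterable[str]) -> Dict[str, str]:
    """Return the candidate packages that the requirements actually declare."""
    declared = declared_packages(lines)
    return {name: role for name, role in CANDIDATES.items() if name in declared}
```

Passing an open file works because a file object iterates over its lines, for example `confirm_candidates(open("requirements.txt"))`. Lines that point at URLs or local paths are not handled by this sketch.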

3. External Dependencies & Setup

  • Tesseract OCR Engine: Requires separate installation on the host system. The path to the Tesseract executable might need configuration.
  • Poppler: Often required by pdf2image for PDF processing; needs separate installation.
  • Python Environment: Requires a specific Python version and the packages listed in requirements.txt.
  • Environment Variables: Potential use of environment variables for configuration (e.g., API keys, paths), possibly managed via a .env file (observed in the file listing).
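Whether the external binaries are reachable can be checked before any OCR runs. This is a minimal sketch using only the standard library; the environment variable name `TESSERACT_CMD` and the function names are hypothetical, and `pdftoppm` is used as the Poppler probe because it is one of the utilities pdf2image invokes.

```python
import os
import shutil
from typing import Dict, Optional


def locate_binary(name: str, env_var: Optional[str] = None) -> Optional[str]:
    """Return a path to `name`, preferring an override taken from `env_var`.

    Returns None when nothing executable is found.
    """
    override = os.environ.get(env_var) if env_var else None
    if override:
        # shutil.which also accepts a full path and checks it is executable
        return shutil.which(override)
    return shutil.which(name)


def check_binaries() -> Dict[str, Optional[str]]:
    """Map each external tool to its resolved path, or None if missing."""
    return {
        # TESSERACT_CMD is a hypothetical variable name chosen for this sketch
        "tesseract": locate_binary("tesseract", env_var="TESSERACT_CMD"),
        "pdftoppm": locate_binary("pdftoppm"),
    }
```

If the project does use pytesseract and pdf2image, a resolved Tesseract path can be assigned to `pytesseract.pytesseract.tesseract_cmd`, and the directory containing the Poppler tools can be passed as the `poppler_path` argument of `pdf2image.convert_from_path`.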

4. Development Environment

  • Standard Python Setup: Requires a Python interpreter, pip, and likely virtualenv.
  • Code Editor/IDE: VS Code is being used (based on environment details). Settings might be stored in .vscode/.
  • Version Control: Git is likely used (indicated by .gitignore and .gitattributes). The .git_disabled directory suggests the .git directory was renamed to disable Git temporarily.
  • Testing: The testing/ directory and the .pytest_cache directory suggest pytest is used for running tests.

5. Technical Constraints & Considerations

  • Performance: OCR, especially on large documents or batches, can be computationally intensive. Image processing steps add overhead.
  • Tesseract Limitations: Tesseract's accuracy depends heavily on image quality, preprocessing, and language model availability.
  • Dependency Hell: Managing Python dependencies and external binaries (Tesseract, Poppler) across different operating systems can be challenging.
  • Layout Complexity: Handling complex layouts (multi-column, tables, embedded images) requires sophisticated segmentation logic.
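To make the layout point concrete, the simplest form of column segmentation is a vertical projection profile: count ink pixels in each pixel column and split wherever a wide enough empty gutter appears. The sketch below is a plain-Python baseline under stated assumptions (a binarized, deskewed page given as a grid of 0 and 1); it is not taken from image_segmentation.py, and the function name and `min_gap` parameter are hypothetical.

```python
from typing import List, Tuple


def find_column_spans(binary: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges, end exclusive, of text columns.

    `binary` is a row-major grid where 1 marks an ink pixel and 0 background.
    Empty runs narrower than `min_gap` count as spacing inside a column
    rather than as a gutter between columns.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: number of ink pixels in every pixel column
    profile = [sum(row[x] for row in binary) for x in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None  # x position where the current text column began
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Gutter found: close the column just before the gap began
                spans.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last column, trimming any trailing empty pixels
        spans.append((start, width - gap))
    return spans


page = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
]
print(find_column_spans(page))  # [(0, 4), (7, 9)]
```

A profile like this breaks down on skewed scans, tables, and images that span the gutter, which is why the more complex layouts listed above need more than this baseline.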

(This document provides an initial technical overview based on file structure and common practices. It requires verification by examining code and configuration files like requirements.txt.)