seanpedrickcase committed on
Commit 4852fb5 · 1 Parent(s): 419fb7d

Added regex functionality to deny lists. Corrected Tesseract to word-level parsing. Improved review search regex capabilities. Updated documentation.
README.md CHANGED
@@ -14,7 +14,7 @@ version: 1.5.2
 
 Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
 
-To identify text in documents, the 'Local' text extraction uses PikePDF, and OCR image analysis uses Tesseract, and works well only for documents with typed text or scanned PDFs with clear text. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. PaddleOCR and VLM support is also provided (see the installation instructions below).
+To extract text from documents, the 'Local' options are PikePDF for PDFs with selectable text, and OCR with Tesseract. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. PaddleOCR and VLM support is also provided (see the installation instructions below).
 
 For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.
app.py CHANGED
@@ -1034,7 +1034,7 @@ with blocks:
 
 Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide]({USER_GUIDE_URL}) for a full walkthrough of all the features in the app.
 
-To identify text in documents, the 'Local' text extraction uses PikePDF, and OCR image analysis uses Tesseract, and works well only for documents with typed text or scanned PDFs with clear text. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.
+To extract text from documents, the 'Local' options are PikePDF for PDFs with selectable text, and OCR with Tesseract. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.
 
 Additional options on the 'Redaction settings' include, the type of information to redact (e.g. people, places), custom terms to include/ exclude from redaction, fuzzy matching, language settings, and whole page redaction. After redaction is complete, you can view and modify suggested redactions on the 'Review redactions' tab to quickly create a final redacted document.
 
@@ -1688,7 +1688,9 @@ with blocks:
 
     with gr.Row(equal_height=True):
         multi_word_search_text = gr.Textbox(
-            label="Multi-word text search", value="", scale=4
+            label="Multi-word text search (regex enabled below)",
+            value="",
+            scale=4,
         )
         multi_word_search_text_btn = gr.Button(
             value="Search", scale=1
@@ -4756,7 +4758,8 @@ with blocks:
 
         duplicate_files_out,
         full_duplicate_data_by_file,
     ],
-    api_name="word_level_ocr_text_search")
+    api_name="word_level_ocr_text_search",
+)
 
 # Clicking on a cell in the redact items table will take you to that page
 all_page_line_level_ocr_results_with_words_df.select(
src/app_settings.qmd CHANGED
@@ -173,6 +173,10 @@ Configurations for the Gradio UI, server behavior, and application limits.
     * **Description:** If set to `"True"`, the application will be served via FastAPI, allowing for API endpoint integration.
     * **Default Value:** `"False"`
 
+* **`RUN_MCP_SERVER`**
+    * **Description:** If set to `"True"`, the application will run as an MCP (Model Context Protocol) server.
+    * **Default Value:** `"False"`
+
 * **`GRADIO_SERVER_NAME`**
     * **Description:** The IP address the Gradio server will bind to. Use `"0.0.0.0"` to allow external access.
     * **Default Value:** `"0.0.0.0"`
@@ -347,6 +351,10 @@ Configurations related to text extraction, PII detection, and the redaction proc
     * **Description:** Saves images with PaddleOCR's detected bounding boxes overlaid.
     * **Default Value:** `"False"`
 
+* **`SAVE_WORD_SEGMENTER_OUTPUT_IMAGES`**
+    * **Description:** If set to `"True"`, saves output images from the word segmenter for debugging purposes.
+    * **Default Value:** `"False"`
+
 * **`PREPROCESS_LOCAL_OCR_IMAGES`**
     * **Description:** If set to `"True"`, images will be preprocessed before local OCR. Can slow down processing.
     * **Default Value:** `"True"`
@@ -367,6 +375,10 @@ Configurations related to text extraction, PII detection, and the redaction proc
     * **Description:** Tesseract PSM (Page Segmentation Mode) level to use for OCR. Valid values are 0-13.
     * **Default Value:** `11`
 
+* **`TESSERACT_WORD_LEVEL_OCR`**
+    * **Description:** If set to `"True"`, uses Tesseract word-level OCR instead of line-level.
+    * **Default Value:** `"True"`
+
 * **`CONVERT_LINE_TO_WORD_LEVEL`**
     * **Description:** If set to `"True"`, converts PaddleOCR line-level OCR results to word-level for better precision.
     * **Default Value:** `"False"`
src/user_guide.qmd CHANGED
@@ -194,6 +194,8 @@ To import this to use with your redaction tasks, go to the 'Redaction settings'
 
 Say you wanted to remove specific terms from a document. In this app you can do this by providing a custom deny list as a csv. Like for the allow list described above, this should be a one-column csv without a column header. The app will suggest each individual term in the list with exact spelling as whole words. So it won't select text from within words. To enable this feature, the 'CUSTOM' tag needs to be chosen as a redaction entity [(the process for adding/removing entity types to redact is described below)](#redacting-additional-types-of-personal-information).
 
+**NOTE:** As of version 1.5.2, you can now provide deny list terms as regex patterns.
+
 Here is an example using the [Partnership Agreement Toolkit file](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf). This is an [example of a custom deny list file](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/allow_list/partnership_toolkit_redact_custom_deny_list.csv). 'Sister', 'Sister City'
 'Sister Cities', 'Friendship City' have been listed as specific terms to redact. You can see the outputs of this redaction process on the review page:
 
@@ -367,7 +369,7 @@ The workflow is designed to be simple: **Search → Select → Redact**.
 
 1. Navigate to the **"Search text to make new redactions"** tab.
 2. The main table will initially be populated with all the text extracted from the document for a page, broken down by word.
-3. To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find (this will search the whole document). If you want to do a regex-based search, tick the 'Enable regex pattern matching' box under 'Search options' below (Note this will only be able to search for patterns in text within each cell).
+3. To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find (this will search the whole document). If you want to do a regex-based search, tick the 'Enable regex pattern matching' box under 'Search options' below.
4. Click the **"Search"** button or press Enter.
 5. The table below will update to show only the rows containing text that matches your search query.
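The step-3 behaviour above (plain-text search by default, regex when the 'Enable regex pattern matching' box is ticked) can be sketched as a small stand-in filter; the function and row structure here are hypothetical illustrations, not the app's actual code:

```python
import re


def filter_rows(rows, query, use_regex=False):
    """Filter word-level rows: case-insensitive substring by default, regex when enabled."""
    if use_regex:
        pattern = re.compile(query, re.IGNORECASE)
        return [row for row in rows if pattern.search(row["text"])]
    query_lower = query.lower()
    return [row for row in rows if query_lower in row["text"].lower()]


# Hypothetical word-level search table rows
rows = [
    {"page": 1, "text": "Sister City"},
    {"page": 2, "text": "Friendship City"},
    {"page": 2, "text": "Case ID 4852"},
]
```

A plain search for "city" keeps the first two rows, while a regex search for a four-digit pattern keeps only the ID row.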
tools/config.py CHANGED
@@ -281,7 +281,9 @@ FAVICON_PATH = get_or_create_env_var("FAVICON_PATH", "favicon.png")
 
 RUN_FASTAPI = convert_string_to_boolean(get_or_create_env_var("RUN_FASTAPI", "False"))
 
-RUN_MCP_SERVER = convert_string_to_boolean(get_or_create_env_var("RUN_MCP_SERVER", "False"))
+RUN_MCP_SERVER = convert_string_to_boolean(
+    get_or_create_env_var("RUN_MCP_SERVER", "False")
+)
 
 MAX_QUEUE_SIZE = int(get_or_create_env_var("MAX_QUEUE_SIZE", "5"))
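For context, the two helpers used in this hunk can be sketched as follows; this is an assumed implementation inferred from their names and usage here, not the repository's actual code for them:

```python
import os


def get_or_create_env_var(name: str, default: str) -> str:
    # Assumed behaviour: return the variable if set, otherwise register and return the default
    value = os.environ.get(name)
    if value is None:
        os.environ[name] = default
        value = default
    return value


def convert_string_to_boolean(value: str) -> bool:
    # Assumed behaviour: "True"-style strings become booleans, case-insensitively
    return value.strip().lower() in ("true", "1", "yes")


RUN_MCP_SERVER = convert_string_to_boolean(
    get_or_create_env_var("RUN_MCP_SERVER", "False")
)
```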
tools/custom_image_analyser_engine.py CHANGED
@@ -1240,7 +1240,7 @@ class CustomImageAnalyzerEngine:
         print(
             f"Warning: Image dimension mismatch! Expected {image_width}x{image_height}, but got {actual_width}x{actual_height}"
         )
-        #print(f"Using actual dimensions: {actual_width}x{actual_height}")
+        # print(f"Using actual dimensions: {actual_width}x{actual_height}")
         # Update to use actual dimensions
         image_width = actual_width
         image_height = actual_height
@@ -1598,26 +1598,28 @@ class CustomImageAnalyzerEngine:
     # Calculate line-level bounding boxes and average confidence
     def _calculate_line_bbox(self, group):
         # Get the leftmost and rightmost positions
-        left = group['left'].min()
-        top = group['top'].min()
-        right = (group['left'] + group['width']).max()
-        bottom = (group['top'] + group['height']).max()
+        left = group["left"].min()
+        top = group["top"].min()
+        right = (group["left"] + group["width"]).max()
+        bottom = (group["top"] + group["height"]).max()
 
         # Calculate width and height
         width = right - left
         height = bottom - top
 
         # Calculate average confidence
-        avg_conf = round(group['conf'].mean(), 0)
-
-        return pd.Series({
-            'text': ' '.join(group['text'].astype(str).tolist()),
-            'left': left,
-            'top': top,
-            'width': width,
-            'height': height,
-            'conf': avg_conf
-        })
+        avg_conf = round(group["conf"].mean(), 0)
+
+        return pd.Series(
+            {
+                "text": " ".join(group["text"].astype(str).tolist()),
+                "left": left,
+                "top": top,
+                "width": width,
+                "height": height,
+                "conf": avg_conf,
+            }
+        )
 
     def _perform_hybrid_ocr(
         self,
@@ -1628,7 +1630,7 @@ class CustomImageAnalyzerEngine:
         image_name: str = "unknown_image_name",
     ) -> Dict[str, list]:
         """
-        Performs hybrid OCR on an image using Tesseract for initial OCR and PaddleOCR/VLM to enhance
+        Performs hybrid OCR on an image using Tesseract for initial OCR and PaddleOCR/VLM to enhance
         results for low-confidence or uncertain words.
 
         Args:
@@ -1637,12 +1639,12 @@ class CustomImageAnalyzerEngine:
             re-analyzed with secondary OCR (PaddleOCR/VLM). Defaults to HYBRID_OCR_CONFIDENCE_THRESHOLD.
             padding (int, optional): Pixel padding (in all directions) to add around each word box when
             cropping for secondary OCR. Defaults to HYBRID_OCR_PADDING.
-            ocr (Optional[Any], optional): An instance of the PaddleOCR or VLM engine. If None, will use the
+            ocr (Optional[Any], optional): An instance of the PaddleOCR or VLM engine. If None, will use the
             instance's `paddle_ocr` attribute if available. Only necessary for PaddleOCR-based pipelines.
             image_name (str, optional): Optional name of the image, useful for debugging and visualization.
 
         Returns:
-            Dict[str, list]: OCR results in the dictionary format of pytesseract.image_to_data (keys:
+            Dict[str, list]: OCR results in the dictionary format of pytesseract.image_to_data (keys:
             'text', 'left', 'top', 'width', 'height', 'conf', 'model', ...).
         """
         # Determine if we're using VLM or PaddleOCR
@@ -1657,36 +1659,36 @@ class CustomImageAnalyzerEngine:
                 "No OCR object provided and 'paddle_ocr' is not initialized."
             )
 
-        #print("Starting hybrid OCR process...")
+        # print("Starting hybrid OCR process...")
 
-        # 1. Get initial word-level results from Tesseract
+        # 1. Get initial word-level results from Tesseract
         tesseract_data = pytesseract.image_to_data(
            image,
             output_type=pytesseract.Output.DICT,
             config=self.tesseract_config,
             lang=self.tesseract_lang,
         )
 
         if TESSERACT_WORD_LEVEL_OCR is False:
             ocr_df = pd.DataFrame(tesseract_data)
 
             # Filter out invalid entries (confidence == -1)
             ocr_df = ocr_df[ocr_df.conf != -1]
 
             # Group by line and aggregate text
-            line_groups = ocr_df.groupby(['block_num', 'par_num', 'line_num'])
+            line_groups = ocr_df.groupby(["block_num", "par_num", "line_num"])
 
             ocr_data = line_groups.apply(self._calculate_line_bbox).reset_index()
 
             # Overwrite tesseract_data with the aggregated data
             tesseract_data = {
-                'text': ocr_data['text'].tolist(),
-                'left': ocr_data['left'].astype(int).tolist(),
-                'top': ocr_data['top'].astype(int).tolist(),
-                'width': ocr_data['width'].astype(int).tolist(),
-                'height': ocr_data['height'].astype(int).tolist(),
-                'conf': ocr_data['conf'].tolist(),
-                'model': ['Tesseract'] * len(ocr_data)  # Add model field
+                "text": ocr_data["text"].tolist(),
+                "left": ocr_data["left"].astype(int).tolist(),
+                "top": ocr_data["top"].astype(int).tolist(),
+                "width": ocr_data["width"].astype(int).tolist(),
+                "height": ocr_data["height"].astype(int).tolist(),
+                "conf": ocr_data["conf"].tolist(),
+                "model": ["Tesseract"] * len(ocr_data),  # Add model field
             }
 
         final_data = {
@@ -2262,24 +2264,24 @@ class CustomImageAnalyzerEngine:
 
         if TESSERACT_WORD_LEVEL_OCR is False:
             ocr_df = pd.DataFrame(ocr_data)
 
             # Filter out invalid entries (confidence == -1)
             ocr_df = ocr_df[ocr_df.conf != -1]
 
             # Group by line and aggregate text
-            line_groups = ocr_df.groupby(['block_num', 'par_num', 'line_num'])
-
-            ocr_data = line_groups.apply(self._calculate_line_bbox).reset_index()
+            line_groups = ocr_df.groupby(["block_num", "par_num", "line_num"])
+
+            ocr_data = line_groups.apply(self._calculate_line_bbox).reset_index()
 
             # Convert DataFrame to dictionary of lists format expected by downstream code
             ocr_data = {
-                'text': ocr_data['text'].tolist(),
-                'left': ocr_data['left'].astype(int).tolist(),
-                'top': ocr_data['top'].astype(int).tolist(),
-                'width': ocr_data['width'].astype(int).tolist(),
-                'height': ocr_data['height'].astype(int).tolist(),
-                'conf': ocr_data['conf'].tolist(),
-                'model': ['Tesseract'] * len(ocr_data)  # Add model field
+                "text": ocr_data["text"].tolist(),
+                "left": ocr_data["left"].astype(int).tolist(),
+                "top": ocr_data["top"].astype(int).tolist(),
+                "width": ocr_data["width"].astype(int).tolist(),
+                "height": ocr_data["height"].astype(int).tolist(),
+                "conf": ocr_data["conf"].tolist(),
+                "model": ["Tesseract"] * len(ocr_data),  # Add model field
             }
 
         elif self.ocr_engine == "paddle" or self.ocr_engine == "hybrid-paddle-vlm":
@@ -2457,8 +2459,8 @@ class CustomImageAnalyzerEngine:
 
         # Convert line-level results to word-level if configured and needed
         if CONVERT_LINE_TO_WORD_LEVEL and self._is_line_level_data(ocr_data):
-            #print("Converting line-level OCR results to word-level...")
-
+            # print("Converting line-level OCR results to word-level...")
+
             # Check if coordinates need to be scaled to match the image we're cropping from
             # For PaddleOCR: _convert_paddle_to_tesseract_format converts coordinates to original image space
             # - If PaddleOCR processed the original image (image_path provided), crop from original image (no scaling)
@@ -2496,7 +2498,10 @@ class CustomImageAnalyzerEngine:
         elif self.ocr_engine == "tesseract":
             # For Tesseract: if scale_factor != 1.0, rescale_ocr_data converted coordinates to original space
             # So we need to crop from the original image, not the preprocessed image
-            if scale_factor != 1.0 and original_image_for_visualization is not None:
+            if (
+                scale_factor != 1.0
+                and original_image_for_visualization is not None
+            ):
                 # Coordinates are in original space, so crop from original image
                 crop_image = original_image_for_visualization
                 crop_image_width = original_image_width
@@ -2589,7 +2594,6 @@ class CustomImageAnalyzerEngine:
         def get_model(idx):
             return default_model
 
-
         output = [
             OCRResult(
                 text=clean_unicode_text(ocr_result["text"][i]),
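The `TESSERACT_WORD_LEVEL_OCR is False` branches above group word boxes into line boxes with pandas; the same aggregation can be sketched dependency-free (a simplified stand-in for `_calculate_line_bbox`, not the class's exact code):

```python
from collections import defaultdict


def aggregate_words_to_lines(ocr_data):
    """Group Tesseract word boxes by (block, par, line) and merge each group into one line box."""
    groups = defaultdict(list)
    for i in range(len(ocr_data["text"])):
        if ocr_data["conf"][i] == -1:  # skip invalid Tesseract entries
            continue
        key = (ocr_data["block_num"][i], ocr_data["par_num"][i], ocr_data["line_num"][i])
        groups[key].append(i)
    lines = []
    for key in sorted(groups):
        idxs = groups[key]
        # Line box spans from the leftmost/topmost word edge to the rightmost/bottommost one
        left = min(ocr_data["left"][i] for i in idxs)
        top = min(ocr_data["top"][i] for i in idxs)
        right = max(ocr_data["left"][i] + ocr_data["width"][i] for i in idxs)
        bottom = max(ocr_data["top"][i] + ocr_data["height"][i] for i in idxs)
        conf = round(sum(ocr_data["conf"][i] for i in idxs) / len(idxs), 0)
        text = " ".join(str(ocr_data["text"][i]) for i in idxs)
        lines.append({"text": text, "left": left, "top": top,
                      "width": right - left, "height": bottom - top, "conf": conf})
    return lines
```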
tools/find_duplicate_pages.py CHANGED
@@ -854,21 +854,83 @@ def find_consecutive_sequence_matches(
854
  reference_tokens = reference_df["text_clean"].tolist()
855
  reference_indices = reference_df.index.tolist()
856
 
857
- # Join tokens with spaces to reconstruct the text
858
- # Note: If tokens were split at special characters like @, this may not perfectly reconstruct
859
- # the original text, but it's the best we can do with tokenized data
860
- reference_text = " ".join(reference_tokens)
861
-
862
- # Build a mapping from character positions to token indices
863
- # This helps us map regex match positions back to token indices
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
864
  char_to_token_map = []
865
  current_pos = 0
 
866
  for idx, token in enumerate(reference_tokens):
867
- token_start = current_pos
868
- token_end = current_pos + len(token)
869
- char_to_token_map.append((token_start, token_end, reference_indices[idx]))
870
- # Add 1 for the space separator (except after last token)
871
- current_pos = token_end + (1 if idx < len(reference_tokens) - 1 else 0)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
872
 
873
  # Find all regex matches
874
  try:
@@ -891,21 +953,49 @@ def find_consecutive_sequence_matches(
891
  all_found_matches = []
892
  query_index = search_df.index[0] # Use the first (and only) query index
893
 
894
- # For each regex match, find which tokens it spans
 
 
 
 
 
 
 
 
 
 
 
 
 
 
895
  for match in matches:
896
  match_start = match.start()
897
  match_end = match.end()
898
 
899
  # Find all tokens that overlap with this match
 
 
 
 
900
  matching_token_indices = []
901
  for token_start, token_end, token_idx in char_to_token_map:
902
- # Check if token overlaps with match
903
- if not (token_end < match_start or token_start > match_end):
 
 
 
 
 
 
 
904
  matching_token_indices.append(token_idx)
905
 
906
- # Create matches for all tokens in the span
 
907
  for token_idx in matching_token_indices:
908
- all_found_matches.append((query_index, token_idx, 1))
 
 
909
 
910
  print(
911
  f"Found {len(matches)} regex match(es) spanning {len(set(idx for _, idx, _ in all_found_matches))} token(s)"
 
854
  reference_tokens = reference_df["text_clean"].tolist()
855
  reference_indices = reference_df.index.tolist()
856
 
857
+ # Concatenate ALL tokens into a single continuous string with smart spacing
858
+ # Rules:
859
+ # - Words are joined with single spaces
860
+ # - Punctuation (periods, commas, etc.) touches adjacent tokens directly (no spaces)
861
+ # Example: ["Hi", ".", "How", "are", "you", "?", "Great"] -> "Hi.How are you?Great"
862
+ # This allows regex patterns to span multiple tokens naturally while preserving word boundaries
863
+
864
+ def is_punctuation_only(token):
865
+ """Check if token contains only punctuation characters"""
866
+ if not token:
867
+ return False
868
+ # Check if all characters are punctuation (using string.punctuation or our set)
869
+ import string
870
+
871
+ return all(c in string.punctuation for c in token)
872
+
873
+ def starts_with_punctuation(token):
874
+ """Check if token starts with punctuation"""
875
+ if not token:
876
+ return False
877
+ import string
878
+
879
+ return token[0] in string.punctuation
880
+
881
+ def ends_with_punctuation(token):
882
+ """Check if token ends with punctuation"""
883
+ if not token:
884
+ return False
885
+ import string
886
+
887
+ return token[-1] in string.punctuation
888
+
889
+ # Build the concatenated string and position mapping
890
+ reference_text_parts = []
891
  char_to_token_map = []
892
  current_pos = 0
893
+
894
  for idx, token in enumerate(reference_tokens):
895
+ # Determine if we need a space before this token
896
+ needs_space_before = False
897
+ if idx > 0: # Not the first token
898
+ prev_token = reference_tokens[idx - 1]
899
+ # Add space if:
900
+ # - Current token is not punctuation-only AND
901
+ # - Previous token is not punctuation-only AND
902
+ # - Previous token didn't end with punctuation AND
903
+ # - Current token doesn't start with punctuation
904
+ if (
905
+ not is_punctuation_only(token)
906
+ and not is_punctuation_only(prev_token)
907
+ and not ends_with_punctuation(prev_token)
908
+ and not starts_with_punctuation(token)
909
+ ):
910
+ needs_space_before = True
911
+
912
+ # Add space if needed
913
+ if needs_space_before:
914
+ current_pos += 1 # Account for the space
915
+
916
+ # Record token position in the concatenated string
917
+ token_start_in_text = current_pos
918
+ token_end_in_text = current_pos + len(token)
919
+ char_to_token_map.append(
920
+ (token_start_in_text, token_end_in_text, reference_indices[idx])
921
+ )
922
+
923
+ # Add token to the concatenated string
924
+ if needs_space_before:
925
+ reference_text_parts.append(" " + token)
926
+ else:
927
+ reference_text_parts.append(token)
928
+
929
+ # Move position forward by token length (and space if added)
930
+ current_pos = token_end_in_text
931
+
932
+ # Join all parts to create the final concatenated string
933
+ reference_text = "".join(reference_text_parts)
934
 
         # Find all regex matches
         try:

             all_found_matches = []
             query_index = search_df.index[0]  # Use the first (and only) query index

+            # Optimize overlap detection for large documents
+            # Instead of checking every token for every match (O(m*n)), we can use the fact that
+            # char_to_token_map is sorted by position. For each match, we only need to check
+            # tokens that could possibly overlap.
+
+            # For each regex match found in the concatenated string:
+            # 1. Get the match's start and end character positions
+            # 2. Find all tokens whose character ranges overlap with the match
+            # 3. Include all overlapping tokens in the results
+            # This ensures patterns spanning multiple tokens are captured correctly
+
+            # Optimization: Use a set to track which tokens we've already found
+            # This prevents duplicates if multiple matches overlap the same tokens
+            found_token_indices = set()
+
             for match in matches:
                 match_start = match.start()
                 match_end = match.end()

                 # Find all tokens that overlap with this match
+                # A token overlaps if: token_start < match_end AND token_end > match_start
+                # Optimization: Since char_to_token_map is sorted by start position,
+                # we can stop early once we pass match_end, but we still need to check
+                # tokens that start before match_end (they might extend into the match)
                 matching_token_indices = []
                 for token_start, token_end, token_idx in char_to_token_map:
+                    # Early exit optimization: if token starts after match ends, no more overlaps possible
+                    # (This works because tokens are processed in order)
+                    if token_start >= match_end:
+                        break
+
+                    # Check if token overlaps with match (not disjoint)
+                    if (
+                        token_end > match_start
+                    ):  # token_start < match_end already checked by break above
                         matching_token_indices.append(token_idx)

+                # Create matches for all tokens that overlap with the regex match
+                # This ensures patterns spanning multiple tokens are captured
                 for token_idx in matching_token_indices:
+                    if token_idx not in found_token_indices:
+                        all_found_matches.append((query_index, token_idx, 1))
+                        found_token_indices.add(token_idx)

             print(
                 f"Found {len(matches)} regex match(es) spanning {len(set(idx for _, idx, _ in all_found_matches))} token(s)"
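The overlap scan in this hunk can be exercised in isolation. A minimal sketch of the same logic (`tokens_overlapping` is a hypothetical name; it assumes `char_to_token_map` is a list of `(start, end, idx)` tuples sorted by token start, as the commit's comments describe):

```python
def tokens_overlapping(char_to_token_map, match_start, match_end):
    """Return indices of tokens whose [start, end) span overlaps the
    match span [match_start, match_end). Relies on the map being sorted
    by token start so the scan can stop once a token begins at or after
    match_end."""
    hits = []
    for token_start, token_end, token_idx in char_to_token_map:
        if token_start >= match_end:
            break  # sorted input: no later token can overlap
        if token_end > match_start:
            hits.append(token_idx)
    return hits

# "John" spans chars 0-4, "Smith" spans 5-10; a match over chars 2-7 hits both
token_map = [(0, 4, 0), (5, 10, 1)]
print(tokens_overlapping(token_map, 2, 7))  # → [0, 1]
```

A match that falls entirely inside the single-space separator (e.g. chars 4 to 5 here) overlaps neither token and returns an empty list, which is why patterns matching only whitespace produce no redactions.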
tools/load_spacy_model_custom_recognisers.py CHANGED
@@ -352,19 +352,139 @@ def download_tesseract_lang_pack(
 
 
 #### Custom recognisers
+def _is_regex_pattern(term: str) -> bool:
+    """
+    Detect if a term is intended to be a regex pattern or a literal string.
+
+    Args:
+        term: The term to check
+
+    Returns:
+        True if the term appears to be a regex pattern, False if it's a literal string
+    """
+    term = term.strip()
+    if not term:
+        return False
+
+    # First, try to compile as regex to validate it
+    # This catches patterns like \d\d\d-\d\d\d that use regex escape sequences
+    try:
+        re.compile(term)
+        is_valid_regex = True
+    except re.error:
+        # If it doesn't compile as regex, treat as literal
+        return False
+
+    # If it compiles, check if it contains regex-like features
+    # Regex metacharacters that suggest a pattern (excluding escaped literals)
+    regex_metacharacters = [
+        "+",
+        "*",
+        "?",
+        "{",
+        "}",
+        "[",
+        "]",
+        "(",
+        ")",
+        "|",
+        "^",
+        "$",
+        ".",
+    ]
+
+    # Common regex escape sequences that indicate regex intent
+    regex_escape_sequences = [
+        "\\d",
+        "\\w",
+        "\\s",
+        "\\D",
+        "\\W",
+        "\\S",
+        "\\b",
+        "\\B",
+        "\\n",
+        "\\t",
+        "\\r",
+    ]
+
+    # Check if term contains regex metacharacters or escape sequences
+    has_metacharacters = False
+    has_escape_sequences = False
+
+    i = 0
+    while i < len(term):
+        if term[i] == "\\" and i + 1 < len(term):
+            # Check if it's a regex escape sequence
+            escape_seq = term[i : i + 2]
+            if escape_seq in regex_escape_sequences:
+                has_escape_sequences = True
+                # Skip the escape sequence (backslash + next char)
+                i += 2
+                continue
+        if term[i] in regex_metacharacters:
+            has_metacharacters = True
+        i += 1
+
+    # If it's a valid regex and contains regex features, treat as regex pattern
+    if is_valid_regex and (has_metacharacters or has_escape_sequences):
+        return True
+
+    # If it compiles but has no regex features, it might be a literal that happens to compile
+    # (e.g., "test" compiles as regex but is just literal text)
+    # In this case, if it has escape sequences, it's definitely regex
+    if has_escape_sequences:
+        return True
+
+    # Otherwise, treat as literal
+    return False
+
+
 def custom_word_list_recogniser(custom_list: List[str] = list()):
     # Create regex pattern, handling quotes carefully
+    # Supports both literal strings and regex patterns
 
     quote_str = '"'
     replace_str = '(?:"|“|”)'
 
-    custom_regex = "|".join(
-        rf"(?<!\w){re.escape(term.strip()).replace(quote_str, replace_str)}(?!\w)"
-        for term in custom_list
-    )
-    # print(custom_regex)
+    regex_patterns = []
+    literal_patterns = []
+
+    # Separate regex patterns from literal strings
+    for term in custom_list:
+        term = term.strip()
+        if not term:
+            continue
+
+        if _is_regex_pattern(term):
+            # Use regex pattern as-is (but wrap with word boundaries if appropriate)
+            # Note: Word boundaries might not be appropriate for all regex patterns
+            # (e.g., email patterns), so we'll add them conditionally
+            regex_patterns.append(term)
+        else:
+            # Escape literal strings and add word boundaries
+            escaped_term = re.escape(term).replace(quote_str, replace_str)
+            literal_patterns.append(rf"(?<!\w){escaped_term}(?!\w)")
+
+    # Combine patterns: regex patterns first, then literal patterns
+    all_patterns = []
 
-    custom_pattern = Pattern(name="custom_pattern", regex=custom_regex, score=1)
+    # Add regex patterns (without word boundaries, as they may have their own)
+    for pattern in regex_patterns:
+        all_patterns.append(f"({pattern})")
+
+    # Add literal patterns (with word boundaries)
+    all_patterns.extend(literal_patterns)
+
+    if not all_patterns:
+        # Return empty recognizer if no patterns
+        custom_pattern = Pattern(
+            name="custom_pattern", regex="(?!)", score=1
+        )  # Never matches
+    else:
+        custom_regex = "|".join(all_patterns)
+        # print(custom_regex)
+        custom_pattern = Pattern(name="custom_pattern", regex=custom_regex, score=1)
 
     custom_recogniser = PatternRecognizer(
         supported_entity="CUSTOM",
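The intent of the new `_is_regex_pattern` heuristic can be checked with a short standalone sketch (`looks_like_regex` is a simplified re-implementation of the logic above for illustration, not an import from the module: a term counts as a regex only if it compiles and contains a metacharacter or a known escape sequence):

```python
import re

REGEX_META = set("+*?{}[]()|^$.")
REGEX_ESCAPES = {r"\d", r"\w", r"\s", r"\D", r"\W", r"\S", r"\b", r"\B", r"\n", r"\t", r"\r"}

def looks_like_regex(term: str) -> bool:
    """Return True if term should be treated as a regex deny-list entry,
    False if it should be escaped and matched literally."""
    term = term.strip()
    if not term:
        return False
    try:
        re.compile(term)
    except re.error:
        return False  # invalid regex: fall back to literal matching
    i, meta, esc = 0, False, False
    while i < len(term):
        # Recognise two-character escape sequences like \d before
        # checking single metacharacters
        if term[i] == "\\" and i + 1 < len(term) and term[i:i + 2] in REGEX_ESCAPES:
            esc = True
            i += 2
            continue
        if term[i] in REGEX_META:
            meta = True
        i += 1
    return meta or esc

print(looks_like_regex(r"\d\d\d-\d\d\d"))  # → True  (treated as a pattern)
print(looks_like_regex("John Smith"))      # → False (escaped, matched literally)
```

Note the trade-off this implies for deny lists: a literal entry containing `.` or `(` (e.g. a company name like "Acme (UK)") will be classified as a regex, so such entries need their metacharacters escaped by the user.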