seanpedrickcase committed on
Commit 4852fb5 · 1 Parent(s): 419fb7d

Added regex functionality to deny lists. Corrected Tesseract to word-level parsing. Improved review search regex capabilities. Updated documentation.
README.md CHANGED
@@ -14,7 +14,7 @@ version: 1.5.2
 
 Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
 
-To identify text in documents, the 'Local' text extraction uses PikePDF, and OCR image analysis uses Tesseract, and works well only for documents with typed text or scanned PDFs with clear text. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. PaddleOCR and VLM support is also provided (see the installation instructions below).
+To extract text from documents, the 'Local' options are PikePDF for PDFs with selectable text, and OCR with Tesseract. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. PaddleOCR and VLM support is also provided (see the installation instructions below).
 
 For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.
app.py CHANGED
@@ -1034,7 +1034,7 @@ with blocks:
 
 Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide]({USER_GUIDE_URL}) for a full walkthrough of all the features in the app.
 
-To identify text in documents, the 'Local' text extraction uses PikePDF, and OCR image analysis uses Tesseract, and works well only for documents with typed text or scanned PDFs with clear text. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.
+To extract text from documents, the 'Local' options are PikePDF for PDFs with selectable text, and OCR with Tesseract. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.
 
 Additional options on the 'Redaction settings' include, the type of information to redact (e.g. people, places), custom terms to include/ exclude from redaction, fuzzy matching, language settings, and whole page redaction. After redaction is complete, you can view and modify suggested redactions on the 'Review redactions' tab to quickly create a final redacted document.
 
@@ -1688,7 +1688,9 @@ with blocks:
 
     with gr.Row(equal_height=True):
         multi_word_search_text = gr.Textbox(
-            label="Multi-word text search", value="", scale=4
+            label="Multi-word text search (regex enabled below)",
+            value="",
+            scale=4,
         )
         multi_word_search_text_btn = gr.Button(
             value="Search", scale=1
@@ -4756,7 +4758,8 @@ with blocks:
 
         duplicate_files_out,
         full_duplicate_data_by_file,
     ],
-    api_name="word_level_ocr_text_search")
+    api_name="word_level_ocr_text_search",
+)
 
 # Clicking on a cell in the redact items table will take you to that page
 all_page_line_level_ocr_results_with_words_df.select(
src/app_settings.qmd CHANGED
@@ -173,6 +173,10 @@ Configurations for the Gradio UI, server behavior, and application limits.
     * **Description:** If set to `"True"`, the application will be served via FastAPI, allowing for API endpoint integration.
     * **Default Value:** `"False"`
 
+* **`RUN_MCP_SERVER`**
+    * **Description:** If set to `"True"`, the application will run as an MCP (Model Context Protocol) server.
+    * **Default Value:** `"False"`
+
 * **`GRADIO_SERVER_NAME`**
     * **Description:** The IP address the Gradio server will bind to. Use `"0.0.0.0"` to allow external access.
     * **Default Value:** `"0.0.0.0"`
@@ -347,6 +351,10 @@ Configurations related to text extraction, PII detection, and the redaction proc
     * **Description:** Saves images with PaddleOCR's detected bounding boxes overlaid.
     * **Default Value:** `"False"`
 
+* **`SAVE_WORD_SEGMENTER_OUTPUT_IMAGES`**
+    * **Description:** If set to `"True"`, saves output images from the word segmenter for debugging purposes.
+    * **Default Value:** `"False"`
+
 * **`PREPROCESS_LOCAL_OCR_IMAGES`**
     * **Description:** If set to `"True"`, images will be preprocessed before local OCR. Can slow down processing.
     * **Default Value:** `"True"`
@@ -367,6 +375,10 @@ Configurations related to text extraction, PII detection, and the redaction proc
     * **Description:** Tesseract PSM (Page Segmentation Mode) level to use for OCR. Valid values are 0-13.
     * **Default Value:** `11`
 
+* **`TESSERACT_WORD_LEVEL_OCR`**
+    * **Description:** If set to `"True"`, uses Tesseract word-level OCR instead of line-level.
+    * **Default Value:** `"True"`
+
 * **`CONVERT_LINE_TO_WORD_LEVEL`**
     * **Description:** If set to `"True"`, converts PaddleOCR line-level OCR results to word-level for better precision.
     * **Default Value:** `"False"`
src/user_guide.qmd CHANGED
@@ -194,6 +194,8 @@ To import this to use with your redaction tasks, go to the 'Redaction settings'
 
 Say you wanted to remove specific terms from a document. In this app you can do this by providing a custom deny list as a csv. Like for the allow list described above, this should be a one-column csv without a column header. The app will suggest each individual term in the list with exact spelling as whole words. So it won't select text from within words. To enable this feature, the 'CUSTOM' tag needs to be chosen as a redaction entity [(the process for adding/removing entity types to redact is described below)](#redacting-additional-types-of-personal-information).
 
+**NOTE:** As of version 1.5.2, you can now provide deny list terms as regex patterns.
+
 Here is an example using the [Partnership Agreement Toolkit file](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf). This is an [example of a custom deny list file](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/allow_list/partnership_toolkit_redact_custom_deny_list.csv). 'Sister', 'Sister City'
 'Sister Cities', 'Friendship City' have been listed as specific terms to redact. You can see the outputs of this redaction process on the review page:
 
@@ -367,7 +369,7 @@ The workflow is designed to be simple: **Search → Select → Redact**.
 
 1. Navigate to the **"Search text to make new redactions"** tab.
 2. The main table will initially be populated with all the text extracted from the document for a page, broken down by word.
-3. To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find (this will search the whole document). If you want to do a regex-based search, tick the 'Enable regex pattern matching' box under 'Search options' below (Note this will only be able to search for patterns in text within each cell).
+3. To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find (this will search the whole document). If you want to do a regex-based search, tick the 'Enable regex pattern matching' box under 'Search options' below.
4. Click the **"Search"** button or press Enter.
 5. The table below will update to show only the rows containing text that matches your search query.
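The step-3 behaviour above (plain-text search by default, regex when the 'Enable regex pattern matching' box is ticked) can be sketched as a small stand-in filter; the function and row structure here are hypothetical illustrations, not the app's actual code:

```python
import re


def filter_rows(rows, query, use_regex=False):
    """Filter word-level rows: case-insensitive substring by default, regex when enabled."""
    if use_regex:
        pattern = re.compile(query, re.IGNORECASE)
        return [row for row in rows if pattern.search(row["text"])]
    query_lower = query.lower()
    return [row for row in rows if query_lower in row["text"].lower()]


# Hypothetical word-level search table rows
rows = [
    {"page": 1, "text": "Sister City"},
    {"page": 2, "text": "Friendship City"},
    {"page": 2, "text": "Case ID 4852"},
]
```

A plain search for "city" keeps the first two rows, while a regex search for a four-digit pattern keeps only the ID row.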
tools/config.py CHANGED
@@ -281,7 +281,9 @@ FAVICON_PATH = get_or_create_env_var("FAVICON_PATH", "favicon.png")
 
 RUN_FASTAPI = convert_string_to_boolean(get_or_create_env_var("RUN_FASTAPI", "False"))
 
-RUN_MCP_SERVER = convert_string_to_boolean(get_or_create_env_var("RUN_MCP_SERVER", "False"))
+RUN_MCP_SERVER = convert_string_to_boolean(
+    get_or_create_env_var("RUN_MCP_SERVER", "False")
+)
 
 MAX_QUEUE_SIZE = int(get_or_create_env_var("MAX_QUEUE_SIZE", "5"))
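For context, the two helpers used in this hunk can be sketched as follows; this is an assumed implementation inferred from their names and usage here, not the repository's actual code for them:

```python
import os


def get_or_create_env_var(name: str, default: str) -> str:
    # Assumed behaviour: return the variable if set, otherwise register and return the default
    value = os.environ.get(name)
    if value is None:
        os.environ[name] = default
        value = default
    return value


def convert_string_to_boolean(value: str) -> bool:
    # Assumed behaviour: "True"-style strings become booleans, case-insensitively
    return value.strip().lower() in ("true", "1", "yes")


RUN_MCP_SERVER = convert_string_to_boolean(
    get_or_create_env_var("RUN_MCP_SERVER", "False")
)
```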
tools/custom_image_analyser_engine.py CHANGED
@@ -1240,7 +1240,7 @@ class CustomImageAnalyzerEngine:
         print(
             f"Warning: Image dimension mismatch! Expected {image_width}x{image_height}, but got {actual_width}x{actual_height}"
         )
-        #print(f"Using actual dimensions: {actual_width}x{actual_height}")
+        # print(f"Using actual dimensions: {actual_width}x{actual_height}")
         # Update to use actual dimensions
         image_width = actual_width
         image_height = actual_height
@@ -1598,26 +1598,28 @@ class CustomImageAnalyzerEngine:
     # Calculate line-level bounding boxes and average confidence
     def _calculate_line_bbox(self, group):
         # Get the leftmost and rightmost positions
-        left = group['left'].min()
-        top = group['top'].min()
-        right = (group['left'] + group['width']).max()
-        bottom = (group['top'] + group['height']).max()
+        left = group["left"].min()
+        top = group["top"].min()
+        right = (group["left"] + group["width"]).max()
+        bottom = (group["top"] + group["height"]).max()
 
         # Calculate width and height
         width = right - left
         height = bottom - top
 
         # Calculate average confidence
-        avg_conf = round(group['conf'].mean(), 0)
-
-        return pd.Series({
-            'text': ' '.join(group['text'].astype(str).tolist()),
-            'left': left,
-            'top': top,
-            'width': width,
-            'height': height,
-            'conf': avg_conf
-        })
+        avg_conf = round(group["conf"].mean(), 0)
+
+        return pd.Series(
+            {
+                "text": " ".join(group["text"].astype(str).tolist()),
+                "left": left,
+                "top": top,
+                "width": width,
+                "height": height,
+                "conf": avg_conf,
+            }
+        )
 
     def _perform_hybrid_ocr(
         self,
@@ -1628,7 +1630,7 @@ class CustomImageAnalyzerEngine:
         image_name: str = "unknown_image_name",
     ) -> Dict[str, list]:
         """
-        Performs hybrid OCR on an image using Tesseract for initial OCR and PaddleOCR/VLM to enhance
+        Performs hybrid OCR on an image using Tesseract for initial OCR and PaddleOCR/VLM to enhance
         results for low-confidence or uncertain words.
 
         Args:
@@ -1637,12 +1639,12 @@ class CustomImageAnalyzerEngine:
             re-analyzed with secondary OCR (PaddleOCR/VLM). Defaults to HYBRID_OCR_CONFIDENCE_THRESHOLD.
             padding (int, optional): Pixel padding (in all directions) to add around each word box when
             cropping for secondary OCR. Defaults to HYBRID_OCR_PADDING.
-            ocr (Optional[Any], optional): An instance of the PaddleOCR or VLM engine. If None, will use the
+            ocr (Optional[Any], optional): An instance of the PaddleOCR or VLM engine. If None, will use the
             instance's `paddle_ocr` attribute if available. Only necessary for PaddleOCR-based pipelines.
             image_name (str, optional): Optional name of the image, useful for debugging and visualization.
 
         Returns:
-            Dict[str, list]: OCR results in the dictionary format of pytesseract.image_to_data (keys:
+            Dict[str, list]: OCR results in the dictionary format of pytesseract.image_to_data (keys:
             'text', 'left', 'top', 'width', 'height', 'conf', 'model', ...).
         """
         # Determine if we're using VLM or PaddleOCR
@@ -1657,36 +1659,36 @@ class CustomImageAnalyzerEngine:
                 "No OCR object provided and 'paddle_ocr' is not initialized."
             )
 
-        #print("Starting hybrid OCR process...")
+        # print("Starting hybrid OCR process...")
 
-        # 1. Get initial word-level results from Tesseract
+        # 1. Get initial word-level results from Tesseract
         tesseract_data = pytesseract.image_to_data(
            image,
             output_type=pytesseract.Output.DICT,
             config=self.tesseract_config,
             lang=self.tesseract_lang,
         )
 
         if TESSERACT_WORD_LEVEL_OCR is False:
             ocr_df = pd.DataFrame(tesseract_data)
 
             # Filter out invalid entries (confidence == -1)
             ocr_df = ocr_df[ocr_df.conf != -1]
 
             # Group by line and aggregate text
-            line_groups = ocr_df.groupby(['block_num', 'par_num', 'line_num'])
+            line_groups = ocr_df.groupby(["block_num", "par_num", "line_num"])
 
             ocr_data = line_groups.apply(self._calculate_line_bbox).reset_index()
 
             # Overwrite tesseract_data with the aggregated data
             tesseract_data = {
-                'text': ocr_data['text'].tolist(),
-                'left': ocr_data['left'].astype(int).tolist(),
-                'top': ocr_data['top'].astype(int).tolist(),
-                'width': ocr_data['width'].astype(int).tolist(),
-                'height': ocr_data['height'].astype(int).tolist(),
-                'conf': ocr_data['conf'].tolist(),
-                'model': ['Tesseract'] * len(ocr_data)  # Add model field
+                "text": ocr_data["text"].tolist(),
+                "left": ocr_data["left"].astype(int).tolist(),
+                "top": ocr_data["top"].astype(int).tolist(),
+                "width": ocr_data["width"].astype(int).tolist(),
+                "height": ocr_data["height"].astype(int).tolist(),
+                "conf": ocr_data["conf"].tolist(),
+                "model": ["Tesseract"] * len(ocr_data),  # Add model field
             }
 
         final_data = {
@@ -2262,24 +2264,24 @@ class CustomImageAnalyzerEngine:
 
         if TESSERACT_WORD_LEVEL_OCR is False:
             ocr_df = pd.DataFrame(ocr_data)
 
             # Filter out invalid entries (confidence == -1)
             ocr_df = ocr_df[ocr_df.conf != -1]
 
             # Group by line and aggregate text
-            line_groups = ocr_df.groupby(['block_num', 'par_num', 'line_num'])
-
-            ocr_data = line_groups.apply(self._calculate_line_bbox).reset_index()
+            line_groups = ocr_df.groupby(["block_num", "par_num", "line_num"])
+
+            ocr_data = line_groups.apply(self._calculate_line_bbox).reset_index()
 
             # Convert DataFrame to dictionary of lists format expected by downstream code
             ocr_data = {
-                'text': ocr_data['text'].tolist(),
-                'left': ocr_data['left'].astype(int).tolist(),
-                'top': ocr_data['top'].astype(int).tolist(),
-                'width': ocr_data['width'].astype(int).tolist(),
-                'height': ocr_data['height'].astype(int).tolist(),
-                'conf': ocr_data['conf'].tolist(),
-                'model': ['Tesseract'] * len(ocr_data)  # Add model field
+                "text": ocr_data["text"].tolist(),
+                "left": ocr_data["left"].astype(int).tolist(),
+                "top": ocr_data["top"].astype(int).tolist(),
+                "width": ocr_data["width"].astype(int).tolist(),
+                "height": ocr_data["height"].astype(int).tolist(),
+                "conf": ocr_data["conf"].tolist(),
+                "model": ["Tesseract"] * len(ocr_data),  # Add model field
             }
 
         elif self.ocr_engine == "paddle" or self.ocr_engine == "hybrid-paddle-vlm":
@@ -2457,8 +2459,8 @@ class CustomImageAnalyzerEngine:
 
         # Convert line-level results to word-level if configured and needed
         if CONVERT_LINE_TO_WORD_LEVEL and self._is_line_level_data(ocr_data):
-            #print("Converting line-level OCR results to word-level...")
-
+            # print("Converting line-level OCR results to word-level...")
+
             # Check if coordinates need to be scaled to match the image we're cropping from
             # For PaddleOCR: _convert_paddle_to_tesseract_format converts coordinates to original image space
             # - If PaddleOCR processed the original image (image_path provided), crop from original image (no scaling)
@@ -2496,7 +2498,10 @@ class CustomImageAnalyzerEngine:
         elif self.ocr_engine == "tesseract":
             # For Tesseract: if scale_factor != 1.0, rescale_ocr_data converted coordinates to original space
             # So we need to crop from the original image, not the preprocessed image
-            if scale_factor != 1.0 and original_image_for_visualization is not None:
+            if (
+                scale_factor != 1.0
+                and original_image_for_visualization is not None
+            ):
                 # Coordinates are in original space, so crop from original image
                 crop_image = original_image_for_visualization
                 crop_image_width = original_image_width
@@ -2589,7 +2594,6 @@ class CustomImageAnalyzerEngine:
         def get_model(idx):
             return default_model
 
-
         output = [
             OCRResult(
                 text=clean_unicode_text(ocr_result["text"][i]),
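The `TESSERACT_WORD_LEVEL_OCR is False` branches above group word boxes into line boxes with pandas; the same aggregation can be sketched dependency-free (a simplified stand-in for `_calculate_line_bbox`, not the class's exact code):

```python
from collections import defaultdict


def aggregate_words_to_lines(ocr_data):
    """Group Tesseract word boxes by (block, par, line) and merge each group into one line box."""
    groups = defaultdict(list)
    for i in range(len(ocr_data["text"])):
        if ocr_data["conf"][i] == -1:  # skip invalid Tesseract entries
            continue
        key = (ocr_data["block_num"][i], ocr_data["par_num"][i], ocr_data["line_num"][i])
        groups[key].append(i)
    lines = []
    for key in sorted(groups):
        idxs = groups[key]
        # Line box spans from the leftmost/topmost word edge to the rightmost/bottommost one
        left = min(ocr_data["left"][i] for i in idxs)
        top = min(ocr_data["top"][i] for i in idxs)
        right = max(ocr_data["left"][i] + ocr_data["width"][i] for i in idxs)
        bottom = max(ocr_data["top"][i] + ocr_data["height"][i] for i in idxs)
        conf = round(sum(ocr_data["conf"][i] for i in idxs) / len(idxs), 0)
        text = " ".join(str(ocr_data["text"][i]) for i in idxs)
        lines.append({"text": text, "left": left, "top": top,
                      "width": right - left, "height": bottom - top, "conf": conf})
    return lines
```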
tools/find_duplicate_pages.py CHANGED
@@ -854,21 +854,83 @@ def find_consecutive_sequence_matches(
854
  reference_tokens = reference_df["text_clean"].tolist()
855
  reference_indices = reference_df.index.tolist()
856
 
857
- # Join tokens with spaces to reconstruct the text
858
- # Note: If tokens were split at special characters like @, this may not perfectly reconstruct
859
- # the original text, but it's the best we can do with tokenized data
860
- reference_text = " ".join(reference_tokens)
861
-
862
- # Build a mapping from character positions to token indices
863
- # This helps us map regex match positions back to token indices
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
864
  char_to_token_map = []
865
  current_pos = 0
 
866
  for idx, token in enumerate(reference_tokens):
867
- token_start = current_pos
868
- token_end = current_pos + len(token)
869
- char_to_token_map.append((token_start, token_end, reference_indices[idx]))
870
- # Add 1 for the space separator (except after last token)
871
- current_pos = token_end + (1 if idx < len(reference_tokens) - 1 else 0)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
872
 
873
  # Find all regex matches
874
  try:
@@ -891,21 +953,49 @@ def find_consecutive_sequence_matches(
891
  all_found_matches = []
892
  query_index = search_df.index[0] # Use the first (and only) query index
893
 
894
- # For each regex match, find which tokens it spans
 
 
 
 
 
 
 
 
 
 
 
 
 
 
895
  for match in matches:
896
  match_start = match.start()
897
  match_end = match.end()
898
 
899
  # Find all tokens that overlap with this match
 
 
 
 
900
  matching_token_indices = []
901
  for token_start, token_end, token_idx in char_to_token_map:
902
- # Check if token overlaps with match
903
- if not (token_end < match_start or token_start > match_end):
 
 
 
 
 
 
 
904
  matching_token_indices.append(token_idx)
905
 
906
- # Create matches for all tokens in the span
 
907
  for token_idx in matching_token_indices:
908
- all_found_matches.append((query_index, token_idx, 1))
 
 
909
 
910
  print(
911
  f"Found {len(matches)} regex match(es) spanning {len(set(idx for _, idx, _ in all_found_matches))} token(s)"
 
854
  reference_tokens = reference_df["text_clean"].tolist()
855
  reference_indices = reference_df.index.tolist()
856
 
857
+ # Concatenate ALL tokens into a single continuous string with smart spacing
858
+ # Rules:
859
+ # - Words are joined with single spaces
860
+ # - Punctuation (periods, commas, etc.) touches adjacent tokens directly (no spaces)
861
+ # Example: ["Hi", ".", "How", "are", "you", "?", "Great"] -> "Hi.How are you?Great"
862
+ # This allows regex patterns to span multiple tokens naturally while preserving word boundaries
863
+
864
+ def is_punctuation_only(token):
865
+ """Check if token contains only punctuation characters"""
866
+ if not token:
867
+ return False
868
+ # Check if all characters are punctuation (using string.punctuation or our set)
869
+ import string
870
+
871
+ return all(c in string.punctuation for c in token)
872
+
873
+ def starts_with_punctuation(token):
874
+ """Check if token starts with punctuation"""
875
+ if not token:
876
+ return False
877
+ import string
878
+
879
+ return token[0] in string.punctuation
880
+
881
+ def ends_with_punctuation(token):
882
+ """Check if token ends with punctuation"""
883
+ if not token:
884
+ return False
885
+ import string
886
+
887
+ return token[-1] in string.punctuation
888
+
889
+ # Build the concatenated string and position mapping
890
+ reference_text_parts = []
891
  char_to_token_map = []
892
  current_pos = 0
893
+
894
  for idx, token in enumerate(reference_tokens):
895
+ # Determine if we need a space before this token
896
+ needs_space_before = False
897
+ if idx > 0: # Not the first token
898
+ prev_token = reference_tokens[idx - 1]
899
+ # Add space if:
900
+ # - Current token is not punctuation-only AND
901
+ # - Previous token is not punctuation-only AND
902
+ # - Previous token didn't end with punctuation AND
903
+ # - Current token doesn't start with punctuation
904
+ if (
905
+ not is_punctuation_only(token)
906
+ and not is_punctuation_only(prev_token)
907
+ and not ends_with_punctuation(prev_token)
908
+ and not starts_with_punctuation(token)
909
+ ):
910
+ needs_space_before = True
911
+
912
+ # Add space if needed
913
+ if needs_space_before:
914
+ current_pos += 1 # Account for the space
915
+
916
+ # Record token position in the concatenated string
917
+ token_start_in_text = current_pos
918
+ token_end_in_text = current_pos + len(token)
919
+ char_to_token_map.append(
920
+ (token_start_in_text, token_end_in_text, reference_indices[idx])
921
+ )
922
+
923
+ # Add token to the concatenated string
924
+ if needs_space_before:
925
+ reference_text_parts.append(" " + token)
926
+ else:
927
+ reference_text_parts.append(token)
928
+
929
+ # Move position forward by token length (and space if added)
930
+ current_pos = token_end_in_text
931
+
932
+ # Join all parts to create the final concatenated string
933
+ reference_text = "".join(reference_text_parts)
934
 
         # Find all regex matches
         try:

             all_found_matches = []
             query_index = search_df.index[0]  # Use the first (and only) query index

+            # Optimize overlap detection for large documents
+            # Instead of checking every token for every match (O(m*n)), we can use the fact that
+            # char_to_token_map is sorted by position. For each match, we only need to check
+            # tokens that could possibly overlap.
+
+            # For each regex match found in the concatenated string:
+            # 1. Get the match's start and end character positions
+            # 2. Find all tokens whose character ranges overlap with the match
+            # 3. Include all overlapping tokens in the results
+            # This ensures patterns spanning multiple tokens are captured correctly
+
+            # Optimization: Use a set to track which tokens we've already found
+            # This prevents duplicates if multiple matches overlap the same tokens
+            found_token_indices = set()
+
             for match in matches:
                 match_start = match.start()
                 match_end = match.end()

                 # Find all tokens that overlap with this match
+                # A token overlaps if: token_start < match_end AND token_end > match_start
+                # Optimization: Since char_to_token_map is sorted by start position,
+                # we can stop early once we pass match_end, but we still need to check
+                # tokens that start before match_end (they might extend into the match)
                 matching_token_indices = []
                 for token_start, token_end, token_idx in char_to_token_map:
+                    # Early exit optimization: if token starts after match ends, no more overlaps possible
+                    # (This works because tokens are processed in order)
+                    if token_start >= match_end:
+                        break
+
+                    # Check if token overlaps with match (not disjoint)
+                    if (
+                        token_end > match_start
+                    ):  # token_start < match_end already checked by break above
                         matching_token_indices.append(token_idx)

+                # Create matches for all tokens that overlap with the regex match
+                # This ensures patterns spanning multiple tokens are captured
                 for token_idx in matching_token_indices:
+                    if token_idx not in found_token_indices:
+                        all_found_matches.append((query_index, token_idx, 1))
+                        found_token_indices.add(token_idx)

             print(
                 f"Found {len(matches)} regex match(es) spanning {len(set(idx for _, idx, _ in all_found_matches))} token(s)"
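The overlap scan in this hunk can be exercised in isolation. A minimal sketch of the same logic (`tokens_overlapping` is a hypothetical name; it assumes `char_to_token_map` is a list of `(start, end, idx)` tuples sorted by token start, as the commit's comments describe):

```python
def tokens_overlapping(char_to_token_map, match_start, match_end):
    """Return indices of tokens whose [start, end) span overlaps the
    match span [match_start, match_end). Relies on the map being sorted
    by token start so the scan can stop once a token begins at or after
    match_end."""
    hits = []
    for token_start, token_end, token_idx in char_to_token_map:
        if token_start >= match_end:
            break  # sorted input: no later token can overlap
        if token_end > match_start:
            hits.append(token_idx)
    return hits

# "John" spans chars 0-4, "Smith" spans 5-10; a match over chars 2-7 hits both
token_map = [(0, 4, 0), (5, 10, 1)]
print(tokens_overlapping(token_map, 2, 7))  # → [0, 1]
```

A match that falls entirely inside the single-space separator (e.g. chars 4 to 5 here) overlaps neither token and returns an empty list, which is why patterns matching only whitespace produce no redactions.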
tools/load_spacy_model_custom_recognisers.py CHANGED
@@ -352,19 +352,139 @@ def download_tesseract_lang_pack(
 
 
 #### Custom recognisers
+def _is_regex_pattern(term: str) -> bool:
+    """
+    Detect if a term is intended to be a regex pattern or a literal string.
+
+    Args:
+        term: The term to check
+
+    Returns:
+        True if the term appears to be a regex pattern, False if it's a literal string
+    """
+    term = term.strip()
+    if not term:
+        return False
+
+    # First, try to compile as regex to validate it
+    # This catches patterns like \d\d\d-\d\d\d that use regex escape sequences
+    try:
+        re.compile(term)
+        is_valid_regex = True
+    except re.error:
+        # If it doesn't compile as regex, treat as literal
+        return False
+
+    # If it compiles, check if it contains regex-like features
+    # Regex metacharacters that suggest a pattern (excluding escaped literals)
+    regex_metacharacters = [
+        "+",
+        "*",
+        "?",
+        "{",
+        "}",
+        "[",
+        "]",
+        "(",
+        ")",
+        "|",
+        "^",
+        "$",
+        ".",
+    ]
+
+    # Common regex escape sequences that indicate regex intent
+    regex_escape_sequences = [
+        "\\d",
+        "\\w",
+        "\\s",
+        "\\D",
+        "\\W",
+        "\\S",
+        "\\b",
+        "\\B",
+        "\\n",
+        "\\t",
+        "\\r",
+    ]
+
+    # Check if term contains regex metacharacters or escape sequences
+    has_metacharacters = False
+    has_escape_sequences = False
+
+    i = 0
+    while i < len(term):
+        if term[i] == "\\" and i + 1 < len(term):
+            # Check if it's a regex escape sequence
+            escape_seq = term[i : i + 2]
+            if escape_seq in regex_escape_sequences:
+                has_escape_sequences = True
+                # Skip the escape sequence (backslash + next char)
+                i += 2
+                continue
+        if term[i] in regex_metacharacters:
+            has_metacharacters = True
+        i += 1
+
+    # If it's a valid regex and contains regex features, treat as regex pattern
+    if is_valid_regex and (has_metacharacters or has_escape_sequences):
+        return True
+
+    # If it compiles but has no regex features, it might be a literal that happens to compile
+    # (e.g., "test" compiles as regex but is just literal text)
+    # In this case, if it has escape sequences, it's definitely regex
+    if has_escape_sequences:
+        return True
+
+    # Otherwise, treat as literal
+    return False
+
+
 def custom_word_list_recogniser(custom_list: List[str] = list()):
     # Create regex pattern, handling quotes carefully
+    # Supports both literal strings and regex patterns
 
     quote_str = '"'
     replace_str = '(?:"|“|”)'
 
-    custom_regex = "|".join(
-        rf"(?<!\w){re.escape(term.strip()).replace(quote_str, replace_str)}(?!\w)"
-        for term in custom_list
-    )
-    # print(custom_regex)
+    regex_patterns = []
+    literal_patterns = []
+
+    # Separate regex patterns from literal strings
+    for term in custom_list:
+        term = term.strip()
+        if not term:
+            continue
+
+        if _is_regex_pattern(term):
+            # Use regex pattern as-is (but wrap with word boundaries if appropriate)
+            # Note: Word boundaries might not be appropriate for all regex patterns
+            # (e.g., email patterns), so we'll add them conditionally
+            regex_patterns.append(term)
+        else:
+            # Escape literal strings and add word boundaries
+            escaped_term = re.escape(term).replace(quote_str, replace_str)
+            literal_patterns.append(rf"(?<!\w){escaped_term}(?!\w)")
+
+    # Combine patterns: regex patterns first, then literal patterns
+    all_patterns = []
 
-    custom_pattern = Pattern(name="custom_pattern", regex=custom_regex, score=1)
+    # Add regex patterns (without word boundaries, as they may have their own)
+    for pattern in regex_patterns:
+        all_patterns.append(f"({pattern})")
+
+    # Add literal patterns (with word boundaries)
+    all_patterns.extend(literal_patterns)
+
+    if not all_patterns:
+        # Return empty recognizer if no patterns
+        custom_pattern = Pattern(
+            name="custom_pattern", regex="(?!)", score=1
+        )  # Never matches
+    else:
+        custom_regex = "|".join(all_patterns)
+        # print(custom_regex)
+        custom_pattern = Pattern(name="custom_pattern", regex=custom_regex, score=1)
 
     custom_recogniser = PatternRecognizer(
         supported_entity="CUSTOM",
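The intent of the new `_is_regex_pattern` heuristic can be checked with a short standalone sketch (`looks_like_regex` is a simplified re-implementation of the logic above for illustration, not an import from the module: a term counts as a regex only if it compiles and contains a metacharacter or a known escape sequence):

```python
import re

REGEX_META = set("+*?{}[]()|^$.")
REGEX_ESCAPES = {r"\d", r"\w", r"\s", r"\D", r"\W", r"\S", r"\b", r"\B", r"\n", r"\t", r"\r"}

def looks_like_regex(term: str) -> bool:
    """Return True if term should be treated as a regex deny-list entry,
    False if it should be escaped and matched literally."""
    term = term.strip()
    if not term:
        return False
    try:
        re.compile(term)
    except re.error:
        return False  # invalid regex: fall back to literal matching
    i, meta, esc = 0, False, False
    while i < len(term):
        # Recognise two-character escape sequences like \d before
        # checking single metacharacters
        if term[i] == "\\" and i + 1 < len(term) and term[i:i + 2] in REGEX_ESCAPES:
            esc = True
            i += 2
            continue
        if term[i] in REGEX_META:
            meta = True
        i += 1
    return meta or esc

print(looks_like_regex(r"\d\d\d-\d\d\d"))  # → True  (treated as a pattern)
print(looks_like_regex("John Smith"))      # → False (escaped, matched literally)
```

Note the trade-off this implies for deny lists: a literal entry containing `.` or `(` (e.g. a company name like "Acme (UK)") will be classified as a regex, so such entries need their metacharacters escaped by the user.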