Spaces: seanpedrickcase/document_redaction

seanpedrickcase committed on Jan 16

Commit 3187788 (1 parent: ad2d759)

Dropdown choices for redactions are now listed correctly

Files changed (3)

README.md +3 -5
app.py +1 -5
tools/redaction_review.py +20 -5

README.md CHANGED

@@ -1,8 +1,8 @@
 ---
 title: Document redaction
-emoji: 😎
 colorFrom: blue
-colorTo: green
 sdk: docker
 app_file: app.py
 pinned: false
@@ -12,9 +12,7 @@ license: agpl-3.0
 Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a walkthrough on how to use the app. Below is a very brief overview.
-To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting.
-Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
 After redaction, review suggested redactions on the 'Review redactions' tab. The original pdf can be uploaded here alongside a '...redaction_file.csv' to continue a previous redaction/review task. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or custom terms to always include/ exclude from redaction.

 ---
 title: Document redaction
+emoji: 📝
 colorFrom: blue
+colorTo: yellow
 sdk: docker
 app_file: app.py
 pinned: false
 Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a walkthrough on how to use the app. Below is a very brief overview.
+To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
 After redaction, review suggested redactions on the 'Review redactions' tab. The original pdf can be uploaded here alongside a '...redaction_file.csv' to continue a previous redaction/review task. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or custom terms to always include/ exclude from redaction.

app.py CHANGED

@@ -41,8 +41,6 @@ full_entity_list = ["TITLES", "PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "STREET
 language = 'en'
 host_name = socket.gethostname()
 feedback_logs_folder = 'feedback/' + today_rev + '/' + host_name + '/'
 access_logs_folder = 'logs/' + today_rev + '/' + host_name + '/'
@@ -160,9 +158,7 @@ with app:
     Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use the app. Below is a very brief overview.
-    To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting.
-    Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
     After redaction, review suggested redactions on the 'Review redactions' tab. The original pdf can be uploaded here alongside a '...redaction_file.csv' to continue a previous redaction/review task. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or custom terms to always include/ exclude from redaction.

 language = 'en'
 host_name = socket.gethostname()
 feedback_logs_folder = 'feedback/' + today_rev + '/' + host_name + '/'
 access_logs_folder = 'logs/' + today_rev + '/' + host_name + '/'
     Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use the app. Below is a very brief overview.
+    To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
     After redaction, review suggested redactions on the 'Review redactions' tab. The original pdf can be uploaded here alongside a '...redaction_file.csv' to continue a previous redaction/review task. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or custom terms to always include/ exclude from redaction.

tools/redaction_review.py CHANGED

@@ -74,22 +74,30 @@ def remove_duplicate_images_with_blank_boxes(data: List[dict]) -> List[dict]:
     return result
 def get_recogniser_dataframe_out(image_annotator_object, recogniser_dataframe_gr):
     try:
         review_dataframe = convert_review_json_to_pandas_df(image_annotator_object)[["page", "label"]]
         recogniser_entities = review_dataframe["label"].unique().tolist()
         recogniser_entities.append("ALL")
-        recogniser_entities = sorted(recogniser_entities)
         recogniser_dataframe_out = gr.Dataframe(review_dataframe)
-        recogniser_entities_drop = gr.Dropdown(value=recogniser_entities[0], choices=recogniser_entities, allow_custom_value=True, interactive=True)
     except Exception as e:
         print("Could not extract recogniser information:", e)
         recogniser_dataframe_out = recogniser_dataframe_gr
         recogniser_entities_drop = gr.Dropdown(value="", choices=[""], allow_custom_value=True, interactive=True)
-        recogniser_entities = ["Redaction"]
-    return recogniser_dataframe_out, recogniser_dataframe_out, recogniser_entities_drop, recogniser_entities
 def update_annotator(image_annotator_object:AnnotatedImageData, page_num:int, recogniser_entities_drop=gr.Dropdown(value="ALL", allow_custom_value=True), recogniser_dataframe_gr=gr.Dataframe(pd.DataFrame(data={"page":[], "label":[]})), zoom:int=100):
     '''
@@ -105,8 +113,15 @@ def update_annotator(image_annotator_object:AnnotatedImageData, page_num:int, re
     else:
         review_dataframe = update_entities_df(recogniser_entities_drop, recogniser_dataframe_gr)
         recogniser_dataframe_out = gr.Dataframe(review_dataframe)
-        recogniser_entities_list = review_dataframe["label"].unique().tolist()
         recogniser_entities_list = sorted(recogniser_entities_list)
     zoom_str = str(zoom) + '%'
     recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]

     return result
 def get_recogniser_dataframe_out(image_annotator_object, recogniser_dataframe_gr):
+    recogniser_entities_list = ["Redaction"]
+    recogniser_entities_drop = gr.Dropdown(value="", choices=[""], allow_custom_value=True, interactive=True)
+    recogniser_dataframe_out = recogniser_dataframe_gr
     try:
         review_dataframe = convert_review_json_to_pandas_df(image_annotator_object)[["page", "label"]]
         recogniser_entities = review_dataframe["label"].unique().tolist()
         recogniser_entities.append("ALL")
+        recogniser_entities_for_drop = sorted(recogniser_entities)
         recogniser_dataframe_out = gr.Dataframe(review_dataframe)
+        recogniser_entities_drop = gr.Dropdown(value=recogniser_entities_for_drop[0], choices=recogniser_entities_for_drop, allow_custom_value=True, interactive=True)
+        recogniser_entities_list = [entity for entity in recogniser_entities_for_drop if entity != 'Redaction' and entity != 'ALL']  # Remove any existing 'Redaction'
+        recogniser_entities_list.insert(0, 'Redaction')  # Add 'Redaction' to the start of the list
     except Exception as e:
         print("Could not extract recogniser information:", e)
         recogniser_dataframe_out = recogniser_dataframe_gr
         recogniser_entities_drop = gr.Dropdown(value="", choices=[""], allow_custom_value=True, interactive=True)
+        recogniser_entities_list = ["Redaction"]
+    return recogniser_dataframe_out, recogniser_dataframe_out, recogniser_entities_drop, recogniser_entities_list
 def update_annotator(image_annotator_object:AnnotatedImageData, page_num:int, recogniser_entities_drop=gr.Dropdown(value="ALL", allow_custom_value=True), recogniser_dataframe_gr=gr.Dataframe(pd.DataFrame(data={"page":[], "label":[]})), zoom:int=100):
     '''
     else:
         review_dataframe = update_entities_df(recogniser_entities_drop, recogniser_dataframe_gr)
         recogniser_dataframe_out = gr.Dataframe(review_dataframe)
+        recogniser_entities_list = recogniser_dataframe_gr["label"].unique().tolist()
+        print("recogniser_entities_list all options:", recogniser_entities_list)
         recogniser_entities_list = sorted(recogniser_entities_list)
+        recogniser_entities_list = [entity for entity in recogniser_entities_list if entity != 'Redaction']  # Remove any existing 'Redaction'
+        recogniser_entities_list.insert(0, 'Redaction')  # Add 'Redaction' to the start of the list
+        print("recogniser_entities_list:", recogniser_entities_list)
     zoom_str = str(zoom) + '%'
     recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]
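
The ordering change in tools/redaction_review.py can be illustrated in isolation. What follows is a minimal sketch, not the repository's code: the helper names `build_dropdown_choices` and `build_label_list` are hypothetical, and the sketch works on a plain list of labels rather than on a Gradio component or a pandas DataFrame. It assumes the intent of the commit is that the dropdown shows the sorted labels including 'ALL', while the annotator's label list starts with 'Redaction' and leaves 'ALL' out.

```python
from typing import List


def build_dropdown_choices(labels: List[str]) -> List[str]:
    """Return the unique labels plus 'ALL', sorted, for use as dropdown choices."""
    unique = list(dict.fromkeys(labels))  # de-duplicate, keeping first-seen order
    # Assumption: the commit appends 'ALL' unconditionally; this sketch
    # guards against adding it twice.
    if "ALL" not in unique:
        unique.append("ALL")
    return sorted(unique)


def build_label_list(labels: List[str]) -> List[str]:
    """Return the annotator labels: 'Redaction' first, 'ALL' excluded."""
    rest = [
        label
        for label in build_dropdown_choices(labels)
        if label not in ("Redaction", "ALL")
    ]
    return ["Redaction"] + rest


if __name__ == "__main__":
    example_labels = ["PERSON", "EMAIL_ADDRESS", "Redaction", "PERSON"]
    print(build_dropdown_choices(example_labels))
    print(build_label_list(example_labels))
```

Keeping the two lists separate means 'ALL' stays a filter option in the dropdown without becoming a label that could be assigned to a new redaction box, and an empty input still yields a usable label list containing only 'Redaction'.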