Commit 3187788 · Parent(s): ad2d759
Dropdown choices for redactions are now listed correctly
- README.md +3 -5
- app.py +1 -5
- tools/redaction_review.py +20 -5
README.md
CHANGED
@@ -1,8 +1,8 @@
 ---
 title: Document redaction
-emoji:
+emoji: π
 colorFrom: blue
-colorTo:
+colorTo: yellow
 sdk: docker
 app_file: app.py
 pinned: false
@@ -12,9 +12,7 @@ license: agpl-3.0
 
 Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a walkthrough on how to use the app. Below is a very brief overview.
 
-To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting.
-
-Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
+To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
 
 After redaction, review suggested redactions on the 'Review redactions' tab. The original pdf can be uploaded here alongside a '...redaction_file.csv' to continue a previous redaction/review task. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or custom terms to always include/ exclude from redaction.
 
app.py
CHANGED
@@ -41,8 +41,6 @@ full_entity_list = ["TITLES", "PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "STREET
 
 language = 'en'
 
-
-
 host_name = socket.gethostname()
 feedback_logs_folder = 'feedback/' + today_rev + '/' + host_name + '/'
 access_logs_folder = 'logs/' + today_rev + '/' + host_name + '/'
@@ -160,9 +158,7 @@ with app:
 
 Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use the app. Below is a very brief overview.
 
-To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting.
-
-Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
+To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
 
 After redaction, review suggested redactions on the 'Review redactions' tab. The original pdf can be uploaded here alongside a '...redaction_file.csv' to continue a previous redaction/review task. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or custom terms to always include/ exclude from redaction.
 
tools/redaction_review.py
CHANGED
@@ -74,22 +74,30 @@ def remove_duplicate_images_with_blank_boxes(data: List[dict]) -> List[dict]:
     return result
 
 def get_recogniser_dataframe_out(image_annotator_object, recogniser_dataframe_gr):
+    recogniser_entities_list = ["Redaction"]
+    recogniser_entities_drop = gr.Dropdown(value="", choices=[""], allow_custom_value=True, interactive=True)
+    recogniser_dataframe_out = recogniser_dataframe_gr
+
     try:
         review_dataframe = convert_review_json_to_pandas_df(image_annotator_object)[["page", "label"]]
         recogniser_entities = review_dataframe["label"].unique().tolist()
         recogniser_entities.append("ALL")
-
+        recogniser_entities_for_drop = sorted(recogniser_entities)
+
 
         recogniser_dataframe_out = gr.Dataframe(review_dataframe)
-        recogniser_entities_drop = gr.Dropdown(value=
+        recogniser_entities_drop = gr.Dropdown(value=recogniser_entities_for_drop[0], choices=recogniser_entities_for_drop, allow_custom_value=True, interactive=True)
+
+        recogniser_entities_list = [entity for entity in recogniser_entities_for_drop if entity != 'Redaction' and entity != 'ALL'] # Remove any existing 'Redaction'
+        recogniser_entities_list.insert(0, 'Redaction') # Add 'Redaction' to the start of the list
 
     except Exception as e:
         print("Could not extract recogniser information:", e)
         recogniser_dataframe_out = recogniser_dataframe_gr
         recogniser_entities_drop = gr.Dropdown(value="", choices=[""], allow_custom_value=True, interactive=True)
-
+        recogniser_entities_list = ["Redaction"]
 
-    return recogniser_dataframe_out, recogniser_dataframe_out, recogniser_entities_drop,
+    return recogniser_dataframe_out, recogniser_dataframe_out, recogniser_entities_drop, recogniser_entities_list
 
 def update_annotator(image_annotator_object:AnnotatedImageData, page_num:int, recogniser_entities_drop=gr.Dropdown(value="ALL", allow_custom_value=True), recogniser_dataframe_gr=gr.Dataframe(pd.DataFrame(data={"page":[], "label":[]})), zoom:int=100):
     '''
@@ -105,8 +113,15 @@ def update_annotator(image_annotator_object:AnnotatedImageData, page_num:int, re
     else:
         review_dataframe = update_entities_df(recogniser_entities_drop, recogniser_dataframe_gr)
         recogniser_dataframe_out = gr.Dataframe(review_dataframe)
-        recogniser_entities_list =
+        recogniser_entities_list = recogniser_dataframe_gr["label"].unique().tolist()
+
+    print("recogniser_entities_list all options:", recogniser_entities_list)
+
     recogniser_entities_list = sorted(recogniser_entities_list)
+    recogniser_entities_list = [entity for entity in recogniser_entities_list if entity != 'Redaction'] # Remove any existing 'Redaction'
+    recogniser_entities_list.insert(0, 'Redaction') # Add 'Redaction' to the start of the list
+
+    print("recogniser_entities_list:", recogniser_entities_list)
 
     zoom_str = str(zoom) + '%'
     recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]
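The list handling introduced in tools/redaction_review.py can be sketched in isolation as follows. This is a minimal illustration, not code from the repository: the helper name `build_entity_choices` is hypothetical, and it uses a set to de-duplicate labels where the commit uses pandas `unique().tolist()` followed by `append("ALL")`. It shows the two lists the commit appears to build: sorted dropdown choices that include "ALL", and an annotator label list with "Redaction" moved to the front and "ALL" left out.

```python
from typing import List, Tuple


def build_entity_choices(labels: List[str]) -> Tuple[List[str], List[str]]:
    """Return (dropdown_choices, annotator_labels) from raw review labels.

    Hypothetical helper for illustration; the commit does this inline.
    """
    # Unique labels plus the "ALL" filter option, sorted for the dropdown.
    dropdown_choices = sorted(set(labels) | {"ALL"})

    # Drop "ALL" and any existing "Redaction", then put "Redaction" first
    # so it is always the first label offered in the annotator.
    annotator_labels = [
        entity for entity in dropdown_choices if entity not in ("Redaction", "ALL")
    ]
    annotator_labels.insert(0, "Redaction")
    return dropdown_choices, annotator_labels


choices, annotator_labels = build_entity_choices(
    ["PERSON", "Redaction", "EMAIL_ADDRESS", "PERSON"]
)
print(choices)           # ['ALL', 'EMAIL_ADDRESS', 'PERSON', 'Redaction']
print(annotator_labels)  # ['Redaction', 'EMAIL_ADDRESS', 'PERSON']
```

With an empty input the helper returns `["ALL"]` and `["Redaction"]`, which matches the `["Redaction"]` fallback the commit sets before the `try` block and in the `except` branch.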