seanpedrickcase committed on
Commit 3187788 · 1 Parent(s): ad2d759

Dropdown choices for redactions are now listed correctly

Files changed (3)
  1. README.md +3 -5
  2. app.py +1 -5
  3. tools/redaction_review.py +20 -5
README.md CHANGED
@@ -1,8 +1,8 @@
 ---
 title: Document redaction
-emoji: 😎
+emoji: 📝
 colorFrom: blue
-colorTo: green
+colorTo: yellow
 sdk: docker
 app_file: app.py
 pinned: false
@@ -12,9 +12,7 @@ license: agpl-3.0
 
 Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a walkthrough on how to use the app. Below is a very brief overview.
 
-To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting.
-
-Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
+To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
 
 After redaction, review suggested redactions on the 'Review redactions' tab. The original pdf can be uploaded here alongside a '...redaction_file.csv' to continue a previous redaction/review task. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or custom terms to always include/ exclude from redaction.
 
app.py CHANGED
@@ -41,8 +41,6 @@ full_entity_list = ["TITLES", "PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "STREET
 
 language = 'en'
 
-
-
 host_name = socket.gethostname()
 feedback_logs_folder = 'feedback/' + today_rev + '/' + host_name + '/'
 access_logs_folder = 'logs/' + today_rev + '/' + host_name + '/'
@@ -160,9 +158,7 @@ with app:
 
 Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use the app. Below is a very brief overview.
 
-To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting.
-
-Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
+To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works ok for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
 
 After redaction, review suggested redactions on the 'Review redactions' tab. The original pdf can be uploaded here alongside a '...redaction_file.csv' to continue a previous redaction/review task. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or custom terms to always include/ exclude from redaction.
 
tools/redaction_review.py CHANGED
@@ -74,22 +74,30 @@ def remove_duplicate_images_with_blank_boxes(data: List[dict]) -> List[dict]:
     return result
 
 def get_recogniser_dataframe_out(image_annotator_object, recogniser_dataframe_gr):
+    recogniser_entities_list = ["Redaction"]
+    recogniser_entities_drop = gr.Dropdown(value="", choices=[""], allow_custom_value=True, interactive=True)
+    recogniser_dataframe_out = recogniser_dataframe_gr
+
     try:
         review_dataframe = convert_review_json_to_pandas_df(image_annotator_object)[["page", "label"]]
         recogniser_entities = review_dataframe["label"].unique().tolist()
         recogniser_entities.append("ALL")
-        recogniser_entities = sorted(recogniser_entities)
+        recogniser_entities_for_drop = sorted(recogniser_entities)
+
 
         recogniser_dataframe_out = gr.Dataframe(review_dataframe)
-        recogniser_entities_drop = gr.Dropdown(value=recogniser_entities[0], choices=recogniser_entities, allow_custom_value=True, interactive=True)
+        recogniser_entities_drop = gr.Dropdown(value=recogniser_entities_for_drop[0], choices=recogniser_entities_for_drop, allow_custom_value=True, interactive=True)
+
+        recogniser_entities_list = [entity for entity in recogniser_entities_for_drop if entity != 'Redaction' and entity != 'ALL'] # Remove any existing 'Redaction'
+        recogniser_entities_list.insert(0, 'Redaction') # Add 'Redaction' to the start of the list
 
     except Exception as e:
         print("Could not extract recogniser information:", e)
         recogniser_dataframe_out = recogniser_dataframe_gr
         recogniser_entities_drop = gr.Dropdown(value="", choices=[""], allow_custom_value=True, interactive=True)
-        recogniser_entities = ["Redaction"]
+        recogniser_entities_list = ["Redaction"]
 
-    return recogniser_dataframe_out, recogniser_dataframe_out, recogniser_entities_drop, recogniser_entities
+    return recogniser_dataframe_out, recogniser_dataframe_out, recogniser_entities_drop, recogniser_entities_list
 
 def update_annotator(image_annotator_object:AnnotatedImageData, page_num:int, recogniser_entities_drop=gr.Dropdown(value="ALL", allow_custom_value=True), recogniser_dataframe_gr=gr.Dataframe(pd.DataFrame(data={"page":[], "label":[]})), zoom:int=100):
     '''
@@ -105,8 +113,15 @@ def update_annotator(image_annotator_object:AnnotatedImageData, page_num:int, re
     else:
         review_dataframe = update_entities_df(recogniser_entities_drop, recogniser_dataframe_gr)
         recogniser_dataframe_out = gr.Dataframe(review_dataframe)
-        recogniser_entities_list = review_dataframe["label"].unique().tolist()
+        recogniser_entities_list = recogniser_dataframe_gr["label"].unique().tolist()
+
+        print("recogniser_entities_list all options:", recogniser_entities_list)
+
         recogniser_entities_list = sorted(recogniser_entities_list)
+        recogniser_entities_list = [entity for entity in recogniser_entities_list if entity != 'Redaction'] # Remove any existing 'Redaction'
+        recogniser_entities_list.insert(0, 'Redaction') # Add 'Redaction' to the start of the list
+
+        print("recogniser_entities_list:", recogniser_entities_list)
 
     zoom_str = str(zoom) + '%'
     recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]
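
The ordering logic in this change can be read as producing two separate lists from the same set of labels: one for the filter dropdown (sorted, including "ALL") and one for the annotator's label choices ('Redaction' first, never "ALL"). A minimal sketch of that idea follows, using plain Python lists rather than Gradio components. The helper name `build_label_choices` and its signature are hypothetical and do not exist in the repository; in the app the labels would come from the "label" column of the review dataframe.

```python
from typing import Iterable, List, Tuple


def build_label_choices(labels: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (dropdown_choices, annotator_labels) for a collection of labels.

    Hypothetical helper for illustration only; not part of the repository.
    """
    unique_labels = sorted(set(labels))

    # With no labels, fall back to the same defaults the commit sets
    # before its try block: an empty dropdown and a lone 'Redaction' label.
    if not unique_labels:
        return [""], ["Redaction"]

    # Dropdown: every label plus the "ALL" filter option, sorted.
    dropdown_choices = sorted(set(unique_labels) | {"ALL"})

    # Annotator: 'Redaction' always first, the rest sorted, "ALL" excluded.
    rest = [label for label in unique_labels if label not in ("Redaction", "ALL")]
    annotator_labels = ["Redaction"] + rest

    return dropdown_choices, annotator_labels


if __name__ == "__main__":
    drop, annot = build_label_choices(["PERSON", "Redaction", "EMAIL_ADDRESS", "PERSON"])
    print(drop)
    print(annot)
```

Keeping the two lists in separate variables avoids the situation the old code allowed, where a single `recogniser_entities` list served both purposes and so carried "ALL" into the annotator's label choices.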