Wachu2005 committed on
Commit
99502af
1 Parent(s): 1361287

Update README.md

Files changed (1)
  1. README.md +136 -39
README.md CHANGED
@@ -176,23 +176,122 @@ Looking forward to your direction on this matter.
176
 
177
 
178
 
 
 
 
179
  # Example: Process the anonymized version with GPT-4 and change the entities back
180
  ```python
181
  import torch
 
 
182
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
183
  tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
184
  model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16)
 
185
 
186
- import re
187
 
188
- def extract_last_assistant_response(input_text):
189
- # Find the occurrence of "ASSISTANT:" in the input text
190
- match = re.search(r'ASSISTANT:', input_text)
191
 
192
- # Get the index where the last "ASSISTANT:" ends
193
- start_index = match.end()
194
- response = input_text[start_index:].strip()
195
- return response
196
 
197
  import ast
198
 
@@ -208,59 +307,57 @@ def swap_keys_and_values_in_string(input_str):
208
 
209
  return swapped_str
210
 
211
- # sample text for entity extraction and resampling
212
 
213
- original_text = '''Our organization, XYZ Biotech, operates at the forefront of groundbreaking pharmaceutical research, renowned for our pioneering drug development and breakthrough treatments. In light of the ever-evolving regulatory landscape and the need to safeguard our research endeavors, we've recognized the critical importance of enhancing our compliance and data security protocols. To this end, we are on the lookout for a top-notch regulatory affairs specialist to spearhead this transformation, ensuring the rigorous adherence to industry standards and the protection of our confidential research data.
 
 
 
 
 
214
 
215
- This comprehensive initiative encompasses not only ensuring regulatory compliance but also the implementation of three vital security measures. We will be utilizing CipherGuard's state-of-the-art encryption technology to secure our research data, deploying BioShield's advanced security protocols for laboratory access, and integrating SecureLabs' real-time data monitoring and threat detection systems.
 
 
216
 
217
- The enhancement of our regulatory affairs and data security measures is a critical component in safeguarding our proprietary research, reinforcing our commitment to drug development excellence. While we prioritize compliance and data protection, the user experience for our research teams and partners will remain user-friendly and efficient, whether they are using our proprietary research software, "BioDiscover," or our mobile applications.
218
 
219
- We are actively in search of a regulatory affairs specialist who can comprehend the importance of maintaining compliance and data security in our industry and who can deliver a comprehensive, airtight solution that not only ensures our adherence to regulations but also safeguards the confidential nature of our research at XYZ Biotech.''',
220
- '''
221
 
 
222
 
223
- # a different anonymized text with replaced entities
224
 
225
- anonymized_text = '''ABC Pharmaceuticals, a renowned player in the pharmaceutical industry, is dedicated to pioneering drug development and breakthrough treatments. In response to the ever-evolving regulatory landscape and the need to protect our research initiatives, we have identified the paramount importance of enhancing our compliance and data security protocols. As a part of this strategic shift, we are actively searching for a top-tier regulatory affairs specialist to lead this transformation, ensuring unwavering adherence to industry standards and the safeguarding of our confidential research data.
226
 
227
- This comprehensive initiative goes beyond regulatory compliance and entails the implementation of three crucial security measures. We will be leveraging the cutting-edge encryption technology provided by CodeGuard to secure our research data, implementing BioProtect's advanced security protocols for laboratory access, and integrating the real-time data monitoring and threat detection systems offered by SecureTech.
228
 
229
- The enhancement of our regulatory affairs and data security measures is a pivotal component in safeguarding our proprietary research, reinforcing our commitment to excellence in drug development. While we prioritize compliance and data protection, the user experience for our research teams and partners will remain user-friendly and efficient, whether they are using our proprietary research software, "BioDiscover," or our mobile applications.
230
 
231
- We are actively seeking a regulatory affairs specialist who comprehends the critical importance of upholding compliance and data security in our industry and possesses the expertise to deliver a comprehensive and impervious solution that ensures not only our adherence to regulations but also preserves the confidentiality of our research data at ABC Pharmaceuticals.'''
232
 
 
233
 
234
- prompt = f'USER: Resample the entities: {original_text}\n\nASSISTANT:'
235
- inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
236
- outputs = model.generate(inputs.input_ids, max_new_tokens=250, do_sample=False, top_k=50, top_p=0.98, num_beams=1)
237
- output_text_1 = tokenizer.decode(outputs[0], skip_special_tokens=True)
238
 
 
239
 
240
- generated_part = extract_last_assistant_response(output_text_1)
241
 
242
- # inverting the entity map
243
- # {'XYZ Biotech': 'ABC Pharmaceuticals', 'CipherGuard': 'CodeGuard', 'BioShield': 'BioProtect', 'SecureLabs': 'SecureTech', 'BioDiscover': 'BioDiscover'}
244
- # inverted to this:
245
- # {'ABC Pharmaceuticals': 'XYZ Biotech', 'CodeGuard': 'CipherGuard', 'BioProtect': 'BioShield', 'SecureTech': 'SecureLabs', 'BioDiscover': 'BioDiscover'}
246
 
247
- inverted_entities = swap_keys_and_values_in_string(generated_part)
248
 
 
249
 
250
- prompt_2 = f"USER: Rephrase with {inverted_entities}: {anonymized_text}\n\nASSISTANT:"
251
- inputs = tokenizer(prompt_2, return_tensors='pt').to('cuda')
252
- outputs = model.generate(inputs.input_ids, max_new_tokens=500, do_sample=False, top_k=50, top_p=0.98)
253
- output_text_2 = tokenizer.decode(outputs[0], skip_special_tokens=True)
254
 
255
- print(output_text_2)
256
 
 
257
  '''
258
- XYZ Biotech, a renowned player in the biotech industry, is dedicated to pioneering drug development and breakthrough treatments. In response to the ever-evolving regulatory landscape and the need to protect our research initiatives, we have identified the paramount importance of enhancing our compliance and data security protocols. As a part of this strategic shift, we are actively searching for a top-tier regulatory affairs specialist to lead this transformation, ensuring unwavering adherence to industry standards and the safeguarding of our confidential research data.
259
- This comprehensive initiative goes beyond regulatory compliance and entails the implementation of three crucial security measures. We will be leveraging the cutting-edge encryption technology provided by CipherGuard to secure our research data, implementing BioShield's advanced security protocols for laboratory access, and integrating the real-time data monitoring and threat detection systems offered by SecureLabs.
260
- The enhancement of our regulatory affairs and data security measures is a pivotal component in safeguarding our proprietary research, reinforcing our commitment to excellence in drug development. While we prioritize compliance and data protection, the user experience for our research teams and partners will remain user-friendly and efficient, whether they are using our proprietary research software, "BioDiscover," or our mobile applications.
261
- We are actively seeking a regulatory affairs specialist who comprehends the critical importance of upholding compliance and data security in our industry and possesses the expertise to deliver a comprehensive and impervious solution that ensures not only our adherence to regulations but also preserves the confidentiality of our research data at XYZ Biotech.
262
- '''
263
  ```
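
The comments in the block above describe inverting the entity map so that the replacement names become the keys. The body of `swap_keys_and_values_in_string` is not shown in this diff, so the following is only a minimal sketch of the same idea under stated assumptions: the map arrives as a Python dict literal in a string, and `invert_entity_map` is a hypothetical helper name, not the README's function.

```python
import ast


def invert_entity_map(map_str: str) -> dict:
    """Parse a dict literal from a string and swap its keys and values.

    Hypothetical helper for illustration; it returns a dict, whereas the
    README's own function returns a string.
    """
    mapping = ast.literal_eval(map_str.strip())
    if not isinstance(mapping, dict):
        raise ValueError("expected a dict literal")

    inverted = {}
    for original, replacement in mapping.items():
        # Two originals mapped to the same replacement cannot be restored.
        if replacement in inverted and inverted[replacement] != original:
            raise ValueError(f"ambiguous replacement: {replacement!r}")
        inverted[replacement] = original
    return inverted


sample_map = '''
{
    "XYZ Biotech": "ABC Pharmaceuticals",
    "CipherGuard": "CodeGuard",
    "BioDiscover": "BioDiscover",
}
'''

print(invert_entity_map(sample_map))
```

Using `ast.literal_eval` rather than `eval` keeps the parse restricted to literals, and the ambiguity check surfaces maps that cannot be inverted cleanly instead of silently dropping an entry.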
 
264
 
265
  # Dataset and Training Documentation for Audit
266
  If you require the original dataset used for training this model, or further documentation related to its training and architecture for audit purposes, you can request this information by contacting us.
 
176
 
177
 
178
 
179
+
180
+
181
+
182
  # Example: Process the anonymized version with GPT-4 and change the entities back
183
  ```python
184
  import torch
185
+ import json
186
+ import re
187
  from transformers import AutoTokenizer, AutoModelForCausalLM
188
+
189
  tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
190
  model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16)
191
+ model.to("cuda:0")
192
 
 
193
 
194
+ # Anonymized input
195
+ anonymized_text = '''Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Edward Martin
 
196
 
197
+ Dear Mrs. Charlotte Johnson,
198
+
199
+ I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Edward Martin.
200
+
201
+ Employee Details:
202
+
203
+ Name: Edward Martin
204
+ Position: Senior Marketing Creative
205
+ Department: Marketing
206
+ Date of Joining: January 15, 2020
207
+ Reporting Manager: Mrs. Jane Anderson
208
+
209
+ Incident Details:
210
+ Date: October 25, 2023
211
+ Location: Restroom, topmost floor
212
+ Time: midday
213
+
214
+ Description of Incident:
215
+ On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Martin, which raised concerns about potential drug misuse. Witnesses mentioned that Edward appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
216
+
217
+ Witness Accounts:
218
+ Ms. Marie Foster: "Edward seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
219
+ Mr. Benjamin Adams: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."
220
+
221
+ Immediate Actions Taken:
222
+ Mr. Edward Martin was approached by People Management for a preliminary conversation to understand the situation.
223
+ Mrs. Jane Anderson, his reporting manager, was made aware of the concerns.
224
+
225
+ Recommendations:
226
+ It's crucial to have a private and supportive conversation with Mr. Martin to understand if there's an underlying issue.
227
+ Consider referring Edward to our Personal Assistance Program (PAP) for counseling or support.
228
+ It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
229
+ It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Martin and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.
230
+
231
+ Looking forward to your direction on this matter.
232
+ '''
233
+
234
+
235
+ # Entities map
236
+
237
+ entities_map = '''
238
+ {
239
+ "Mr. Benjamin Mitchell": "Mr. Edward Martin",
240
+ "Mrs. Alice Williams": "Mrs. Charlotte Johnson",
241
+ "January 15, 2020": "January 15, 2020",
242
+ "Mrs. Jane Fitzgerald": "Mrs. Jane Anderson",
243
+ "October 25, 2023": "October 25, 2023",
244
+ "4th Floor": "topmost floor",
245
+ "11:45 AM": "midday",
246
+ "Emily Clark": "Marie Foster",
247
+ "Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
248
+ "Robert Taylor": "Benjamin Adams",
249
+ }
250
+ '''
251
+
252
+
253
+
254
+ # Step 1: Processing anonymized text with GPT-4
255
+
256
+ import openai
257
+ openai.api_key = "<API_KEY>"
258
+
259
+ completion = openai.ChatCompletion.create(
260
+ model="gpt-4",
261
+ messages=[
262
+ {"role": "user", "content": f"Write an official warning letter to the employee, that we do not tolerate it and with the next incident he gets fired in the name of Mrs. Charlotte Johnson Human Resources Manager. Here is the report with information. {anonymized_text}"}
263
+ ]
264
+ )
265
+
266
+ print(completion.choices[0].message.content)
267
+ #output
268
+ '''
269
+ Subject: Official Warning – Substance Misuse Policy Violation
270
+
271
+ Dear Mr. Edward Martin,
272
+
273
+ We hope this letter finds you well. This letter serves as an official warning in regards to an incident that occurred on October 25, 2023, in which signs of drug misuse were reported. This alleged conduct is highly alarming and contrary to our company’s policies and guidelines.
274
+
275
+ The incident detailed allegations of unusual behavior indicative of possible substance abuse. Colleagues reported that you appeared disoriented and were found in the restroom for an extended period. Additionally, unidentified pills were discovered near your workspace.
276
+
277
+ Our company is committed to providing a safe and non-detrimental work environment for all its workforce. This commitment is compromised when any type of drug misuse occurs. We draw your attention to our Employee Handbook, specifically 'Section 5: Substance Misuse', that states any illegal drug use, substance misuse or distribution thereof is strictly prohibited and could be subject to severe disciplinary action, including termination.
278
+
279
+ This is an official warning that such behavior misaligns with our workplace norms and cannot be tolerated. Another incident like this or similar breach of company guidelines may lead to further disciplinary action, up to and including termination of employment.
280
+
281
+ Please note that this is not an assumption of your guilt but an assertion to remain vigilant against any practices that could harm you or the workplace environment. We encourage you to utilize our Personal Assistance Program (PAP) as a tool for counseling and support, if needed.
282
+
283
+ We believe in your potential to rectify this situation and to maintain the high standards we are all accustomed to in our organization.
284
+
285
+ Should you need assistance or if you wish to discuss this matter further, please feel free to reach out to me. We appreciate your immediate attention to this important issue.
286
+
287
+ Yours sincerely,
288
+
289
+ Mrs. Charlotte Johnson
290
+ Human Resources Manager
291
+ '''
292
+
293
+
294
+ # Step 2: Restore the original entities in the text processed by GPT-4.
295
 
296
  import ast
297
 
 
307
 
308
  return swapped_str
309
 
 
310
 
311
+ gpt_response = completion.choices[0].message.content
312
+ entities_map = swap_keys_and_values_in_string(entities_map)
313
+ prompt = f"USER: Rephrase with {entities_map}: {gpt_response}\n\nASSISTANT:"
314
+ inputs = tokenizer(prompt, return_tensors='pt').to('cuda:0')
315
+ outputs = model.generate(inputs.input_ids, max_new_tokens=2048)
316
+ output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
317
 
318
+ #output:
319
+ '''
320
+ Subject: Official Warning – Substance Misuse Policy Violation
321
 
322
+ Dear Mr. Benjamin Mitchell,
323
 
324
+ We hope this letter finds you well. This letter serves as an official warning in regards to an incident that occurred on January 15,
 
325
 
326
+ 2020, in which signs of drug misuse were reported. This alleged conduct is highly alarming and contrary to our company’s policies and guidelines.
327
 
328
+ The incident detailed allegations of unusual behavior indicative of possible substance abuse. Colleagues reported that you appeared disoriented and
329
 
330
+ were found in the restroom for an extended period. Additionally, unidentified pills were discovered near your workspace.
331
 
332
+ Our company is committed to providing a safe and non-detrimental work environment for all its workforce. This commitment is compromised when any
333
 
334
+ type of drug misuse occurs. We draw your attention to our Employee Handbook, specifically 'Section 5: Substance Misuse', that states any illegal
335
 
336
+ drug use, substance misuse or distribution thereof is strictly prohibited and could be subject to severe disciplinary action, including termination.
337
 
338
+ This is an official warning that such behavior misaligns with our workplace norms and cannot be tolerated. Another incident like this or similar breach
339
 
340
+ of company guidelines may lead to further disciplinary action, up to and including termination of employment.
 
 
 
341
 
342
+ Please note that this is not an assumption of your guilt but an assertion to remain vigilant against any practices that could harm you or the workplace
343
 
344
+ environment. We encourage you to utilize our Employee Assistance Program (EAP) as a tool for counseling and support, if needed.
345
 
346
+ We believe in your potential to rectify this situation and to maintain the high standards we are all accustomed to in our organization.
 
 
 
347
 
348
+ Should you need assistance or if you wish to discuss this matter further, please feel free to reach out to me. We appreciate your immediate attention
349
 
350
+ to this important issue.
351
 
352
+ Yours sincerely,
 
 
 
353
 
354
+ Mrs. Alice Williams,
355
 
356
+ Human Resources Manager.
357
  '''
358
+
 
 
 
 
359
  ```
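
For a decoder-only model, the string decoded from `outputs[0]` normally still contains the prompt, so the restored letter has to be cut out after the final `ASSISTANT:` marker. The sketch below shows one way to do that and to check whether any replacement entity survived the rephrasing. Both helper names are illustrative assumptions, and the sample strings are shortened stand-ins rather than real model output.

```python
def extract_last_assistant_response(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last occurrence of `marker`.

    If the marker is absent, the whole string is returned, stripped.
    """
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


def leftover_entities(text: str, swapped_map: dict) -> list:
    """List replacement entities that are still present in `text`.

    `swapped_map` maps replacement -> original. Identity pairs are skipped
    because they are expected to stay unchanged.
    """
    return [
        replacement
        for replacement, original in swapped_map.items()
        if replacement != original and replacement in text
    ]


# Shortened stand-in for the decoded model output (prompt + completion).
decoded = (
    "USER: Rephrase with {...}: Dear Mr. Edward Martin,\n\n"
    "ASSISTANT: Dear Mr. Benjamin Mitchell,"
)
swapped = {
    "Mr. Edward Martin": "Mr. Benjamin Mitchell",
    "October 25, 2023": "October 25, 2023",
}

restored = extract_last_assistant_response(decoded)
print(restored)
print(leftover_entities(restored, swapped))
```

An empty list from `leftover_entities` only shows that no replacement name remains verbatim. It does not confirm that every entity was restored to the right value, so dates and other values still deserve a manual read.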
360
+
361
 
362
  # Dataset and Training Documentation for Audit
363
  If you require the original dataset used for training this model, or further documentation related to its training and architecture for audit purposes, you can request this information by contacting us.