MichaelWelsch committed
Commit d294f61 • 1 parent: 1118108
Update README.md
README.md CHANGED
@@ -21,12 +21,18 @@ Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Gr
Introducing a cutting-edge model tailored to the task of extracting entities from sensitive text and anonymizing it. This model specializes in identifying and safeguarding confidential information, ensuring organizations' compliance with stringent data privacy regulations and minimizing the potential for inadvertent disclosure of classified data and trade secrets.
# Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16)
-
-import re

def extract_last_assistant_response(input_text):
    # Find the occurrence of "ASSISTANT:" in the input text
@@ -37,39 +43,132 @@ def extract_last_assistant_response(input_text):
    response = input_text[start_index:].strip()
    return response


-
-We're currently seeking a talented developer to spearhead this transformation, ensuring a seamless separation between backend data processing and frontend presentation layers. The revised architecture will incorporate three critical APIs (Google Maps for location services, AccuWeather for weather data, and our in-house Analytica API for advanced analytics).
-The backend restructuring is a critical component, designed to serve as a showcase for the capabilities of our Analytica API. The frontend, both for the web and mobile interfaces, will maintain the current user experience using the existing design assets.
-We are actively searching for a Vue.js developer who can efficiently interpret our project vision and deliver an elegant, sustainable solution.'''


-
-inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
-output_entities = model.generate(inputs.input_ids, max_new_tokens=250, do_sample=False, top_k=50, top_p=0.98, num_beams=1)
-output_entities_text = tokenizer.decode(output_entities[0], skip_special_tokens=True)
#output
'''
'''
```
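The diff elides the middle of `extract_last_assistant_response`, so only its opening comment and its last two lines are visible. The sketch below is one plausible reading of those visible lines, not the README's actual code: it assumes the helper looks for the last `ASSISTANT:` marker with `str.rfind` and returns everything after it. The name `extract_last_response_sketch`, the `marker` parameter, and the fallback when the marker is missing are all illustrative assumptions.

```python
def extract_last_response_sketch(input_text: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last occurrence of `marker` (hypothetical helper)."""
    # str.rfind returns -1 when the marker is absent; this sketch then falls
    # back to the whole text, stripped. That fallback is an assumption.
    position = input_text.rfind(marker)
    if position == -1:
        return input_text.strip()
    # Skip past the marker itself so only the reply remains.
    start_index = position + len(marker)
    response = input_text[start_index:].strip()
    return response


if __name__ == "__main__":
    decoded = "USER: Resample the entities: some text\n\nASSISTANT: {'A': 'B'}"
    print(extract_last_response_sketch(decoded))
```

Using the last marker rather than the first matters when the prompt text itself contains `ASSISTANT:`; whether the original helper behaves this way cannot be confirmed from the diff.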
# Example inverted usage
```python
import torch
Introducing a cutting-edge model tailored to the task of extracting entities from sensitive text and anonymizing it. This model specializes in identifying and safeguarding confidential information, ensuring organizations' compliance with stringent data privacy regulations and minimizing the potential for inadvertent disclosure of classified data and trade secrets.
# Example Usage
```python
+!pip install sentencepiece
+!pip install transformers
+```
+```python
import torch
+import json
+import re
from transformers import AutoTokenizer, AutoModelForCausalLM
+
tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16)
+model.to("cuda:0")

def extract_last_assistant_response(input_text):
    # Find the occurrence of "ASSISTANT:" in the input text
@@ -37,39 +43,132 @@ def extract_last_assistant_response(input_text):
    response = input_text[start_index:].strip()
    return response

+# Input example
+text_to_anonymize = '''Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Benjamin Mitchell

+Dear Mrs. Alice Williams,

+I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Benjamin Mitchell.

+Employee Details:

+Name: Benjamin Mitchell
+Position: Senior Marketing Creative
+Department: Marketing
+Date of Joining: January 15, 2020
+Reporting Manager: Mrs. Jane Fitzgerald

+Incident Details:
+Date: October 25, 2023
+Location: Restroom, 4th Floor
+Time: 11:45 AM
+
+Description of Incident:
+On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Mitchell, which raised concerns about potential drug misuse. Witnesses mentioned that Benjamin appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
+
+Witness Accounts:
+Ms. Emily Clark: "Benjamin seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
+Mr. Robert Taylor: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."

+Immediate Actions Taken:
+Mr. Benjamin Mitchell was approached by HR for a preliminary conversation to understand the situation.
+Mrs. Jane Fitzgerald, his reporting manager, was made aware of the concerns.

+Recommendations:
+It's crucial to have a private and supportive conversation with Mr. Mitchell to understand if there's an underlying issue.
+Consider referring Benjamin to our Employee Assistance Program (EAP) for counseling or support.
+It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
+It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Mitchell and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.

+Looking forward to your direction on this matter.
+'''
+print(text_to_anonymize)
+
+# Step 1: Extracting entities from text
+prompt = f'USER: Resample the entities: {text_to_anonymize}\n\nASSISTANT:'
+inputs = tokenizer(prompt, return_tensors='pt').to('cuda:0')
+output_entities = model.generate(inputs.input_ids, max_new_tokens=250, do_sample=True, temperature=0.85)
+raw_output_entities_text = tokenizer.decode(output_entities[0])
+entities = extract_last_assistant_response(raw_output_entities_text)
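+
+print('-----------Entities----------------')
+import ast  # needed for literal_eval below
+try:
+    # Take the first {...} span and parse it as a literal; unlike eval(), this cannot run arbitrary code
+    entities = re.search(r"\{.*?\}", entities, re.DOTALL).group(0)
+    data_dict = ast.literal_eval(entities)
+    formatted_json = json.dumps(data_dict, indent=4)
+    print(formatted_json)
+except (AttributeError, ValueError, SyntaxError, TypeError):
+    # No {...} span found, or the mapping is malformed: print the raw text instead
+    print(entities)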
#output
'''
+{
+    "Mr. Benjamin Mitchell": "Mr. Edward Martin",
+    "Mrs. Alice Williams": "Mrs. Charlotte Johnson",
+    "January 15, 2020": "January 15, 2020",
+    "Mrs. Jane Fitzgerald": "Mrs. Jane Anderson",
+    "October 25, 2023": "October 25, 2023",
+    "4th Floor": "topmost floor",
+    "11:45 AM": "midday",
+    "Emily Clark": "Marie Foster",
+    "Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
+    "Robert Taylor": "Benjamin Adams",
+}
'''
+
+# Step 2: Use entities to resample the original text
+prompt_2 = f"USER: Rephrase with {entities}: {text_to_anonymize}\n\nASSISTANT:"
+inputs = tokenizer(prompt_2, return_tensors='pt').to('cuda:0')
+output_resampled = model.generate(inputs.input_ids, max_length=2048)
+raw_output_resampled_text = tokenizer.decode(output_resampled[0])
+resampled_text = extract_last_assistant_response(raw_output_resampled_text)
+print('---------Anonymized Version--------')
+print(resampled_text)
+#output:
+'''
+Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Edward Martin
+
+Dear Mrs. Charlotte Johnson,
+
+I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Edward Martin.
+
+Employee Details:
+
+Name: Edward Martin
+Position: Senior Marketing Creative
+Department: Marketing
+Date of Joining: January 15, 2020
+Reporting Manager: Mrs. Jane Anderson
+
+Incident Details:
+Date: October 25, 2023
+Location: Restroom, topmost floor
+Time: midday
+
+Description of Incident:
+On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Martin, which raised concerns about potential drug misuse. Witnesses mentioned that Edward appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
+
+Witness Accounts:
+Ms. Marie Foster: "Edward seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
+Mr. Benjamin Adams: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."
+
+Immediate Actions Taken:
+Mr. Edward Martin was approached by People Management for a preliminary conversation to understand the situation.
+Mrs. Jane Anderson, his reporting manager, was made aware of the concerns.
+
+Recommendations:
+It's crucial to have a private and supportive conversation with Mr. Martin to understand if there's an underlying issue.
+Consider referring Edward to our Personal Assistance Program (PAP) for counseling or support.
+It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
+It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Martin and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.
+
+Looking forward to your direction on this matter.
+'''
+
```
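Step 1 above relies on the model emitting a dict-like string, and Step 2 asks the model itself to rewrite the text. As a rough, model-free sketch of the same two ideas, the block below parses such a string with `ast.literal_eval` (which evaluates literals only and accepts the trailing comma seen in the sample output) and then applies the mapping by plain string replacement in a single pass. `parse_entity_map` and `apply_entity_map` are hypothetical helper names, not part of the README or of `transformers`. Plain replacement will also miss partial mentions such as "Mr. Mitchell" or "Benjamin", which the model rewrites in context.

```python
import ast
import re


def parse_entity_map(model_output: str) -> dict:
    """Extract the first {...} span and parse it as a dict of str -> str."""
    match = re.search(r"\{.*?\}", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no {...} span found in model output")
    parsed = ast.literal_eval(match.group(0))  # literals only, no code execution
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal")
    return {str(key): str(value) for key, value in parsed.items()}


def apply_entity_map(text: str, entity_map: dict) -> str:
    """Replace each original entity with its substitute in a single pass."""
    if not entity_map:
        return text
    # Longest keys first, so a full name wins over a shorter overlapping key.
    keys = sorted(entity_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # One pass over the text: a substitute is never rewritten again, even if
    # it happens to equal another key in the mapping.
    return pattern.sub(lambda m: entity_map[m.group(0)], text)


if __name__ == "__main__":
    raw = '{"Mr. Benjamin Mitchell": "Mr. Edward Martin", "4th Floor": "topmost floor",}'
    mapping = parse_entity_map(raw)
    print(apply_entity_map("Mr. Benjamin Mitchell was seen on the 4th Floor.", mapping))
```

The single-pass substitution is a deliberate choice: chained `str.replace` calls could rewrite a substitute that collides with another original, as with "Benjamin" appearing on both sides of the sample mapping.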
+
# Example inverted usage
```python
import torch