---
license: apache-2.0
datasets:
- metricspace/AnonymeData
pipeline_tag: text2text-generation
---

# EntityAnonymization-3B-V0.9
# License

This Natural Language Processing (NLP) model is released under the Apache License, Version 2.0. You are free to use, modify, and distribute it according to the terms and conditions of that license. For the full license text, please refer to the Apache 2.0 License.
# Usage and Specific Capabilities

## Text Length Limitation

The model is optimized for texts of up to 2048 tokens. If your text exceeds this limit, we recommend splitting it into smaller chunks of no more than 512 tokens each and processing each chunk separately.
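The chunking step can be sketched as follows. This is a minimal illustration, not part of the model's API: `chunk_text` is a hypothetical helper that splits on sentence boundaries, and it accepts any tokenize callable (in practice you would pass `tokenizer.tokenize` from the model's tokenizer; a plain whitespace split stands in below).

```python
import re


def chunk_text(text, tokenize, max_tokens=512):
    """Split text into chunks of at most max_tokens tokens.

    `tokenize` is any callable mapping a string to a list of tokens,
    e.g. `tokenizer.tokenize` from the model's tokenizer. Splitting
    happens on sentence boundaries so chunks stay coherent; a single
    sentence longer than max_tokens is kept as one oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(tokenize(candidate)) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Example with a whitespace tokenizer as a stand-in for tokenizer.tokenize
text = "First sentence here. Second sentence follows. A third one ends it."
print(chunk_text(text, str.split, max_tokens=6))
```

Each resulting chunk can then be passed through the anonymization prompts independently.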
## Supported Languages

Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish

# Use Cases

## Entity Resampling and Anonymization

This model extracts entities from sensitive text and anonymizes it by resampling them. It identifies and safeguards confidential information, helping organizations comply with stringent data privacy regulations and minimizing the risk of inadvertently disclosing classified data or trade secrets.
# Example Usage

```python
import re

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16).to("cuda")


def extract_assistant_response(input_text):
    # Find all occurrences of "ASSISTANT:" in the input text
    matches = re.finditer(r"ASSISTANT:", input_text)

    # Extract the text after each occurrence of "ASSISTANT:"
    assistant_responses = []
    for match in matches:
        start_index = match.end()  # Index where "ASSISTANT:" ends
        response = input_text[start_index:].strip()
        assistant_responses.append(response)

    return assistant_responses


text_to_anonymize = "Sophia had always been enchanted by Venice, a historic city nestled in the heart of the Venetian lagoon. She had explored Venice on numerous occasions, each visit revealing hidden treasures in the enchanting city. On her latest trip, Sophia met Marco, a local historian, who shared captivating stories about the history of Venice."

# Step 1: extract and resample the entities
prompt = f"USER: Resample the entities: {text_to_anonymize}\n\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output_entities = model.generate(inputs.input_ids, max_new_tokens=250, do_sample=False, top_k=50, top_p=0.98, num_beams=1)
output_entities_text = tokenizer.decode(output_entities[0], skip_special_tokens=True)

# Extract the entity mapping from the assistant response
generated_part = extract_assistant_response(output_entities_text)[0]

# Step 2: rephrase the original text using the resampled entities
prompt_2 = f"USER: Rephrase with {generated_part}: {text_to_anonymize}\n\nASSISTANT:"
inputs = tokenizer(prompt_2, return_tensors="pt").to("cuda")
output_resampled = model.generate(inputs.input_ids, max_new_tokens=500, do_sample=False, top_k=50, top_p=0.98)
output_resampled_text = tokenizer.decode(output_resampled[0], skip_special_tokens=True)

print(output_resampled_text)
```
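As a quick sanity check, `extract_assistant_response` can be exercised on a mock decoded output without downloading the model. The mock string below (and the entity-mapping format it shows) is purely illustrative, not actual model output:

```python
import re


def extract_assistant_response(input_text):
    # Collect the text following each "ASSISTANT:" marker
    matches = re.finditer(r"ASSISTANT:", input_text)
    assistant_responses = []
    for match in matches:
        start_index = match.end()
        response = input_text[start_index:].strip()
        assistant_responses.append(response)
    return assistant_responses


# Mock decoded output: the prompt echoed back, then the completion
mock_output = (
    "USER: Resample the entities: Sophia visited Venice.\n\n"
    "ASSISTANT: Sophia -> Clara, Venice -> Lisbon"
)
print(extract_assistant_response(mock_output)[0])
# → Sophia -> Clara, Venice -> Lisbon
```

Because `tokenizer.decode` returns the prompt together with the completion, slicing at the last `ASSISTANT:` marker is what isolates the generated part.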
# Dataset and Training Documentation for Audit

If you require the original dataset used to train this model, or further documentation on its training and architecture for audit purposes, you can request this information by contacting us.

# Further Tuning Services for Custom Use Cases

For specialized needs or custom use cases, we offer further tuning services to adapt the model to your specific requirements. To inquire about these services, please reach out to us at:

📧 Email: info@metric-space.ai

Please note that the availability of the dataset, additional documentation, and tuning services may be subject to certain conditions and limitations.