---
license: apache-2.0
datasets:
  - metricspace/AnonymeData
pipeline_tag: text2text-generation
---

# EntityAnonymization-3B-V0.9

## License

This Natural Language Processing (NLP) model is made available under the Apache License, Version 2.0. You are free to use, modify, and distribute this software according to the terms and conditions of that license. For the full license text, please refer to the Apache 2.0 License.

## Usage and Specific Capabilities

### Text Length Limitation

The model is optimized to analyze texts containing up to 2048 tokens. If your text exceeds this limit, we recommend splitting it into smaller chunks, each containing no more than 2048 tokens. Each chunk can then be processed separately.
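One possible way to split a long input is to chunk it at the token level. The helper below is a minimal sketch (not part of the model's API): it splits a flat list of token ids into consecutive chunks of at most 2048 tokens, which you could then decode and process one at a time.

```python
def chunk_token_ids(token_ids, max_tokens=2048):
    """Split a flat list of token ids into consecutive chunks of at most max_tokens."""
    return [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), max_tokens)]


# With the tokenizer loaded in the examples below, this could be used as:
#   ids = tokenizer(long_text).input_ids
#   chunks = [tokenizer.decode(c, skip_special_tokens=True) for c in chunk_token_ids(ids)]
```

Note that naive token-level splitting can cut a sentence in half; splitting on paragraph or sentence boundaries and only falling back to hard token cuts for oversized paragraphs may give better results.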

### Supported Languages

Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish

## Use Cases

### Entity Resampling and Anonymization

This model extracts entities from sensitive text and replaces them with plausible substitutes, anonymizing the original. By identifying and safeguarding confidential information, it helps organizations comply with stringent data privacy regulations and reduces the risk of inadvertently disclosing classified data and trade secrets.

### Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained(
    "metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16
).to("cuda")


def extract_last_assistant_response(input_text):
    # Return everything after the last "ASSISTANT:" marker
    return input_text.rsplit("ASSISTANT:", 1)[-1].strip()


text_to_anonymize = '''Our organization manages a sophisticated data analytics platform ([login to view URL]) that highlights our cutting-edge data visualization techniques. In response to evolving business needs, we've recognized the imperative to optimize our data handling processes. As part of this initiative, we're seizing the opportunity to standardize the codebase for our web and mobile applications using a unified approach with Vue.js.
We're currently seeking a talented developer to spearhead this transformation, ensuring a seamless separation between backend data processing and frontend presentation layers. The revised architecture will incorporate three critical APIs (Google Maps for location services, AccuWeather for weather data, and our in-house Analytica API for advanced analytics).
The backend restructuring is a critical component, designed to serve as a showcase for the capabilities of our Analytica API. The frontend, both for the web and mobile interfaces, will maintain the current user experience using the existing design assets.
We are actively searching for a Vue.js developer who can efficiently interpret our project vision and deliver an elegant, sustainable solution.'''


# Step 1: ask the model for an entity map of the text
prompt = f'USER: Resample the entities: {text_to_anonymize}\n\nASSISTANT:'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
output_entities = model.generate(inputs.input_ids, max_new_tokens=250, do_sample=False, num_beams=1)
output_entities_text = tokenizer.decode(output_entities[0], skip_special_tokens=True)

# extract the entity map from the assistant response
generated_part = extract_last_assistant_response(output_entities_text)

# Step 2: rephrase the original text with the resampled entities
prompt_2 = f"USER: Rephrase with {generated_part}: {text_to_anonymize}\n\nASSISTANT:"
inputs = tokenizer(prompt_2, return_tensors='pt').to('cuda')
output_resampled = model.generate(inputs.input_ids, max_new_tokens=500, do_sample=False)
output_resampled_text = tokenizer.decode(output_resampled[0], skip_special_tokens=True)

print(output_resampled_text)

# output:
'''
Our enterprise manages an advanced data analysis platform ([login to view URL]) that highlights our innovative data visualization methods. In response to evolving business needs, we've recognized the imperative to optimize our data handling processes. As part of this initiative, we're seizing the opportunity to standardize the codebase for our online and mobile applications using a unified approach with Vega.js.
We're currently seeking a talented developer to spearhead this transformation, ensuring a seamless separation between backend data processing and frontend presentation layers. The revised architecture will incorporate three critical APIs (Maple Maps for location services, MeteorWeather for weather data, and our in-house Analytica API for advanced analytics).
The backend restructuring is a critical component, designed to serve as a showcase for the capabilities of our Analytica API. The frontend, both for the web and mobile interfaces, will maintain the current user experience using the existing design assets.
We are actively searching for a Vega.js developer who can efficiently interpret our project vision and deliver an elegant, sustainable solution
'''
```

### Example Inverted Usage

```python
import ast

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained(
    "metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16
).to("cuda")


def extract_last_assistant_response(input_text):
    # Return everything after the last "ASSISTANT:" marker
    return input_text.rsplit("ASSISTANT:", 1)[-1].strip()


def swap_keys_and_values_in_string(input_str):
    # Convert the entity-map string to a dictionary
    input_dict = ast.literal_eval(input_str)

    # Swap the keys and values
    swapped_dict = {v: k for k, v in input_dict.items()}

    # Convert the swapped dictionary back to a string
    return str(swapped_dict)


# sample text for entity extraction and resampling

original_text = '''Our organization, XYZ Biotech, operates at the forefront of groundbreaking pharmaceutical research, renowned for our pioneering drug development and breakthrough treatments. In light of the ever-evolving regulatory landscape and the need to safeguard our research endeavors, we've recognized the critical importance of enhancing our compliance and data security protocols. To this end, we are on the lookout for a top-notch regulatory affairs specialist to spearhead this transformation, ensuring the rigorous adherence to industry standards and the protection of our confidential research data.

This comprehensive initiative encompasses not only ensuring regulatory compliance but also the implementation of three vital security measures. We will be utilizing CipherGuard's state-of-the-art encryption technology to secure our research data, deploying BioShield's advanced security protocols for laboratory access, and integrating SecureLabs' real-time data monitoring and threat detection systems.

The enhancement of our regulatory affairs and data security measures is a critical component in safeguarding our proprietary research, reinforcing our commitment to drug development excellence. While we prioritize compliance and data protection, the user experience for our research teams and partners will remain user-friendly and efficient, whether they are using our proprietary research software, "BioDiscover," or our mobile applications.

We are actively in search of a regulatory affairs specialist who can comprehend the importance of maintaining compliance and data security in our industry and who can deliver a comprehensive, airtight solution that not only ensures our adherence to regulations but also safeguards the confidential nature of our research at XYZ Biotech.'''


# a different, already anonymized text with replaced entities

anonymized_text = '''ABC Pharmaceuticals, a renowned player in the pharmaceutical industry, is dedicated to pioneering drug development and breakthrough treatments. In response to the ever-evolving regulatory landscape and the need to protect our research initiatives, we have identified the paramount importance of enhancing our compliance and data security protocols. As a part of this strategic shift, we are actively searching for a top-tier regulatory affairs specialist to lead this transformation, ensuring unwavering adherence to industry standards and the safeguarding of our confidential research data.

This comprehensive initiative goes beyond regulatory compliance and entails the implementation of three crucial security measures. We will be leveraging the cutting-edge encryption technology provided by CodeGuard to secure our research data, implementing BioProtect's advanced security protocols for laboratory access, and integrating the real-time data monitoring and threat detection systems offered by SecureTech.

The enhancement of our regulatory affairs and data security measures is a pivotal component in safeguarding our proprietary research, reinforcing our commitment to excellence in drug development. While we prioritize compliance and data protection, the user experience for our research teams and partners will remain user-friendly and efficient, whether they are using our proprietary research software, "BioDiscover," or our mobile applications.

We are actively seeking a regulatory affairs specialist who comprehends the critical importance of upholding compliance and data security in our industry and possesses the expertise to deliver a comprehensive and impervious solution that ensures not only our adherence to regulations but also preserves the confidentiality of our research data at ABC Pharmaceuticals.'''


# Step 1: extract the entity map from the original text
prompt = f'USER: Resample the entities: {original_text}\n\nASSISTANT:'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
outputs = model.generate(inputs.input_ids, max_new_tokens=250, do_sample=False, num_beams=1)
output_text_1 = tokenizer.decode(outputs[0], skip_special_tokens=True)

generated_part = extract_last_assistant_response(output_text_1)

# Step 2: invert the entity map, e.g.
# {'XYZ Biotech': 'ABC Pharmaceuticals', 'CipherGuard': 'CodeGuard', 'BioShield': 'BioProtect', 'SecureLabs': 'SecureTech', 'BioDiscover': 'BioDiscover'}
# inverted to:
# {'ABC Pharmaceuticals': 'XYZ Biotech', 'CodeGuard': 'CipherGuard', 'BioProtect': 'BioShield', 'SecureTech': 'SecureLabs', 'BioDiscover': 'BioDiscover'}

inverted_entities = swap_keys_and_values_in_string(generated_part)

# Step 3: rephrase the anonymized text, restoring the original entities
prompt_2 = f"USER: Rephrase with {inverted_entities}: {anonymized_text}\n\nASSISTANT:"
inputs = tokenizer(prompt_2, return_tensors='pt').to('cuda')
outputs = model.generate(inputs.input_ids, max_new_tokens=500, do_sample=False)
output_text_2 = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text_2)

# output:
'''
XYZ Biotech, a renowned player in the biotech industry, is dedicated to pioneering drug development and breakthrough treatments. In response to the ever-evolving regulatory landscape and the need to protect our research initiatives, we have identified the paramount importance of enhancing our compliance and data security protocols. As a part of this strategic shift, we are actively searching for a top-tier regulatory affairs specialist to lead this transformation, ensuring unwavering adherence to industry standards and the safeguarding of our confidential research data.
This comprehensive initiative goes beyond regulatory compliance and entails the implementation of three crucial security measures. We will be leveraging the cutting-edge encryption technology provided by CipherGuard to secure our research data, implementing BioShield's advanced security protocols for laboratory access, and integrating the real-time data monitoring and threat detection systems offered by SecureLabs.
The enhancement of our regulatory affairs and data security measures is a pivotal component in safeguarding our proprietary research, reinforcing our commitment to excellence in drug development. While we prioritize compliance and data protection, the user experience for our research teams and partners will remain user-friendly and efficient, whether they are using our proprietary research software, "BioDiscover," or our mobile applications.
We are actively seeking a regulatory affairs specialist who comprehends the critical importance of upholding compliance and data security in our industry and possesses the expertise to deliver a comprehensive and impervious solution that ensures not only our adherence to regulations but also preserves the confidentiality of our research data at XYZ Biotech.
'''
```
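If you already hold the anonymized text and the inverted entity map, a second model pass is not strictly required: a plain string substitution can restore the original entities deterministically. The `apply_entity_map` helper below is a hypothetical sketch (not part of the model's API) and only works when the anonymized text contains the replacement entities verbatim.

```python
def apply_entity_map(text, mapping):
    """Replace each key in `mapping` with its value, processing longer keys first
    so that entity names which are substrings of other names resolve correctly."""
    for source in sorted(mapping, key=len, reverse=True):
        text = text.replace(source, mapping[source])
    return text


# e.g. apply_entity_map(anonymized_text,
#                       {'ABC Pharmaceuticals': 'XYZ Biotech', 'CodeGuard': 'CipherGuard'})
```

The model-based rephrasing shown above can handle inflected or paraphrased entity mentions, which this literal substitution cannot.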

## Dataset and Training Documentation for Audit

If you require the original dataset used for training this model, or further documentation related to its training and architecture for audit purposes, you can request this information by contacting us.

## Further Tuning Services for Custom Use Cases

For specialized needs or custom use cases, we offer further tuning services to adapt the model to your specific requirements. To inquire about these services, please reach out to us at:

📧 Email: info@metric-space.ai

Please note that the availability of the dataset, additional documentation, and tuning services may be subject to certain conditions and limitations.