---
license: apache-2.0
datasets:
- metricspace/AnonymeData
pipeline_tag: text2text-generation
---

# EntityAnonymization-3B-V0.9

# License

This Natural Language Processing (NLP) model is made available under the Apache License, Version 2.0. You are free to use, modify, and distribute this software according to the terms and conditions of the Apache 2.0 License. For the full license text, please refer to the Apache 2.0 License.

# Usage and Specific Capabilities

## Text Length Limitation

The model is optimized to analyze texts containing up to 2048 tokens. If your text exceeds this limit, we recommend splitting it into smaller chunks, each containing no more than 2048 tokens. Each chunk can then be processed separately.

## Supported Languages

Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish

# Use Cases

## Entity Resampling and Anonymization

Introducing a cutting-edge model tailored to the task of extracting entities from sensitive text and anonymizing it. This model specializes in identifying and safeguarding confidential information, ensuring organizations' compliance with stringent data privacy regulations and minimizing the potential for inadvertent disclosure of classified data and trade secrets.

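The chunking recommended under Text Length Limitation could be sketched roughly as follows. This is a minimal illustration, not part of the model's own tooling: the helper name `chunk_token_ids` is hypothetical, and a plain list of integers stands in for real token ids so the snippet runs without downloading the model.

```python
from typing import List, Sequence


def chunk_token_ids(token_ids: Sequence[int], max_tokens: int = 2048) -> List[List[int]]:
    """Split a token-id sequence into consecutive chunks of at most max_tokens ids.

    Hypothetical helper for illustration only.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        list(token_ids[i:i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]


# Stand-in for real token ids (assumption: 5000 tokens of input text)
fake_ids = list(range(5000))
chunks = chunk_token_ids(fake_ids)
print([len(c) for c in chunks])  # [2048, 2048, 904]
```

With the real tokenizer, the ids would presumably come from `tokenizer(text, add_special_tokens=False).input_ids`, and each chunk could be turned back into text with `tokenizer.decode(chunk)` before it is placed into a prompt. Fixed-size splitting can cut through a sentence or an entity, and the prompt wrapper adds tokens of its own, so a smaller `max_tokens` or a split on paragraph boundaries may work better in practice.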
# Example Usage

```python
import re

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use a GPU when one is available; the model and the inputs must be on the same device
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained(
    "metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16
).to(device)


def extract_last_assistant_response(input_text):
    # Find all occurrences of "ASSISTANT:" in the input text
    matches = list(re.finditer(r'ASSISTANT:', input_text))
    if not matches:
        # No marker found: return the text unchanged (stripped)
        return input_text.strip()
    # Keep everything after the last "ASSISTANT:"
    start_index = matches[-1].end()
    return input_text[start_index:].strip()


text_to_anonymize = '''Our organization manages a sophisticated data analytics platform ([login to view URL]) that highlights our cutting-edge data visualization techniques. In response to evolving business needs, we've recognized the imperative to optimize our data handling processes. As part of this initiative, we're seizing the opportunity to standardize the codebase for our web and mobile applications using a unified approach with Vue.js. We're currently seeking a talented developer to spearhead this transformation, ensuring a seamless separation between backend data processing and frontend presentation layers. The revised architecture will incorporate three critical APIs (Google Maps for location services, AccuWeather for weather data, and our in-house Analytica API for advanced analytics). The backend restructuring is a critical component, designed to serve as a showcase for the capabilities of our Analytica API. The frontend, both for the web and mobile interfaces, will maintain the current user experience using the existing design assets.
We are actively searching for a Vue.js developer who can efficiently interpret our project vision and deliver an elegant, sustainable solution.'''

# Step 1: ask the model for resampled (replacement) entities
prompt = f'USER: Resample the entities: {text_to_anonymize}\n\nASSISTANT:'
inputs = tokenizer(prompt, return_tensors='pt').to(device)
# do_sample=False means greedy decoding, so top_k and top_p would have no effect
output_entities = model.generate(inputs.input_ids, max_new_tokens=250, do_sample=False, num_beams=1)
output_entities_text = tokenizer.decode(output_entities[0], skip_special_tokens=True)

# Extract the entities text from the assistant response
generated_part = extract_last_assistant_response(output_entities_text)

# Step 2: rephrase the original text using the resampled entities
prompt_2 = f"USER: Rephrase with {generated_part}: {text_to_anonymize}\n\nASSISTANT:"
inputs = tokenizer(prompt_2, return_tensors='pt').to(device)
output_resampled = model.generate(inputs.input_ids, max_new_tokens=500, do_sample=False)
output_resampled_text = tokenizer.decode(output_resampled[0], skip_special_tokens=True)

# The decoded text still contains the prompt, so print only the assistant response
print(extract_last_assistant_response(output_resampled_text))

#output
'''
Our enterprise manages an advanced data analysis platform ([login to view URL]) that highlights our innovative data visualization methods. In response to evolving business needs, we've recognized the imperative to optimize our data handling processes. As part of this initiative, we're seizing the opportunity to standardize the codebase for our online and mobile applications using a unified approach with Vega.js. We're currently seeking a talented developer to spearhead this transformation, ensuring a seamless separation between backend data processing and frontend presentation layers. The revised architecture will incorporate three critical APIs (Maple Maps for location services, MeteorWeather for weather data, and our in-house Analytica API for advanced analytics). The backend restructuring is a critical component, designed to serve as a showcase for the capabilities of our Analytica API.
The frontend, both for the web and mobile interfaces, will maintain the current user experience using the existing design assets. We are actively searching for a Vega.js developer who can efficiently interpret our project vision and deliver an elegant, sustainable solution
'''
```

…

# Dataset and Training Documentation for Audit

If you require the original dataset used for training this model, or further documentation related to its training and architecture for audit purposes, you can request this information by contacting us.

# Further Tuning Services for Custom Use Cases

For specialized needs or custom use cases, we offer further tuning services to adapt the model to your specific requirements. To inquire about these services, please reach out to us at:

📧 Email: info@metric-space.ai

Please note that the availability of the dataset, additional documentation, and tuning services may be subject to certain conditions and limitations.