---
language: pt
license: apache-2.0
widget:
- text: "123.456.789-0"
  example_title: "CPF"
- text: "75528899000119"
  example_title: "CNPJ (sem pontuação)"
- text: "Nome Completo"
  example_title: "Felipe Casali Silva"
- text: "Dados diversos"
  example_title: "Felipe Casali Silva, Teste, Rio de Janeiro, RJ"
---

# lgpd_pii_identifier : LGPD PII Identifier

lgpd_pii_identifier is a pre-trained NLP model that identifies sensitive data within the scope of the LGPD (Lei Geral de Proteção de Dados), Brazil's General Data Protection Law.

The goal is to provide a tool that identifies document numbers such as CNPJ and CPF, people's names, and other kinds of sensitive data, so that companies can find and anonymize data according to their business needs and governance rules.

## Applications

### Identify PII (Personally Identifiable Information) in the scope of LGPD

# WIP (Add image here)

## Usage

To use the model, you need a Hugging Face auth token. You can get one [here](https://huggingface.co/settings/token).

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import numpy as np

# Maps the predicted class index to its label.
pred_mapper = {
    0: "cnpj",
    1: "cpf",
    2: "nome",
    3: "estado"
}

tokenizer = DistilBertTokenizer.from_pretrained("FelipeCasali-USP/lgpd_pii_identifier")
# The classification head is required to obtain `.logits`;
# the bare DistilBertModel only returns hidden states.
lgpd_pii_identifier = DistilBertForSequenceClassification.from_pretrained("FelipeCasali-USP/lgpd_pii_identifier")

tokens = tokenizer(["String to be analyzed"], return_tensors="pt", padding=True, truncation=True, max_length=512)
lgpd_pii_identifier_outputs = lgpd_pii_identifier(**tokens)

# One label per input string: take the class with the highest logit.
preds = [pred_mapper[np.argmax(pred)] for pred in lgpd_pii_identifier_outputs.logits.cpu().detach().numpy()]
```

## Author

- [Felipe Casali](https://www.linkedin.com/in/felipecasali/)

## Paper

- Paper: WIP
- MBA thesis: [lgpd_pii_identifier: Proteção de Dados Sensíveis na Era da Inteligência Artificial](WIP)
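
## Prediction confidence (sketch)

The Usage snippet picks the label with the highest logit but does not say how confident that choice is. The following is a minimal sketch, not part of the model itself, of how one row of logits could be turned into a label plus a probability using a softmax. The function names `softmax` and `label_with_confidence`, the `min_confidence` cut-off, and the sample logits are all hypothetical and chosen only for illustration; the label mapping is copied from the Usage section.

```python
import math

# Same index-to-label mapping as in the Usage section.
PRED_MAPPER = {0: "cnpj", 1: "cpf", 2: "nome", 3: "estado"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, min_confidence=0.5):
    """Return (label, probability); label is None below min_confidence.

    min_confidence is a hypothetical cut-off, not something the model defines.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = PRED_MAPPER[best] if probs[best] >= min_confidence else None
    return label, probs[best]


# Made-up logits for one input string, for illustration only.
label, confidence = label_with_confidence([0.2, 3.1, -1.0, 0.4])
print(label, round(confidence, 3))
```

In practice, each row of the logits array produced in the Usage snippet could be passed to `label_with_confidence` as a plain list. Returning `None` for low-confidence rows is one possible design choice; it lets callers route uncertain strings to manual review instead of anonymizing them under a possibly wrong label.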