metadata

language: pt
license: apache-2.0
widget:
  - text: 123.456.789-0
    example_title: CPF
  - text: '75528899000119'
    example_title: CNPJ (no punctuation)
  - text: Felipe Casali Silva
    example_title: Full name
  - text: Felipe Casali Silva, Teste, Rio de Janeiro, RJ
    example_title: Miscellaneous data

lgpd_pii_identifier: Financial BERT PT BR

lgpd_pii_identifier is a pre-trained NLP model that identifies sensitive data within the scope of the LGPD (Lei Geral de Proteção de Dados), Brazil's General Data Protection Law.

The goal is to provide a tool that identifies document numbers such as CNPJ and CPF, people's names, and other kinds of sensitive data, so that companies can find and anonymize data according to their business needs and governance rules.
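As a rough illustration of the "find and anonymize" step, the sketch below masks values once a label has been predicted for each of them. The `anonymize` helper, the placeholder format, and the default set of labels treated as sensitive are all hypothetical and not part of the model; the label names are the ones used in the Usage section.

```python
# Hypothetical helper: not part of the model or of any library.
# Assumption: "cnpj", "cpf" and "nome" are the labels a company wants masked.
DEFAULT_SENSITIVE = frozenset({"cnpj", "cpf", "nome"})


def anonymize(values, labels, sensitive=DEFAULT_SENSITIVE):
    """Replace each value whose predicted label is sensitive with a placeholder tag."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    return [
        f"<{label.upper()}>" if label in sensitive else value
        for value, label in zip(values, labels)
    ]


# Example: labels as they might come back from the classifier.
masked = anonymize(["123.456.789-0", "RJ"], ["cpf", "estado"])
print(masked)
```

Passing a different `sensitive` set is one way to reflect different governance rules, for example masking only document numbers and leaving names untouched.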

Applications

Identify PII (Personally Identifiable Information) within the scope of the LGPD.

WIP (Add image here)

Usage

To use the model, you need a Hugging Face auth token, which you can create in your Hugging Face account settings.

from transformers import AutoTokenizer, BertForSequenceClassification
import numpy as np
  
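# Maps the model's output class indices to label names
# ("nome" = person's name, "estado" = state)
pred_mapper = {
    0: "cnpj",
    1: "cpf",
    2: "nome",
    3: "estado",
}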

tokenizer = AutoTokenizer.from_pretrained("FelipeCasali-USP/lgpd_pii_identifier")
lgpd_pii_identifier = BertForSequenceClassification.from_pretrained("FelipeCasali-USP/lgpd_pii_identifier")

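# Tokenize the input strings (replace the example with your own text)
tokens = tokenizer(["String to be analyzed"], return_tensors="pt",
                   padding=True, truncation=True, max_length=512)
lgpd_pii_identifier_outputs = lgpd_pii_identifier(**tokens)

# Pick the highest-scoring class for each input and map it to its label
preds = [pred_mapper[np.argmax(pred)] for pred in lgpd_pii_identifier_outputs.logits.cpu().detach().numpy()]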
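The snippet above keeps only the winning label. If a confidence score is also useful, for example to send low-confidence inputs to manual review, the logits can be passed through a softmax first. The following is a minimal sketch in plain Python; `logits_to_predictions` is a hypothetical helper, and it assumes the logits have been converted to nested lists, for instance with `lgpd_pii_identifier_outputs.logits.detach().tolist()`.

```python
import math

# Same index-to-label mapping as in the usage snippet.
PRED_MAPPER = {0: "cnpj", 1: "cpf", 2: "nome", 3: "estado"}


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(row)
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_predictions(logits, mapper=PRED_MAPPER):
    """Return a (label, probability) pair for each row of logits."""
    results = []
    for row in logits:
        probs = softmax(row)
        # Index of the highest probability (first one wins on ties).
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((mapper[best], probs[best]))
    return results


# Example with made-up logits for a single input.
print(logits_to_predictions([[0.0, 5.0, 0.0, 0.0]]))
```

The probabilities are only as trustworthy as the model's calibration, so any threshold applied to them is a judgment call rather than a guarantee.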

Author

Felipe Casali

Paper

Paper: WIP
MBA thesis: lgpd_pii_identifier: Proteção de Dados Sensíveis na Era da Inteligência Artificial