metadata
language: pt
license: apache-2.0
widget:
- text: 123.456.789-0
  example_title: CPF
- text: '75528899000119'
  example_title: CNPJ (sem pontuação)
- text: Nome Completo
  example_title: Felipe Casali Silva
- text: Dados diversos
  example_title: Felipe Casali Silva, Teste, Rio de Janeiro, RJ
lgpd_pii_identifier: Financial BERT PT BR
lgpd_pii_identifier is a pre-trained NLP model that identifies sensitive data within the scope of the LGPD (Lei Geral de Proteção de Dados, Brazil's General Data Protection Law).
The goal is to provide a tool that identifies document numbers such as CNPJ and CPF, people's names, and other kinds of sensitive data, so that companies can find and anonymize data according to their business needs and governance rules.
Applications
Identify PII (Personally Identifiable Information) within the scope of the LGPD
WIP (Add image here)
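As a rough illustration of the anonymization use case, the sketch below masks a value once its label is known. The function name `mask_value`, the `keep_last` parameter, and the masking policy itself are assumptions made for this example; they are not part of the model or its repository. Only the label names (`cnpj`, `cpf`, `nome`, `estado`) come from this model card.

```python
def mask_value(value, label, keep_last=2):
    """Mask a value according to its predicted label (hypothetical policy)."""
    if label in ("cpf", "cnpj"):
        total_digits = sum(ch.isdigit() for ch in value)
        seen = 0
        masked = []
        for ch in value:
            if ch.isdigit():
                seen += 1
                # Keep only the last `keep_last` digits; punctuation is preserved.
                masked.append(ch if seen > total_digits - keep_last else "*")
            else:
                masked.append(ch)
        return "".join(masked)
    if label == "nome":
        # Replace personal names with a fixed placeholder.
        return "[NOME]"
    # Other labels (for example "estado") are returned unchanged in this sketch.
    return value
```

For instance, `mask_value("75528899000119", "cnpj")` would keep only the final two digits. A real deployment would likely follow the organization's own governance rules rather than this fixed policy.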
Usage
To use the model, you need a Hugging Face auth token, which you can create in your Hugging Face account settings.
from transformers import AutoTokenizer, BertForSequenceClassification
import numpy as np

# Map each class index predicted by the model to its label.
pred_mapper = {
    0: "cnpj",
    1: "cpf",
    2: "nome",
    3: "estado"
}

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("FelipeCasali-USP/lgpd_pii_identifier")
lgpd_pii_identifier = BertForSequenceClassification.from_pretrained("FelipeCasali-USP/lgpd_pii_identifier")

# Tokenize the input strings as a padded, truncated batch of PyTorch tensors.
tokens = tokenizer(["String to be analyzed"], return_tensors="pt",
                   padding=True, truncation=True, max_length=512)

# Run the model and pick the highest-scoring class for each input string.
lgpd_pii_identifier_outputs = lgpd_pii_identifier(**tokens)
preds = [pred_mapper[np.argmax(pred)] for pred in lgpd_pii_identifier_outputs.logits.cpu().detach().numpy()]
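The snippet above always returns one of the four labels, because `argmax` picks the best class even when the model is unsure. A minimal sketch of one way to attach a confidence score to each prediction follows. It assumes the logits have been converted to plain Python lists (for example with `lgpd_pii_identifier_outputs.logits.detach().tolist()`). The helper names and the `threshold` value of 0.5 are illustrative assumptions, not something the model card specifies.

```python
import math

# Same label order as pred_mapper in the usage snippet.
PRED_MAPPER = {0: "cnpj", 1: "cpf", 2: "nome", 3: "estado"}


def softmax(logits):
    """Convert one row of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logit_rows, threshold=0.5):
    """Return (label, probability) per row; label is None below the threshold."""
    results = []
    for row in logit_rows:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        # The threshold is an assumed cut-off; tune it on your own data.
        label = PRED_MAPPER[best] if probs[best] >= threshold else None
        results.append((label, probs[best]))
    return results
```

Rows whose top probability falls below the threshold come back with a `None` label, which can be used to flag strings that probably do not belong to any of the four classes.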
Author
Paper
- Paper: WIP
- MBA thesis: lgpd_pii_identifier: Proteção de Dados Sensíveis na Era da Inteligência Artificial