---
language: pt
license: apache-2.0

widget:
- text: "123.456.789-0"
  example_title: "CPF"
- text: "75528899000119"
  example_title: "CNPJ (no punctuation)"
- text: "Felipe Casali Silva"
  example_title: "Full name"
- text: "Felipe Casali Silva, Teste, Rio de Janeiro, RJ"
  example_title: "Miscellaneous data"
---

# lgpd_pii_identifier: LGPD PII Identifier

lgpd_pii_identifier is a pre-trained NLP model that identifies sensitive data within the scope of the LGPD (Lei Geral de Proteção de Dados), Brazil's general data protection law.

The goal is to provide a tool that identifies document numbers such as CNPJ and CPF, people's names, and other kinds of sensitive data, so that companies can find and anonymize
data according to their business needs and governance rules.

## Applications

### Identify PII (Personally Identifiable Information) in the scope of LGPD

*WIP: image to be added here.*


## Usage

To use the model, you need a Hugging Face auth token. You can create one [here](https://huggingface.co/settings/tokens).

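```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import numpy as np

# Maps the predicted class index to its label
pred_mapper = {
    0: "cnpj",
    1: "cpf",
    2: "nome",
    3: "estado",
}

tokenizer = DistilBertTokenizer.from_pretrained("FelipeCasali-USP/lgpd_pii_identifier")
# The sequence-classification class is used because the base DistilBertModel
# returns hidden states only and has no `.logits` attribute
lgpd_pii_identifier = DistilBertForSequenceClassification.from_pretrained("FelipeCasali-USP/lgpd_pii_identifier")

# Tokenize a batch of strings; inputs longer than 512 tokens are truncated
tokens = tokenizer(["String to be analyzed"], return_tensors="pt",
                   padding=True, truncation=True, max_length=512)
lgpd_pii_identifier_outputs = lgpd_pii_identifier(**tokens)

# Pick the highest-scoring class for each input string
preds = [pred_mapper[np.argmax(pred)] for pred in lgpd_pii_identifier_outputs.logits.cpu().detach().numpy()]
```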
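
Once each string has a predicted label, the anonymization step mentioned above could be sketched roughly as follows. This is only an illustration, not part of the model: the `anonymize` helper, the `SENSITIVE_LABELS` set, and the `<LABEL>` placeholder format are all hypothetical, and the `labels` list is hard-coded here as a stand-in for the `preds` list produced by the snippet above.

```python
from typing import Iterable, List, Set

# Labels treated as sensitive. This choice is an assumption for illustration;
# adjust it to your own governance rules.
SENSITIVE_LABELS: Set[str] = {"cnpj", "cpf", "nome"}


def anonymize(values: Iterable[str], labels: Iterable[str],
              sensitive: Set[str] = SENSITIVE_LABELS) -> List[str]:
    """Replace each value whose label is sensitive with a tag such as "<CPF>"."""
    return [
        f"<{label.upper()}>" if label in sensitive else value
        for value, label in zip(values, labels)
    ]


values = ["75528899000119", "Felipe Casali Silva", "RJ"]
# Hard-coded stand-in for model predictions (hypothetical, not real model output)
labels = ["cnpj", "nome", "estado"]
masked = anonymize(values, labels)
print(masked)
```

Keeping the set of sensitive labels as a parameter lets each company decide which categories to mask without changing the classification step.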
## Author

- [Felipe Casali](https://www.linkedin.com/in/felipecasali/)

## Paper

- Paper: WIP
- MBA thesis: [lgpd_pii_identifier: Proteção de Dados Sensíveis na Era da Inteligência Artificial](WIP)