
PII Detector: Named Entity Recognition (NER) for Personally Identifiable Information (PII)

Overview

This project fine-tunes DistilBERT for Named Entity Recognition (NER) to detect Personally Identifiable Information (PII) in text. It identifies several types of PII: names, email addresses, usernames, ID numbers, phone numbers, URLs, and street addresses.

Features

Synthetic data generation for PII-related entities.

Token classification using BIO tagging format.

Fine-tuning of DistilBERT for PII detection.

Model training with Hugging Face's Trainer API.

Inference pipeline for real-time PII detection.

Dataset

The dataset is synthetically generated using the Faker library. It includes:

Student names

Emails

Usernames

ID numbers

Phone numbers

Personal URLs

Street addresses

Each sentence is labeled with corresponding entities in BIO tagging format.
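To make the labeling scheme concrete, here is a minimal sketch of BIO tagging over whitespace-split words. The helper `bio_tags` and the label names `NAME` and `EMAIL` are illustrative assumptions; the project's actual label set is not listed in this card.

```python
def bio_tags(words, spans):
    """Return one BIO tag per word.

    `spans` is a list of (start, end, label) word-index ranges, end exclusive.
    Label names are hypothetical; the project's real label set may differ.
    """
    tags = ["O"] * len(words)               # "O" marks words outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation words
    return tags


words = "Contact Jane Doe at jane.doe@example.com".split()
tags = bio_tags(words, [(1, 3, "NAME"), (4, 5, "EMAIL")])
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

The `B-` prefix marks where an entity begins, so two adjacent entities of the same type can still be told apart.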

Installation

To set up the project, install the necessary dependencies:

pip install torch transformers datasets faker

Usage

  1. Generate Synthetic Data

Run the generate_synthetic_data function to create labeled text samples with PII entities.
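The body of generate_synthetic_data is not shown in this card, so the following is only a sketch of how such a generator might look with Faker. The sentence templates, the label names, the `fake` and `seed` parameters, and the output keys `tokens` and `ner_tags` are all assumptions made for illustration.

```python
import random


def generate_synthetic_data(num_samples, fake=None, seed=0):
    """Return a list of {"tokens": [...], "ner_tags": [...]} samples.

    `fake` is any object exposing Faker-style provider methods; when it is
    omitted, a Faker instance is created. Templates and labels are assumptions.
    """
    if fake is None:
        from faker import Faker  # imported lazily so a stub can be injected
        fake = Faker()
    rng = random.Random(seed)
    # (sentence template ending in the entity, label, value provider)
    templates = [
        ("My name is {}", "NAME", lambda f: f.name()),
        ("You can email me at {}", "EMAIL", lambda f: f.email()),
        ("My username is {}", "USERNAME", lambda f: f.user_name()),
        ("My student ID is {}", "ID_NUM", lambda f: f.numerify("#########")),
        ("Call me at {}", "PHONE_NUM", lambda f: f.phone_number()),
        ("My website is {}", "URL", lambda f: f.url()),
        ("I live at {}", "STREET_ADDRESS", lambda f: f.street_address()),
    ]
    samples = []
    for _ in range(num_samples):
        template, label, provider = rng.choice(templates)
        value_words = provider(fake).split()
        prefix_words = template.format("").split()
        tokens = prefix_words + value_words
        # Context words are "O"; the entity gets one B- tag, then I- tags.
        tags = ["O"] * len(prefix_words)
        tags += [f"B-{label}"] + [f"I-{label}"] * (len(value_words) - 1)
        samples.append({"tokens": tokens, "ner_tags": tags})
    return samples
```

With Faker installed, `generate_synthetic_data(1000)` would yield 1000 labeled samples. Placing the entity at the end of every template keeps the sketch short; a realistic generator would vary entity position and mix several entities per sentence.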

  2. Tokenize and Align Labels

The function tokenize_and_align_labels tokenizes the input text with the Hugging Face tokenizer and aligns the word-level entity labels with the resulting subword tokens.
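A common way to do this alignment is sketched below; the project's own function may differ, for instance by labeling every subword rather than only the first. The helper `align_labels`, the column names `tokens` and `ner_tags`, and passing `tokenizer` as a parameter are assumptions. The sketch relies on a fast tokenizer, whose output exposes `word_ids`.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label ids onto subword tokens.

    `word_ids` has one entry per token: the index of the source word, or None
    for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)           # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first subword of a word
        else:
            aligned.append(ignore_index)           # later subwords are masked
        previous = word_id
    return aligned


def tokenize_and_align_labels(examples, tokenizer):
    """Batch function intended for `Dataset.map(..., batched=True)`.

    Assumes columns "tokens" (lists of words) and "ner_tags" (lists of
    integer label ids), and a fast tokenizer that provides `word_ids`.
    """
    tokenized = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    tokenized["labels"] = [
        align_labels(tokenized.word_ids(batch_index=i), labels)
        for i, labels in enumerate(examples["ner_tags"])
    ]
    return tokenized
```

The value -100 is the default index that PyTorch's cross-entropy loss ignores, so masked positions do not contribute to training.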

  3. Train the Model

Execute the training pipeline using:

trainer.train()

This will fine-tune DistilBERT on the labeled dataset.
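The card does not show how `trainer` is constructed. One plausible setup, using the batch size, epoch count, and per-epoch evaluation listed under Model Configuration, is sketched here. The helper names `build_label_maps` and `build_trainer`, the entity type names, and the output directory are assumptions.

```python
def build_label_maps(entity_types):
    """Build BIO label <-> id mappings; entity type names are assumptions."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def build_trainer(train_dataset, eval_dataset, label2id, id2label):
    """Assemble a Trainer for token classification on tokenized datasets."""
    # Imported here so the label helper above works without transformers.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        DataCollatorForTokenClassification,
        Trainer,
        TrainingArguments,
    )

    checkpoint = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )
    args = TrainingArguments(
        output_dir="./pii_detector_checkpoints",  # hypothetical path
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        # Named `evaluation_strategy` in older transformers releases.
        eval_strategy="epoch",
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # Pads input ids and labels to the longest sequence in each batch.
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
```

Trainer moves the model to a CUDA device automatically when one is available, so no explicit device handling appears in the sketch. Recent transformers releases may also need the accelerate package installed for Trainer to run.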

  4. Save the Model

The trained model is saved using:

trainer.save_model("./pii_detector")

  5. Run Inference

To detect PII in a given text, use:

pii_detection("Sample text with PII information")

This will return identified entities along with their labels.
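The definition of pii_detection is not included in this card. A minimal sketch built on the transformers token-classification pipeline could look like the following; the `ner` and `min_score` parameters, the 0.5 threshold, and the keys of the returned dictionaries are assumptions.

```python
def pii_detection(text, ner=None, min_score=0.5):
    """Return detected entities as dicts with label, text, start, end, score.

    `ner` is a token-classification pipeline, or any callable with the same
    output format. The default model path and threshold are assumptions.
    """
    if ner is None:
        from transformers import pipeline  # lazy import; needs the saved model
        ner = pipeline(
            "token-classification",
            model="./pii_detector",
            aggregation_strategy="simple",  # merge B-/I- pieces into one span
        )
    results = []
    for entity in ner(text):
        if entity["score"] < min_score:
            continue                        # drop low-confidence predictions
        results.append({
            "label": entity["entity_group"],
            # Slice the original text so casing and spacing are preserved.
            "text": text[entity["start"]:entity["end"]],
            "start": entity["start"],
            "end": entity["end"],
            "score": float(entity["score"]),
        })
    return results
```

Loading the pipeline from ./pii_detector assumes the tokenizer files were saved alongside the model weights. For repeated calls, build the pipeline once and pass it in as `ner` so the model is not reloaded on every call.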

Model Configuration

Base Model: distilbert-base-uncased

Tokenizer: AutoTokenizer

Training Parameters:

Batch size: 16

Number of epochs: 3

Evaluation strategy: Per epoch

Device: CUDA if available, otherwise CPU

Output

The model returns a list of detected PII entities with their respective labels and positions in the text.

License

This project is open-source and can be used for educational and research purposes.

Model size: 66.4M params (Safetensors, tensor type F32)