PII Detector: Named Entity Recognition (NER) for Personally Identifiable Information (PII)
Overview
This project fine-tunes DistilBERT for Named Entity Recognition (NER) to detect Personally Identifiable Information (PII) in text. It identifies names, emails, usernames, ID numbers, phone numbers, URLs, and street addresses.
Features
Synthetic data generation for PII-related entities.
Token classification using BIO tagging format.
Fine-tuning of DistilBERT for PII detection.
Model training with Hugging Face's Trainer API.
Inference pipeline for real-time PII detection.
Dataset
The dataset is synthetically generated using the Faker library. It includes:
Student names
Emails
Usernames
ID numbers
Phone numbers
Personal URLs
Street addresses
Each sentence is labeled in BIO tagging format: B- marks the first token of an entity, I- marks a continuation token, and O marks a token outside any entity.
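As an illustration of the labeling scheme, here is a minimal sketch of how a synthetic sentence could be assembled with BIO tags. The helper name build_sample and the label names NAME and EMAIL are assumptions for this example, not taken from the project code. The entity values are hard-coded here; in the project they would presumably come from Faker calls such as Faker().name() and Faker().email().

```python
def build_sample(parts):
    """Build (words, tags) from (text, label) pairs.

    A label of None marks plain text outside any entity.
    """
    words, tags = [], []
    for text, label in parts:
        for i, word in enumerate(text.split()):
            words.append(word)
            if label is None:
                tags.append("O")
            else:
                # First word of an entity gets B-, the rest get I-.
                tags.append(("B-" if i == 0 else "I-") + label)
    return words, tags


# Hypothetical label names; the project's actual label set may differ.
words, tags = build_sample([
    ("Contact", None),
    ("Jane Doe", "NAME"),
    ("at", None),
    ("jane.doe@example.com", "EMAIL"),
])

for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

Because the generator knows which values it inserted, the tags can be produced exactly, with no manual annotation.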
Installation
To set up the project, install the necessary dependencies:
pip install torch transformers datasets faker
Usage
- Generate Synthetic Data
Run the generate_synthetic_data function to create labeled text samples with PII entities.
- Tokenize and Align Labels
The function tokenize_and_align_labels tokenizes the input text with the Hugging Face tokenizer and aligns the word-level entity labels with the resulting subword tokens.
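The alignment step can be sketched as follows. This is not the project's tokenize_and_align_labels, whose body is not shown here; it is a minimal stand-alone version of a common approach. It assumes the word-to-token mapping comes from a fast tokenizer's word_ids() method, and that special tokens and non-initial subwords are masked with -100, the index PyTorch's cross-entropy loss ignores by default. The helper name align_labels and the sample label ids are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label ids onto subword tokens.

    word_ids: one entry per token; None for special tokens, otherwise
              the index of the word the token belongs to.
    word_labels: one label id per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # e.g. [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword keeps the label
        else:
            aligned.append(ignore_index)          # later subwords are masked
        previous = word_id
    return aligned


# Three words; the second word is split into two subword tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 2]  # hypothetical ids, e.g. O, B-NAME, I-NAME
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Masking the later subwords means each word contributes once to the loss. Labeling every subword instead is an equally valid design choice.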
- Train the Model
Execute the training pipeline using:
trainer.train()
This will fine-tune DistilBERT on the labeled dataset.
- Save the Model
The trained model is saved using:
trainer.save_model("./pii_detector")
- Run Inference
To detect PII in a given text, use:
pii_detection("Sample text with PII information")
This will return identified entities along with their labels.
Model Configuration
Base Model: distilbert-base-uncased
Tokenizer: AutoTokenizer
Training Parameters:
Batch size: 16
Number of epochs: 3
Evaluation strategy: Per epoch
Device: CUDA (if available)
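As a rough guide, the parameters above could map onto keyword arguments for transformers.TrainingArguments as sketched below. The output directory is a hypothetical placeholder, and using the same batch size for evaluation is an assumption. Trainer moves the model to a CUDA device on its own when one is available, so no device argument appears.

```python
# Keyword arguments intended for transformers.TrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "./pii_detector_checkpoints",  # hypothetical path
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,            # assumed equal to the training batch size
    "num_train_epochs": 3,
    # Named `evaluation_strategy` in older transformers releases.
    "eval_strategy": "epoch",
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```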
Output
The model returns a list of detected PII entities with their respective labels and positions in the text.
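To show how per-token BIO predictions can become a list of entities with positions, here is a minimal sketch. It is not the project's pii_detection function, which this README does not show. The helper name merge_bio_spans, the output keys, and the label names are assumptions. Offsets are computed from whitespace-separated words for simplicity; with a fast tokenizer they would typically come from return_offsets_mapping=True.

```python
import re


def merge_bio_spans(text, offsets, tags):
    """Merge per-token BIO tags into entity spans with character positions."""
    entities = []
    current = None
    for (start, end), tag in zip(offsets, tags):
        label = tag[2:] if tag != "O" else None
        continues = (
            tag.startswith("I-")
            and current is not None
            and current["label"] == label
        )
        if continues:
            current["end"] = end  # extend the open entity
            continue
        if current is not None:
            entities.append(current)  # close the open entity
            current = None
        if label is not None:
            # A B- tag (or a stray I- tag) opens a new entity.
            current = {"label": label, "start": start, "end": end}
    if current is not None:
        entities.append(current)
    for entity in entities:
        entity["text"] = text[entity["start"]:entity["end"]]
    return entities


text = "Email Jane Doe at jane@example.com"
offsets = [m.span() for m in re.finditer(r"\S+", text)]
tags = ["O", "B-NAME", "I-NAME", "O", "B-EMAIL"]  # hypothetical predictions

entities = merge_bio_spans(text, offsets, tags)
for entity in entities:
    print(entity)
```

Each resulting dictionary carries the label, the start and end character offsets, and the matched text, so callers can redact or highlight the span directly.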
License
This project is open-source and can be used for educational and research purposes.