PII Detection Model — IndicBERTv2
A token classification model for detecting and redacting Personally Identifiable Information (PII) in English, Hindi (Devanagari script), Hinglish (Hindi written in Latin script, often mixed with English), and mixed Devanagari-English text.
Built on ai4bharat/IndicBERTv2-MLM-only, a 278M-parameter BERT model pretrained on 20.9B tokens across 24 languages (23 Indic languages plus English).
Supported Languages
- English — names, addresses, phone numbers, SSN, etc.
- Hindi (Devanagari) — राजेश कुमार, मुंबई, बीस हज़ार रुपये, पंद्रह मार्च
- Hinglish — "Mera naam Rajesh hai aur main Mumbai mein rehta hoon"
- Mixed Devanagari + English — "मेरा phone number 9876543210 है"
Entity Types (31)
| Entity | Description | Example |
|---|---|---|
| FIRSTNAME | First name | Rajesh, राजेश, John |
| LASTNAME | Last name | Kumar, कुमार, Smith |
| MIDDLENAME | Middle name | Kumar |
| PREFIX | Title/prefix | Mr, श्री, Dr, श्रीमती |
| GENDER | Gender | male, female |
| SEX | Sex | M, F |
| AGE | Age | 35 |
| DOB | Date of birth | 15/03/1990 |
| DATE | General date | पंद्रह मार्च, March fifteenth |
| EMAIL | Email address | priya@gmail.com |
| PHONENUMBER | Phone number | +91 98765 43210 |
| CITY | City | Mumbai, मुंबई, Boston |
| STATE | State | Maharashtra, महाराष्ट्र |
| COUNTY | County | Cook County |
| ZIPCODE | ZIP/PIN code | 400001, 02101 |
| STREET | Street name | MG Road, Oak Avenue |
| BUILDINGNUMBER | Building number | 42 |
| SECONDARYADDRESS | Apt/Suite | Flat 301 |
| COMPANYNAME | Company | Infosys, टाटा कंसल्टेंसी |
| ACCOUNTNUMBER | Account number | 9876543210 |
| ACCOUNTNAME | Account name | Tata Consultancy |
| CREDITCARDNUMBER | Credit card | 4111-1111-1111-1111 |
| CREDITCARDCVV | CVV | 123 |
| CREDITCARDISSUER | Card issuer | Visa, HDFC |
| SSN | SSN/PAN/Aadhaar | 123-45-6789, ABCDE1234F |
| IBAN | IBAN | IN89UTIB00001234567890 |
| PIN | ATM/Security PIN | 4098 |
| PASSWORD | Password | S3cur3P@ss! |
| USERNAME | Username | mdavis |
| URL | URL | www.example.com |
| AMOUNT | Money amount | 50000, बीस हज़ार रुपये |
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
model = AutoModelForTokenClassification.from_pretrained("hiteshwadhwani/pii-model-indicv2")
tokenizer = AutoTokenizer.from_pretrained("hiteshwadhwani/pii-model-indicv2")
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")
# English
results = ner("Mr John Smith lives at 456 Oak Avenue Boston")
# Hinglish
results = ner("Mera naam Rajesh Kumar hai aur main Mumbai mein rehta hoon")
# Hindi (Devanagari)
results = ner("मेरा नाम राजेश कुमार है और मैं मुंबई में रहता हूं")
# Hindi amounts and dates in words
results = ner("आपके क्रेडिट कार्ड के बीस हज़ार रूपये शुल्क लंबित हैं, जो पंद्रह मार्च को देना था।")
# Print each detected entity with its label and confidence score
for entity in results:
    if entity["entity_group"] != "O":
        print(f"{entity['word']} → {entity['entity_group']} ({entity['score']:.2f})")
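With `aggregation_strategy` set, each item the pipeline returns is a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below groups such items by label. `group_by_label` is a hypothetical helper, not part of the model or of transformers, and the `sample` list is hand-written for illustration: its scores are not real model output.

```python
from collections import defaultdict


def group_by_label(results, threshold=0.5):
    """Collect detected words per entity label, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] >= threshold:
            grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (illustrative values only).
sample = [
    {"entity_group": "FIRSTNAME", "score": 0.99, "word": "Rajesh", "start": 10, "end": 16},
    {"entity_group": "LASTNAME", "score": 0.98, "word": "Kumar", "start": 17, "end": 22},
    {"entity_group": "CITY", "score": 0.42, "word": "Mumbai", "start": 36, "end": 42},
]

print(group_by_label(sample))
# {'FIRSTNAME': ['Rajesh'], 'LASTNAME': ['Kumar']}
```

The CITY item is dropped here only because its made-up score falls below the default threshold; lowering `threshold` would keep it.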
Redaction Example
def redact_pii(text, ner_pipeline, threshold=0.5):
    results = ner_pipeline(text)
    # Keep confident, non-"O" predictions, ordered by position in the text
    entities = [r for r in results if r["score"] >= threshold and r["entity_group"] != "O"]
    entities.sort(key=lambda x: x["start"])

    # Merge touching or overlapping spans that share a label
    merged = []
    for ent in entities:
        label = ent["entity_group"]
        if merged and merged[-1]["label"] == label and ent["start"] <= merged[-1]["end"] + 1:
            merged[-1]["end"] = max(merged[-1]["end"], ent["end"])
        else:
            merged.append({"label": label, "start": ent["start"], "end": ent["end"]})

    # Replace from right to left so earlier character offsets stay valid
    redacted = text
    for span in reversed(merged):
        redacted = redacted[:span["start"]] + f"[{span['label']}]" + redacted[span["end"]:]
    return redacted

print(redact_pii("Shri Rajesh Kumar lives at 42 MG Road Bengaluru Karnataka", ner))
# [PREFIX] [FIRSTNAME] [LASTNAME] lives at [BUILDINGNUMBER] [STREET] [CITY] [STATE]
print(redact_pii("आपके बीस हज़ार रूपये शुल्क लंबित हैं, जो पंद्रह मार्च को देना था।", ner))
# आपके [AMOUNT] शुल्क लंबित हैं, जो [DATE] को देना था।
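The redaction function substitutes spans in reverse order. Each substitution changes the length of the string, so working from the end keeps the character offsets of earlier spans valid. A minimal sketch of just that step, using a hypothetical `apply_spans` helper and hand-written spans rather than model output:

```python
def apply_spans(text, spans):
    """Replace each [start, end) span with a [LABEL] placeholder."""
    # Work right to left: a replacement changes the string length, so
    # editing from the end leaves the offsets of earlier spans untouched.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


# Hand-written spans for illustration (not model output).
spans = [
    {"label": "FIRSTNAME", "start": 0, "end": 6},
    {"label": "CITY", "start": 16, "end": 22},
]

print(apply_spans("Rajesh lives in Mumbai", spans))
# [FIRSTNAME] lives in [CITY]
```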
Evaluation Results
| Metric | Score |
|---|---|
| Overall F1 | 0.9497 |
| Precision | 0.9464 |
| Recall | 0.9530 |
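The three overall scores are mutually consistent if F1 is the harmonic mean of the reported precision and recall. That is an assumption here, since the card does not state how the overall scores were averaged. A quick check:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.9464, 0.9530)
print(f"{f1:.4f}")
# 0.9497
```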
Per-Entity F1
| Entity | F1 | Entity | F1 |
|---|---|---|---|
| FIRSTNAME | 0.99 | LASTNAME | 0.99 |
| CITY | 0.99 | STATE | 0.99 |
| PHONENUMBER | 0.98 | EMAIL | 0.93 |
| DOB | 0.96 | DATE | 0.95 |
| COMPANYNAME | 0.97 | PREFIX | 0.98 |
| CREDITCARDNUMBER | 0.95 | CREDITCARDISSUER | 0.95 |
| URL | 0.99 | USERNAME | 0.99 |
| MIDDLENAME | 1.00 | ACCOUNTNUMBER | 0.91 |
| PASSWORD | 0.91 | ZIPCODE | 0.95 |
| AMOUNT | 0.86 | STREET | 0.91 |
Why IndicBERTv2?
| | indic-bert (v1) | IndicBERTv2 (this model) |
|---|---|---|
| Parameters | 32M | 278M |
| Architecture | ALBERT (shared layers) | BERT (unique layers) |
| Languages | 12 | 24 |
| Training corpus | ~9B tokens | 20.9B tokens |
| Devanagari names | Partial subword issues | Clean detection |
| Hindi amounts (बीस हज़ार) | Not detected | Detected |
| Hindi dates (पंद्रह मार्च) | Not detected | Detected |
| Unusual names (Viteshwar) | Broken subword alignment | Clean detection |
Training Details
- Base model: ai4bharat/IndicBERTv2-MLM-only (278M params, 24 languages)
- Task: Token Classification (NER with BIO tagging)
- Epochs: 10
- Learning rate: 2e-5
- Batch size: 16
- Optimizer: AdamW
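In BIO tagging, the first token of an entity is tagged `B-<LABEL>`, continuation tokens `I-<LABEL>`, and everything else `O`. The sketch below shows how such tags decode into entity spans. `bio_to_spans` is a hypothetical helper, and the token and tag lists are hand-written for illustration, not taken from the training data or from model output.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token / BIO-tag lists into (label, text) spans."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and label == tag[2:]:
            words.append(token)  # continue the currently open entity
            continue
        if label is not None:
            spans.append((label, " ".join(words)))  # close the open entity
            label, words = None, []
        if tag.startswith("B-"):
            label, words = tag[2:], [token]  # open a new entity
        # "O" tags and stray "I-" tags with no matching open entity are skipped
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


tokens = ["Mr", "John", "Smith", "lives", "at", "456", "Oak", "Avenue", "Boston"]
tags = ["B-PREFIX", "B-FIRSTNAME", "B-LASTNAME", "O", "O",
        "B-BUILDINGNUMBER", "B-STREET", "I-STREET", "B-CITY"]

for span_label, span_text in bio_to_spans(tokens, tags):
    print(f"{span_label}: {span_text}")
```

The two-token street name "Oak Avenue" comes out as a single STREET span because its second token carries the `I-STREET` tag.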
Limitations
- SEX entity has low F1 — overlaps with GENDER
- SECONDARYADDRESS detection is weak
- Latency is higher than v1 (~25-40ms vs ~8ms) due to larger model size
- Best suited for Indian and US PII patterns
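Latency figures like those above depend on hardware, input length, and batch size, none of which the card specifies. A minimal sketch for measuring per-call latency on your own setup follows. `measure_latency_ms` is a hypothetical helper, and the commented usage line assumes the `ner` pipeline from the Usage section.

```python
import statistics
import time


def measure_latency_ms(fn, text, warmup=3, runs=20):
    """Return the median wall-clock latency of fn(text) in milliseconds."""
    for _ in range(warmup):
        fn(text)  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Usage (assumes `ner` is the pipeline built in the Usage section):
# print(measure_latency_ms(ner, "Mr John Smith lives at 456 Oak Avenue Boston"))
```

The median is used rather than the mean so that a single slow call does not skew the result.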
License
Apache 2.0