DeBERTa-v3 Base PII & NER Model
This is a fine-tuned token classification model based on microsoft/deberta-v3-base, designed to detect Personally Identifiable Information (PII) and Named Entities. It has additionally been patched, through a second fine-tuning pass, to recognize research Grant IDs and other structured academic funding identifiers in text.
Intended Use
This model is intended for Named Entity Recognition (NER) and PII masking/redaction pipelines. It is particularly well-suited for academic manuscripts, research papers, and general textual data where identifying and masking PII (like author names, affiliations, grant numbers, and contact details) is necessary.
Supported Labels
The model was trained to predict the following entities:
- PER: Persons (First names, last names, middle names, titles)
- ORG: Organizations (Companies, departments, universities, funding agencies)
- LOC: Locations (Cities, countries, states, street addresses, zip codes)
- EMAIL: Email addresses
- URL: Websites and URLs
- PHONE: Phone numbers, mobile numbers
- ID: Identification numbers (Grant IDs, SSNs, Passports, Tax IDs, Account numbers)
- MISC: Miscellaneous PII
Training Data
Base Fine-Tuning: The model was trained on a diverse mixture of 15,000 PII samples to ensure robustness across different domains:
- ~53% (8,000 samples): English split of the ai4privacy/open-pii-masking-500k-ai4privacy dataset (Standard PII).
- ~27% (4,000 samples): cometadata/arxiv-author-affiliations (Academic/STEM manuscript author names and institutions).
- ~20% (3,000 samples): nvidia/Nemotron-PII (High-quality synthetic PII).

The complex labels from these datasets were mapped down to the 8 core categories listed above for simplicity and consistency.
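The label mapping described above can be pictured with a small sketch. The fine-grained label names and the `CORE_LABEL_MAP` dictionary below are illustrative assumptions, not the actual label sets of the source datasets or the mapping used in training; the fallback of unmapped labels to MISC is likewise an assumption.

```python
# The 8 core labels listed under "Supported Labels".
CORE_LABELS = {"PER", "ORG", "LOC", "EMAIL", "URL", "PHONE", "ID", "MISC"}

# Hypothetical mapping from fine-grained source labels to core labels.
# The keys are placeholders, not the datasets' real label names.
CORE_LABEL_MAP = {
    "FIRST_NAME": "PER",
    "LAST_NAME": "PER",
    "UNIVERSITY": "ORG",
    "CITY": "LOC",
    "ZIP_CODE": "LOC",
    "EMAIL_ADDRESS": "EMAIL",
    "WEBSITE": "URL",
    "MOBILE_NUMBER": "PHONE",
    "PASSPORT_NUMBER": "ID",
    "TAX_ID": "ID",
}


def to_core_label(fine_label: str) -> str:
    """Map a fine-grained label to a core label (assumed fallback: MISC)."""
    return CORE_LABEL_MAP.get(fine_label.upper(), "MISC")


def remap_spans(spans):
    """Rewrite the 'label' of each span dict, leaving offsets untouched."""
    return [{**span, "label": to_core_label(span["label"])} for span in spans]


example = [
    {"start": 0, "end": 4, "label": "FIRST_NAME"},
    {"start": 20, "end": 26, "label": "CITY"},
    {"start": 30, "end": 41, "label": "BLOOD_TYPE"},  # not in the map
]
remapped = remap_spans(example)
print([span["label"] for span in remapped])
```

Collapsing labels this way keeps the character offsets intact, so the same spans can later be aligned to tokens regardless of which source dataset they came from.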
Grant ID Patching: To specifically support academic parsing, the model underwent a secondary rapid fine-tuning using a balanced mix to prevent catastrophic forgetting:
- 50% (2,000 samples): Custom synthetic dataset of research funding acknowledgments. This patch allows the model to accurately detect agency names (as ORG) and Grant/Award Numbers (as ID).
- 50% (2,000 samples): Baseline data from ai4privacy to retain knowledge of standard PII.
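A synthetic funding-acknowledgment sample of the kind described above could be generated roughly as follows. This is a minimal sketch, not the generator used for this model: the agency list, the grant-number format, the sentence template, and the function name `make_sample` are all hypothetical.

```python
import random

# Hypothetical inputs: neither the agencies nor the ID format come from the
# actual training set; they only illustrate the span-labelling idea.
AGENCIES = ["National Science Foundation", "European Research Council"]
TEMPLATE = "This work was supported by the {org} under Grant No. {gid}."


def make_sample(rng: random.Random) -> dict:
    """Build one sentence with character-offset spans for ORG and ID."""
    org = rng.choice(AGENCIES)
    gid = f"{rng.choice(['NSF', 'ERC'])}-{rng.randint(1000000, 9999999)}"
    text = TEMPLATE.format(org=org, gid=gid)
    spans = []
    for value, label in ((org, "ORG"), (gid, "ID")):
        start = text.index(value)  # each value occurs once in the template
        spans.append({"start": start, "end": start + len(value), "label": label})
    return {"text": text, "spans": spans}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = make_sample(rng)
for span in sample["spans"]:
    print(span["label"], "->", sample["text"][span["start"]:span["end"]])
```

Mixing such samples one-to-one with baseline PII examples, as in the 50/50 split above, is what lets the patch add the new pattern while still rehearsing the original labels.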
Evaluation
Performance Metrics
During the primary fine-tuning phase (evaluated on a proportionally mixed hold-out set of 2,000 samples), the model achieved the following overall metrics:
- Overall F1 Score: 0.9556
- Overall Accuracy: 0.9949
- Overall Precision: 0.9559
- Overall Recall: 0.9553
Entity-Level Breakdown
The model performs best on structured entities such as EMAIL, URL, and ID (F1 above 0.99), and still exceeds 0.90 F1 on more ambiguous entities such as PER (Persons).
| Entity Type | Precision | Recall | F1-Score |
|---|---|---|---|
| EMAIL | 0.9975 | 0.9987 | 0.9981 |
| URL | 0.9975 | 0.9963 | 0.9969 |
| ID | 0.9936 | 0.9936 | 0.9936 |
| ORG | 0.9793 | 0.9945 | 0.9869 |
| LOC | 0.9725 | 0.9809 | 0.9767 |
| MISC | 0.9754 | 0.9744 | 0.9749 |
| PHONE | 0.9525 | 0.9454 | 0.9489 |
| PER | 0.9077 | 0.9030 | 0.9053 |
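As a sanity check on the reported numbers, each F1 score should equal the harmonic mean of its precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded values in this section; the 1e-4 tolerance is an assumption that allows for the four-decimal rounding of the inputs.

```python
# (precision, recall, reported F1) copied from the metrics in this section.
REPORTED = {
    "Overall": (0.9559, 0.9553, 0.9556),
    "EMAIL": (0.9975, 0.9987, 0.9981),
    "URL": (0.9975, 0.9963, 0.9969),
    "ID": (0.9936, 0.9936, 0.9936),
    "ORG": (0.9793, 0.9945, 0.9869),
    "LOC": (0.9725, 0.9809, 0.9767),
    "MISC": (0.9754, 0.9744, 0.9749),
    "PHONE": (0.9525, 0.9454, 0.9489),
    "PER": (0.9077, 0.9030, 0.9053),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (p, r, reported) in REPORTED.items():
    recomputed = f1_score(p, r)
    # Tolerance covers rounding of the published precision and recall.
    status = "ok" if abs(recomputed - reported) < 1e-4 else "mismatch"
    print(f"{name:8s} recomputed={recomputed:.4f} reported={reported:.4f} {status}")
```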
Out-of-Distribution (OOD) Testing
The model was also evaluated on Out-of-Distribution (OOD) data to measure its zero-shot generalization. It was tested on entirely unseen data from cometadata/arxiv-author-affiliations (arXiv test split), which includes medical, mathematical, and computer science manuscripts.
Qualitative evaluation on these real-world academic papers showed accurate detection of author names (PER), academic institutions (ORG), locations (LOC), and email addresses (EMAIL), with no observed catastrophic forgetting of standard PII formats.
In addition, tests on structured sections (e.g., standard "Acknowledgments" sections, footnotes, and funding disclosures) indicate that the model reliably detects research grant and award numbers as ID entities, along with the associated funding agencies as ORG entities.
Note: The detailed, interactive HTML evaluation reports have been uploaded to this repository for full transparency.
How to Use
You can easily use this model with the Hugging Face pipeline:
```python
from transformers import pipeline

# Load the model
pii_pipeline = pipeline(
    "token-classification",
    model="your-username/deberta-v3-base-pii-ner",
    aggregation_strategy="simple",
)

# Run inference
text = "This research was funded by the National Science Foundation (Grant No. NSF-1234567). Please contact Dr. Jane Doe at j.doe@university.edu."
results = pii_pipeline(text)

# Each result is one aggregated entity span
for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Confidence: {entity['score']:.2f}")
```
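For the masking/redaction use case, the pipeline output can be turned into redacted text using the character offsets (`start`, `end`) that the token-classification pipeline returns alongside `entity_group` and `score`. The helper below is a minimal sketch: the function name `redact`, the `[LABEL]` placeholder format, and the 0.5 score threshold are assumptions, and the entity list is hand-written to stand in for real model output.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder."""
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from the end of the string so earlier offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


sample_text = "Please contact Dr. Jane Doe at j.doe@university.edu."
# Hand-written stand-in for pipeline output; real scores and spans will differ.
sample_entities = [
    {"entity_group": "PER", "score": 0.98, "start": 19, "end": 27},
    {"entity_group": "EMAIL", "score": 0.99, "start": 31, "end": 51},
]
redacted = redact(sample_text, sample_entities)
print(redacted)
```

This sketch assumes the spans do not overlap, which is the usual case when an aggregation strategy is set; overlapping spans would need to be merged before replacement.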