DeBERTa-v3 Base PII & NER Model

This is a token classification model fine-tuned from microsoft/deberta-v3-base to detect Personally Identifiable Information (PII) and named entities. It received an additional fine-tuning pass (a "patch") so that it also recognizes research grant IDs and other structured academic funding identifiers in text.

Intended Use

This model is intended for Named Entity Recognition (NER) and PII masking/redaction pipelines. It is particularly well-suited for academic manuscripts, research papers, and general textual data where identifying and masking PII (like author names, affiliations, grant numbers, and contact details) is necessary.

Supported Labels

The model was trained to predict the following entities:

PER: Persons (First names, last names, middle names, titles)
ORG: Organizations (Companies, departments, universities, funding agencies)
LOC: Locations (Cities, countries, states, street addresses, zip codes)
EMAIL: Email addresses
URL: Websites and URLs
PHONE: Phone numbers, mobile numbers
ID: Identification numbers (Grant IDs, SSNs, Passports, Tax IDs, Account numbers)
MISC: Miscellaneous PII

Training Data

Base Fine-Tuning: The model was trained on a diverse mixture of 15,000 PII samples to ensure robustness across different domains:
- ~53% (8,000 samples): English split of the ai4privacy/open-pii-masking-500k-ai4privacy dataset (Standard PII).
- ~27% (4,000 samples): cometadata/arxiv-author-affiliations (Academic/STEM manuscript author names and institutions).
- ~20% (3,000 samples): nvidia/Nemotron-PII (High-quality synthetic PII).
The complex labels from these datasets were mapped down to the 8 core categories listed above for simplicity and consistency.
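The label collapse described above can be sketched as a lookup table. The fine-grained label names below (GIVENNAME, SURNAME, CITY, and so on) are hypothetical stand-ins, since the source datasets' exact label inventories and the mapping actually used are not listed here. The BIO-prefix handling is likewise an assumption about the tagging scheme.

```python
# Hypothetical fine-grained labels -> the 8 core categories.
# The source label names are illustrative assumptions, not the datasets' real schemas.
CORE_LABELS = {"PER", "ORG", "LOC", "EMAIL", "URL", "PHONE", "ID", "MISC"}

FINE_TO_CORE = {
    "GIVENNAME": "PER",
    "SURNAME": "PER",
    "TITLE": "PER",
    "COMPANY": "ORG",
    "UNIVERSITY": "ORG",
    "CITY": "LOC",
    "ZIPCODE": "LOC",
    "EMAIL": "EMAIL",
    "URL": "URL",
    "TELEPHONENUM": "PHONE",
    "SOCIALNUM": "ID",
    "PASSPORTNUM": "ID",
}


def to_core(tag: str) -> str:
    """Map one tag (optionally BIO-prefixed) to a core tag.

    Anything that is not "O" and has no entry in the table falls back to MISC.
    """
    if tag == "O":
        return "O"
    prefix, sep, fine = tag.partition("-")
    if sep and prefix in {"B", "I"}:
        return f"{prefix}-{FINE_TO_CORE.get(fine, 'MISC')}"
    return FINE_TO_CORE.get(tag, "MISC")
```

Routing unknown labels to MISC keeps every annotated span in the training signal instead of silently dropping it. Whether the original preprocessing did the same is not stated.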
Grant ID Patching: To support academic parsing, the model then went through a second, short fine-tuning pass on a balanced mix of new and baseline data, chosen to prevent catastrophic forgetting:
- 50% (2,000 samples): Custom synthetic dataset of research funding acknowledgments. This patch allows the model to accurately detect agency names (as ORG) and Grant/Award Numbers (as ID).
- 50% (2,000 samples): Baseline data from ai4privacy to retain knowledge of standard PII.
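A minimal sketch of how such a 50/50 mix might be assembled, assuming both pools are plain Python lists of samples. The function name `balanced_mix`, the pool sizes, and the seed are illustrative assumptions. Only the 2,000-per-source figure comes from the description above.

```python
import random


def balanced_mix(patch_samples, baseline_samples, n_per_source, seed=0):
    """Draw n_per_source items from each pool and shuffle them together."""
    rng = random.Random(seed)
    mixed = rng.sample(patch_samples, n_per_source) + rng.sample(
        baseline_samples, n_per_source
    )
    # Shuffle so batches interleave both sources rather than seeing one source first.
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the two pools; real samples would be tokenized, labeled sentences.
grant_pool = [{"source": "grant", "id": i} for i in range(5000)]
baseline_pool = [{"source": "baseline", "id": i} for i in range(5000)]

mix = balanced_mix(grant_pool, baseline_pool, n_per_source=2000)
```

Keeping baseline examples in every batch means the gradient updates for the new grant patterns are continually balanced against the original PII task.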

Evaluation

Performance Metrics

After the primary fine-tuning phase, the model achieved the following overall metrics on a hold-out set of 2,000 samples, mixed in the same proportions as the training data:

Overall F1 Score: 0.9556
Overall Accuracy: 0.9949
Overall Precision: 0.9559
Overall Recall: 0.9553

Entity-Level Breakdown

The model reaches an F1 above 0.99 on structured entities (EMAIL, URL, and ID) and stays above 0.90 on the more ambiguous PER (Persons) class.

Entity Type	Precision	Recall	F1-Score
EMAIL	0.9975	0.9987	0.9981
URL	0.9975	0.9963	0.9969
ID	0.9936	0.9936	0.9936
ORG	0.9793	0.9945	0.9869
LOC	0.9725	0.9809	0.9767
MISC	0.9754	0.9744	0.9749
PHONE	0.9525	0.9454	0.9489
PER	0.9077	0.9030	0.9053
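As a quick consistency check, each F1 value can be recomputed as the harmonic mean of the reported precision and recall. The snippet below is a sketch using only the numbers from the table. Agreement is expected to within about 1e-4 rather than exactly, because the published precision and recall are themselves rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the entity-level table.
REPORTED = {
    "EMAIL": (0.9975, 0.9987, 0.9981),
    "URL": (0.9975, 0.9963, 0.9969),
    "ID": (0.9936, 0.9936, 0.9936),
    "ORG": (0.9793, 0.9945, 0.9869),
    "LOC": (0.9725, 0.9809, 0.9767),
    "MISC": (0.9754, 0.9744, 0.9749),
    "PHONE": (0.9525, 0.9454, 0.9489),
    "PER": (0.9077, 0.9030, 0.9053),
}

# Largest absolute difference between recomputed and reported F1 across entity types.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in REPORTED.values())

for name, (p, r, f1) in REPORTED.items():
    print(f"{name:<6} recomputed={f1_score(p, r):.4f} reported={f1:.4f}")
```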

Out-of-Distribution (OOD) Testing

The model was evaluated on Out-of-Distribution (OOD) data to measure how well it generalizes to unseen text. The test set was the arXiv test split of cometadata/arxiv-author-affiliations, none of which was seen during training. It includes medical, mathematical, and computer science manuscripts.

Qualitative evaluations on these real-world academic papers demonstrated highly accurate detection of author names (PER), academic institutions (ORG), locations (LOC), and email addresses (EMAIL) without any catastrophic forgetting of standard PII formats.

Furthermore, tests on structured text (e.g., standard "Acknowledgments" sections, footnotes, and funding disclosures) indicate that the model reliably detects research grants and awards as ID entities, alongside the associated funding agencies as ORG entities.

Note: The detailed, interactive HTML evaluation reports have been uploaded to this repository for full transparency.

How to Use

The model can be loaded with the Hugging Face transformers pipeline:

from transformers import pipeline

# Load the model ("your-username/..." is a placeholder; substitute this repository's ID)
pii_pipeline = pipeline(
    "token-classification",
    model="your-username/deberta-v3-base-pii-ner",
    aggregation_strategy="simple"
)

# Run inference
text = "This research was funded by the National Science Foundation (Grant No. NSF-1234567). Please contact Dr. Jane Doe at j.doe@university.edu."
results = pii_pipeline(text)

for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Confidence: {entity['score']:.2f}")
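For masking or redaction, the character offsets in the pipeline output can be used to substitute each detected span. The following is a minimal sketch, not part of the model's own tooling. The `redact` helper and its `min_score` threshold are hypothetical, and the entity list is hand-written rather than real model output, so the snippet runs without downloading the model. It relies only on the `start`, `end`, `entity_group`, and `score` keys that the aggregated token-classification pipeline returns.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` should have the shape returned by the aggregated pipeline:
    dicts with `start`, `end`, `entity_group`, and `score`.
    Overlapping spans are not handled in this sketch.
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text


# Hand-written example entities (an assumption, not actual model predictions).
sample = "Contact Dr. Jane Doe at j.doe@university.edu."
name, email = "Jane Doe", "j.doe@university.edu"
fake_entities = [
    {"entity_group": "PER", "score": 0.98,
     "start": sample.index(name), "end": sample.index(name) + len(name)},
    {"entity_group": "EMAIL", "score": 0.99,
     "start": sample.index(email), "end": sample.index(email) + len(email)},
]

redacted = redact(sample, fake_entities)
print(redacted)
```

With a real pipeline, `redact(text, pii_pipeline(text))` would follow the same pattern. The appropriate confidence threshold depends on how costly a missed entity is compared with an over-redaction.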
