EU PII Anonymization Multilingual Detector - v2 (Preview)

Important: This is a preview model. New checkpoints will be released on a rolling basis, so the model's output may change over time.

A multilingual PII detector built for EU compliance reality: 24 official languages, GDPR special categories, AI Act high-risk data - in one model.

Try it

🔗 Live demo in Browser

What's new in v2

v2 continues training from bardsai/eu-pii-anonimization-multilang on additional multilingual data and adds a dedicated DATE class for generic calendar dates - distinct from the existing DATE_OF_BIRTH category. This makes v2 a better fit for redacting timelines in documents, tickets, and chat logs where any date can be identifying in context. Entity coverage grows to 36 classes.

Why this exists

GDPR and the AI Act require you to detect and redact personal data across every language your users write in. Most open-source PII models were trained on English with a few translations bolted on, and they cover the basics - names, emails, phone numbers - while missing exactly the categories regulators care about: biometric data, genetic data, health information, political opinions, ethnic origin.

bardsai/eu-pii-anonimization-multilangv2 is trained end-to-end on real multilingual data (not text translated from English) and covers 36 entity classes mapped to GDPR Article 9 special categories and AI Act high-risk identifiers.

What's different

Native multilingual training. Real text in EU languages. Performance on Polish, German, French, Italian, and Spanish is comparable to the English baseline.
GDPR special categories covered. Health, biometric, genetic, and other Article 9 entities that most OSS PII models skip entirely.
Generic date detection. New DATE class on top of DATE_OF_BIRTH, so date-based identifiers in free text are caught too.

Who this is for

Compliance and privacy engineers at EU companies who need to:

Redact PII from documents, support tickets, emails, and chat logs before storage or analysis
Sanitize datasets before training, sharing, or moving across jurisdictions
Filter inputs to RAG pipelines and search indexes so personal data doesn't leak into prompts or logs
Build audit trails for what was redacted, when, and why
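
For the audit-trail use case, one possible shape for a redaction record is sketched below. Everything in it is an assumption for illustration: the RedactionRecord fields, the make_record helper, the "ticket-001" identifier, and the choice to log only the label and offsets so the log itself does not hold the redacted text. The span is hand-written in the dictionary shape the transformers token-classification pipeline returns when an aggregation strategy is set.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class RedactionRecord:
    document_id: str   # which document was processed
    label: str         # what kind of entity was redacted
    start: int         # character offsets in the original text
    end: int
    score: float       # model confidence for the span
    model: str         # which checkpoint made the call
    redacted_at: str   # when, as an ISO 8601 UTC timestamp


def make_record(document_id, span, model, now=None):
    """Build an audit entry from one detected span, without storing its text."""
    now = now or datetime.now(timezone.utc)
    return RedactionRecord(
        document_id=document_id,
        label=span["entity_group"],
        start=span["start"],
        end=span["end"],
        score=round(float(span["score"]), 4),
        model=model,
        redacted_at=now.isoformat(),
    )


# Hand-written span for illustration; a real one would come from the model.
span = {"entity_group": "DATE_OF_BIRTH", "score": 0.97, "start": 19, "end": 29}
record = make_record("ticket-001", span, "bardsai/eu-pii-anonimization-multilangv2")
print(json.dumps(asdict(record)))
```

Because preview checkpoints change over time, recording the model name or revision with each entry may help explain differences between runs.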

Quick start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/eu-pii-anonimization-multilangv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "John Smith, passport AB123456, phone +48 123 456 789"
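# truncation=True cuts the input at the model's maximum length; split longer texts first.
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Inference only, so no gradients are needed.
with torch.no_grad():
    outputs = model(**inputs)
    # Highest-scoring label id for each token.
    predictions = torch.argmax(outputs.logits, dim=-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print every token tagged as part of an entity ("O" means outside any entity).
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)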

Or with a pipeline and span aggregation:

from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="bardsai/eu-pii-anonimization-multilangv2",
    aggregation_strategy="simple",
)
print(nlp("Jan Kowalski, PESEL 80010112345, born 1980-01-01, Warszawa."))
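
With an aggregation strategy set, the pipeline returns a list of span dictionaries carrying entity_group, score, word, start, and end. A minimal sketch of turning such spans into redacted text follows. The redact helper and the bracketed placeholder format are illustrative assumptions, not part of the model or of transformers, and the sample spans are hand-written (the NAME label is hypothetical) rather than real model output.

```python
def redact(text, spans):
    """Replace each detected span with a [LABEL] placeholder.

    `spans` is a list of dicts with `start`, `end`, and `entity_group` keys.
    Overlapping spans are not handled in this sketch.
    """
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[: span["start"]] + f"[{span['entity_group']}]" + out[span["end"]:]
    return out


# Hand-written example spans; "NAME" is a hypothetical label.
text = "Jan Kowalski, born 1980-01-01."
spans = [
    {"entity_group": "NAME", "start": 0, "end": 12},
    {"entity_group": "DATE_OF_BIRTH", "start": 19, "end": 29},
]
redacted = redact(text, spans)
print(redacted)  # [NAME], born [DATE_OF_BIRTH].
```

In practice the spans would come from the pipeline call, for example redact(text, nlp(text)), using the label names defined in the model's config.json.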

Entity coverage

36 classes across eight families, with B-/I- BIO labeling:

Personal identity - names, dates, dates of birth, national ID numbers
Contact and location - addresses, emails, phone numbers, geolocation
Official documents - passports, driver's licenses, tax IDs
Financial - IBAN, credit card, account numbers
Technical identifiers - IP addresses, MAC addresses, device IDs, usernames
Organization data - employer, institutional affiliations
Health, biometric, genetic (GDPR Art. 9) - medical conditions, biometric identifiers, genetic data
Special-category (GDPR Art. 9) - racial/ethnic origin, political opinions, religious beliefs, sexual orientation, trade union membership

Full label list in config.json (id2label / label2id).
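
To list the entity classes without reading config.json by hand, the B-/I- prefixes can be stripped from the label mapping. A small sketch, assuming the labels follow the B-X / I-X / O pattern described above; entity_classes is a hypothetical helper, and sample_id2label is a made-up stand-in for model.config.id2label.

```python
def entity_classes(id2label):
    """Return the sorted class names behind BIO-prefixed labels."""
    classes = set()
    for label in id2label.values():
        if label == "O":
            continue  # "outside" tag, not an entity class
        # Drop a leading "B-" or "I-"; keep labels without such a prefix unchanged.
        prefix, sep, name = label.partition("-")
        classes.add(name if sep and prefix in ("B", "I") else label)
    return sorted(classes)


# Stand-in mapping for illustration; the real one is model.config.id2label.
sample_id2label = {
    0: "O",
    1: "B-DATE",
    2: "I-DATE",
    3: "B-DATE_OF_BIRTH",
    4: "I-DATE_OF_BIRTH",
}
classes = entity_classes(sample_id2label)
print(classes)  # ['DATE', 'DATE_OF_BIRTH']
```

If the real mapping follows this pattern, len(entity_classes(model.config.id2label)) should match the class count stated above.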

Limitations

This is a model, not a compliance program. A few things to keep in mind:

Performance varies by language, domain, and input quality. OCR noise, code-switching, and unusual formatting will degrade recall.
Ambiguous mentions (a name that's also a place, an ID-shaped number that isn't an ID) need post-processing rules or human review.
Detection ≠ legal sufficiency. Use this to support a redaction workflow, not to replace your DPO's judgment.
Threshold tuning matters. The right operating point depends on whether you're optimizing for recall (compliance) or precision (data utility).
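
One way to choose an operating point is to filter aggregated spans by their score, with a stricter or looser cutoff per class. The sketch below rests on assumptions: the filter_spans helper and the threshold values are illustrative, not settings recommended by the model authors, and the spans and scores are invented.

```python
def filter_spans(spans, default_threshold=0.5, per_class=None):
    """Keep spans whose score meets the threshold for their class."""
    per_class = per_class or {}
    kept = []
    for span in spans:
        threshold = per_class.get(span["entity_group"], default_threshold)
        if span["score"] >= threshold:
            kept.append(span)
    return kept


# Invented spans and scores, for illustration only.
spans = [
    {"entity_group": "DATE", "score": 0.62, "start": 5, "end": 15},
    {"entity_group": "DATE_OF_BIRTH", "score": 0.35, "start": 20, "end": 30},
    {"entity_group": "DATE", "score": 0.41, "start": 40, "end": 50},
]

# Recall-oriented: a low cutoff keeps uncertain detections for review.
recall_first = filter_spans(spans, default_threshold=0.3)
# Precision-oriented: a higher default, with a lower bar for one sensitive class.
precision_first = filter_spans(
    spans, default_threshold=0.6, per_class={"DATE_OF_BIRTH": 0.3}
)
print(len(recall_first), len(precision_first))  # 3 2
```

Suitable cutoffs are best chosen on a labeled sample from your own domain and languages, since scores may shift between preview checkpoints.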

Files

model.safetensors - model weights
config.json - config and label mapping
tokenizer.json, tokenizer_config.json - tokenizer assets
special_tokens_map.json - special token mapping

Citation

@misc{bards.ai_2026,
    author       = { bards.ai and Karol Samorański and Michał Swędrowski and Michał Pogoda-Rosikoń },
    title        = { eu-pii-anonimization-multilang-v2-preview (Revision bea7d79) },
    year         = 2026,
    url          = { https://huggingface.co/bardsai/eu-pii-anonimization-multilang-v2-preview },
    doi          = { 10.57967/hf/9113 },
    publisher    = { Hugging Face }
}

About bards.ai

We build product ML for teams shipping AI to real users - RAG, agents, fine-tuned models, evals, and the unglamorous infrastructure that keeps them working. 16+ open models on Hugging Face, 10+ publications, production deployments at Comcast, Chili Piper, and Surfer SEO.

bards.ai