EU PII Anonimization Multilingual Detector - v2 (Preview)

Important: This is a preview model. We will roll out new checkpoints over time, so the model's output may change between releases.

A multilingual PII detector built for the realities of EU compliance: 24 official languages, GDPR special categories, and AI Act high-risk data, all in one model.


Try it

🔗 Live demo in Browser

What's new in v2

v2 continues training from bardsai/eu-pii-anonimization-multilang on additional multilingual data and adds a dedicated DATE class for generic calendar dates - distinct from the existing DATE_OF_BIRTH category. This makes v2 a better fit for redacting timelines in documents, tickets, and chat logs where any date can be identifying in context. Entity coverage grows to 36 classes.

Why this exists

GDPR and the AI Act require you to detect and redact personal data across every language your users write in. Most open-source PII models were trained on English with a few translations bolted on, and they cover the basics - names, emails, phone numbers - while missing exactly the categories regulators care about: biometric data, genetic data, health information, political opinions, ethnic origin.

bardsai/eu-pii-anonimization-multilangv2 is trained end-to-end on real multilingual data (not text translated from English) and covers 36 entity classes mapped to GDPR Article 9 special categories and AI Act high-risk identifiers.

What's different

  • Native multilingual training. Real text in EU languages. Performance on Polish, German, French, Italian, and Spanish is comparable to the English baseline.
  • GDPR special categories covered. Health, biometric, genetic, and other Article 9 entities that most OSS PII models skip entirely.
  • Generic date detection. New DATE class on top of DATE_OF_BIRTH, so date-based identifiers in free text are caught too.

Who this is for

Compliance and privacy engineers at EU companies who need to:

  • Redact PII from documents, support tickets, emails, and chat logs before storage or analysis
  • Sanitize datasets before training, sharing, or moving across jurisdictions
  • Filter inputs to RAG pipelines and search indexes so personal data doesn't leak into prompts or logs
  • Build audit trails for what was redacted, when, and why
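
As a rough illustration of the last point, an audit entry can record what was redacted, when, and why without keeping the raw value. The sketch below is only one possible shape: the `audit_records` helper, the field names, the `EMAIL` label, and the HMAC key are all hypothetical, and the entity dicts are assumed to carry `entity_group`, `start`, and `end` keys like the aggregated output of the `transformers` token-classification pipeline.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Placeholder key for illustration; a real deployment would load it from a secret store.
AUDIT_KEY = b"replace-me"


def audit_records(text, entities, reason):
    """Build one audit entry per redacted span without storing the raw value."""
    now = datetime.now(timezone.utc).isoformat()
    records = []
    for ent in entities:
        value = text[ent["start"]:ent["end"]]
        # A keyed digest lets you check later whether a value was redacted,
        # while making short identifiers harder to brute-force than a plain hash.
        digest = hmac.new(AUDIT_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
        records.append({
            "label": ent["entity_group"],  # what was redacted
            "start": ent["start"],
            "end": ent["end"],
            "digest": digest,
            "redacted_at": now,            # when
            "reason": reason,              # why
        })
    return records


# Hand-written detection, not real model output.
ticket = "Contact: anna@example.com"
found = [{"entity_group": "EMAIL", "start": 9, "end": 25}]
records = audit_records(ticket, found, reason="pre-storage redaction")
print(records[0]["label"], records[0]["redacted_at"])
```

Whether a digest of the original value may be retained at all is a question for your own data-protection review; dropping the `digest` field leaves the rest of the record intact.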

Quick start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/eu-pii-anonimization-multilangv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "John Smith, passport AB123456, phone +48 123 456 789"
# truncation=True cuts off input beyond the model's maximum length
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**inputs)
    # Highest-scoring label id for each token
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map ids back to subword tokens and their BIO labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print only tokens tagged as part of an entity
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)

Or with a pipeline and span aggregation:

from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="bardsai/eu-pii-anonimization-multilangv2",
    aggregation_strategy="simple",
)
print(nlp("Jan Kowalski, PESEL 80010112345, born 1980-01-01, Warszawa."))
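
The aggregated output can then be used to redact the text by replacing each span with a placeholder. The following is a minimal sketch, assuming each entity is a dict with `entity_group`, `start`, and `end` keys, which is the shape the `transformers` pipeline returns when an aggregation strategy is set. The `redact` helper, the `NAME` label, and the sample entities are illustrative and were written by hand, not produced by the model.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written sample in the pipeline's output shape (labels are assumptions).
sample_text = "Jan Kowalski, born 1980-01-01, Warszawa."
sample_entities = [
    {"entity_group": "NAME", "start": 0, "end": 12},
    {"entity_group": "DATE_OF_BIRTH", "start": 19, "end": 29},
]

redacted = redact(sample_text, sample_entities)
print(redacted)  # [NAME], born [DATE_OF_BIRTH], Warszawa.
```

This sketch assumes spans do not overlap; overlapping detections would need to be merged first.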

Entity coverage

36 classes across eight families, with B-/I- BIO labeling:

  • Personal identity - names, dates, dates of birth, national ID numbers
  • Contact and location - addresses, emails, phone numbers, geolocation
  • Official documents - passports, driver's licenses, tax IDs
  • Financial - IBAN, credit card, account numbers
  • Technical identifiers - IP addresses, MAC addresses, device IDs, usernames
  • Organization data - employer, institutional affiliations
  • Health, biometric, genetic (GDPR Art. 9) - medical conditions, biometric identifiers, genetic data
  • Special-category (GDPR Art. 9) - racial/ethnic origin, political opinions, religious beliefs, sexual orientation, trade union membership

Full label list in config.json (id2label / label2id).
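
Because the labels carry BIO prefixes, the entity classes can be recovered from `id2label` by stripping `B-` and `I-`. A small sketch follows, using a hand-written subset in place of the real mapping; with the model loaded, `model.config.id2label` would presumably take its place. The `entity_classes` helper is hypothetical.

```python
def entity_classes(id2label):
    """Return the sorted entity classes behind a BIO label mapping."""
    classes = set()
    for label in id2label.values():
        if label == "O":
            continue  # "O" marks tokens outside any entity
        prefix, sep, name = label.partition("-")
        # "B-DATE" and "I-DATE" both belong to the class "DATE".
        classes.add(name if sep and prefix in {"B", "I"} else label)
    return sorted(classes)


# Illustrative subset only; the real mapping lives in config.json.
sample_id2label = {
    0: "O",
    1: "B-DATE",
    2: "I-DATE",
    3: "B-DATE_OF_BIRTH",
    4: "I-DATE_OF_BIRTH",
}
print(entity_classes(sample_id2label))  # ['DATE', 'DATE_OF_BIRTH']
```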

Limitations

This is a model, not a compliance program. A few things to keep in mind:

  • Performance varies by language, domain, and input quality. OCR noise, code-switching, and unusual formatting will degrade recall.
  • Ambiguous mentions (a name that's also a place, an ID-shaped number that isn't an ID) need post-processing rules or human review.
  • Detection ≠ legal sufficiency. Use this to support a redaction workflow, not to replace your DPO's judgment.
  • Threshold tuning matters. The right operating point depends on whether you're optimizing for recall (compliance) or precision (data utility).
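
One way to act on the threshold point is to apply a separate score cutoff per entity class after detection. The sketch below assumes entities shaped like the aggregated pipeline output, with `entity_group` and `score` keys. The `filter_by_threshold` helper, the `EMAIL` label, the scores, and the cutoff values are all made up to show the mechanics, not recommended settings.

```python
def filter_by_threshold(entities, thresholds, default=0.5):
    """Keep entities whose score meets the cutoff for their class."""
    kept = []
    for ent in entities:
        cutoff = thresholds.get(ent["entity_group"], default)
        if ent["score"] >= cutoff:
            kept.append(ent)
    return kept


# Hypothetical detections and cutoffs.
detections = [
    {"entity_group": "EMAIL", "score": 0.98, "start": 0, "end": 16},
    {"entity_group": "DATE", "score": 0.41, "start": 20, "end": 30},
    {"entity_group": "DATE_OF_BIRTH", "score": 0.35, "start": 40, "end": 50},
]
# Recall-oriented for a sensitive class, stricter for generic dates.
thresholds = {"DATE_OF_BIRTH": 0.30, "DATE": 0.60}

kept = filter_by_threshold(detections, thresholds)
print([e["entity_group"] for e in kept])  # ['EMAIL', 'DATE_OF_BIRTH']
```

Lower cutoffs favor recall at the cost of over-redaction; suitable values would have to be measured on your own labeled data.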

Files

  • model.safetensors - model weights
  • config.json - config and label mapping
  • tokenizer.json, tokenizer_config.json - tokenizer assets
  • special_tokens_map.json - special token mapping

Citation

@misc{bards.ai_2026,
    author       = { bards.ai and Karol Samorański and Michał Swędrowski and Michał Pogoda-Rosikoń },
    title        = { eu-pii-anonimization-multilang-v2-preview (Revision bea7d79) },
    year         = 2026,
    url          = { https://huggingface.co/bardsai/eu-pii-anonimization-multilang-v2-preview },
    doi          = { 10.57967/hf/9113 },
    publisher    = { Hugging Face }
}

About bards.ai

We build product ML for teams shipping AI to real users - RAG, agents, fine-tuned models, evals, and the unglamorous infrastructure that keeps them working. 16+ open models on Hugging Face, 10+ publications, production deployments at Comcast, Chili Piper, and Surfer SEO.

bards.ai

Model size: 0.3B params (Safetensors, tensor type F32)