HerBERT Polish Legal NER — PII / Anonymization

A Polish token-classification (NER) model for detecting personally identifiable information (PII) in legal and administrative text, fine-tuned from allegro/herbert-base-cased.

This is the general-purpose variant, with the best overall accuracy of the two variants on clean, digital Polish legal documents.

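For scanned / photographed (OCR'd) documents, use the OCR-robust sibling lexedit/herbert-polish-legal-ner-ocr, which cuts the identity-level person-name leak rate on scanned text by roughly 35–41% in relative terms (31.8% → 20.7% at the threshold stage, 25.7% → 15.1% with the post-pass), at a small precision cost on clean text.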

Intended use is defensive: flagging PII so it can be masked / anonymised before a document is shared or processed. It is not a guarantee of complete anonymisation — see Limitations.

▶ Try it in your browser: lexedit.ai/lexedit-privacy — an interactive, fully client-side Polish legal-document anonymisation demo (the text never leaves your browser).

📂 Example cases: examples/STRONG_CASES.md (hard inputs it handles) · examples/KNOWN_LIMITATIONS.md (where it still fails).

Labels (29, BIO scheme)

PER (person) · ORG (organisation) · LOC (private address/location) · LOC_PUB (public place: city, country) · DATE · MONEY · EMAIL · PHONE · ID (national id / case / document number) · IBAN · DIAGNOSIS · HEALTH_FACILITY · MEDICAL_ID · WATERMARK, each as B-… / I-…, plus O.

The model distinguishes private locations (LOC, masked) from public ones (LOC_PUB, usually kept). DATE and MONEY are labelled but left unmasked by default.

Usage (ONNX, no PyTorch required)

pip install onnxruntime transformers numpy
python examples/inference_onnx.py

import numpy as np
import onnxruntime as ort
from transformers import AutoConfig, AutoTokenizer

repo = "lexedit/herbert-polish-legal-ner"
tok = AutoTokenizer.from_pretrained(repo)
id2label = AutoConfig.from_pretrained(repo).id2label
# The path is relative to a local copy of the model repository.
sess = ort.InferenceSession("onnx/model_quantized.onnx")

enc = tok("Pozwany Jan Kowalski, PESEL 02070803628.",
          return_offsets_mapping=True, return_tensors="np")
feeds = {"input_ids": enc["input_ids"].astype(np.int64),
         "attention_mask": enc["attention_mask"].astype(np.int64)}
logits = sess.run(None, feeds)[0]          # shape (1, seq_len, 29)

# Arg-max label per token; next, group B-/I- runs into spans using enc["offset_mapping"].
labels = [id2label[int(i)] for i in logits[0].argmax(-1)]
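The decoding step that follows the ONNX call (arg-max per token, label lookup, grouping B-/I- runs by character offsets) could look roughly like the sketch below. It is not the project's own decoder: `decode_bio`, the five-label `demo_id2label` subset and the hand-written offsets are illustrative assumptions, and a stray I- tag is treated leniently as the start of a new span.

```python
import numpy as np

def decode_bio(logits, offsets, id2label, text):
    """Turn per-token logits into (label, start, end, surface) spans."""
    pred_ids = np.asarray(logits).argmax(axis=-1)
    spans, current = [], None  # current = [label, start, end] being built
    for pred, (start, end) in zip(pred_ids, offsets):
        label = id2label[int(pred)]
        if start == end or label == "O":
            # Special tokens have empty offsets; both cases close any open span.
            if current:
                spans.append(current)
            current = None
            continue
        prefix, entity = label.split("-", 1)
        if prefix == "I" and current and current[0] == entity:
            current[2] = int(end)  # continue the open span
        else:
            if current:
                spans.append(current)
            current = [entity, int(start), int(end)]  # B- tag, or a stray I- tag
    if current:
        spans.append(current)
    return [(lab, s, e, text[s:e]) for lab, s, e in spans]

# Hypothetical 5-label subset and hand-written offsets, for illustration only.
demo_id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ID", 4: "I-ID"}
demo_text = "Pozwany Jan Kowalski, PESEL 02070803628."
demo_offsets = [(0, 0), (0, 7), (8, 11), (12, 20), (20, 21),
                (22, 27), (28, 39), (39, 40), (0, 0)]
demo_logits = np.eye(5)[[0, 0, 1, 2, 0, 0, 3, 0, 0]]  # one-hot stand-in for logits
demo_spans = decode_bio(demo_logits, demo_offsets, demo_id2label, demo_text)
print(demo_spans)
```

With the real model, the inputs would be `logits[0]`, `enc["offset_mapping"][0]` and the `id2label` mapping from the model config.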

Usage (PyTorch / transformers pipeline)

from transformers import pipeline
ner = pipeline("token-classification",
               model="lexedit/herbert-polish-legal-ner",
               aggregation_strategy="first")
ner("Pozwany Jan Kowalski, zam. ul. Słoneczna 5 w Krakowie.")
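With an aggregation strategy set, the pipeline returns one dict per entity with `entity_group`, `start` and `end` keys, which is enough to mask the text. The sketch below is one possible masking step, not the demo's actual implementation: `mask_entities`, the `[LABEL]` placeholder format and the hand-written `demo_entities` are assumptions, and the default keep-set follows the label notes above (LOC_PUB, DATE and MONEY stay visible).

```python
# Labels left unmasked by default (an assumption based on the label notes).
KEEP_BY_DEFAULT = frozenset({"LOC_PUB", "DATE", "MONEY"})

def mask_entities(text, entities, keep=KEEP_BY_DEFAULT):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` follows the aggregated pipeline output: dicts with
    "entity_group", "start" and "end" (character offsets into `text`).
    """
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in keep:
            continue
        out = out[:ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out

demo_text = "Pozwany Jan Kowalski, zam. ul. Słoneczna 5 w Krakowie."

def demo_entity(label, surface):
    """Build a pipeline-style dict by locating `surface` in the demo text."""
    start = demo_text.index(surface)
    return {"entity_group": label, "start": start, "end": start + len(surface)}

# Hand-written stand-in for the pipeline result, not real model output.
demo_entities = [demo_entity("PER", "Jan Kowalski"),
                 demo_entity("LOC", "ul. Słoneczna 5"),
                 demo_entity("LOC_PUB", "Krakowie")]
masked = mask_entities(demo_text, demo_entities)
print(masked)
```

In real use, `mask_entities(text, ner(text))` would take the pipeline output directly; passing `keep=frozenset()` masks every detected label.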

Runs in the browser (the intended setup)

This model is designed to run entirely client-side. The quantized ONNX (~125 MB, int8) loads in the browser via onnxruntime-web or transformers.js (WASM), so the document never leaves the user's device — which is the whole point for sensitive legal / medical text. It also runs anywhere ONNX Runtime does (Python, Node, server, mobile). The demo at lexedit.ai/lexedit-privacy is exactly this.

Speed (rough)

Quantized ONNX, measured on a laptop CPU (Apple Silicon):

input	1 core	all cores
short sentence (~30 tokens)	~18 ms	~13 ms
full chunk (~500 tokens)	~0.2 s	~0.1 s

That is ≈ 4–5 chunks/second, single-threaded, running natively. In the browser (single-threaded WASM) it is slower but still practical: short text stays interactive, a 1–2 page document takes a few seconds, and a large scanned document can take a minute or two. Enabling cross-origin isolation (COOP/COEP headers), which unlocks multi-threaded WASM, speeds this up.
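The timings above are per chunk, so a long document has to be split before inference. The source does not say how chunks are formed; the sketch below assumes a plain sliding window over token ids, where `chunk_token_ids`, the 500-token window and the 50-token overlap are illustrative choices rather than the project's settings. It also ignores the special tokens that a tokenizer adds to each window.

```python
def chunk_token_ids(token_ids, max_len=500, overlap=50):
    """Split a token-id sequence into overlapping windows.

    Returns (start_index, window) pairs; the overlap gives entities that
    straddle a window boundary a chance to be seen whole in one window.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append((start, token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks

demo_ids = list(range(1200))            # stand-in for real token ids
demo_chunks = chunk_token_ids(demo_ids)
# Rough native single-core estimate using the ~0.2 s per full chunk figure.
estimated_seconds = round(len(demo_chunks) * 0.2, 1)
print(len(demo_chunks), estimated_seconds)
```

Predictions from overlapping regions still need to be reconciled afterwards, for example by keeping the span from the window where it sits furthest from an edge.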

Recommended production setup

This model is recall-first and is one layer of a pipeline, not the whole solution. For best anonymisation, pair it with:

A recall-first threshold for PER (flip a token to PER when the summed PER probability ≥ ~0.2, even if it is not the arg-max).
A deterministic document-safety post-pass — snap spans to whole words, merge hyphenated surnames, and propagate a detected surname to its other inflected / OCR-variant mentions across the document.
Checksum-validated regex for structured PII (PESEL, NIP, REGON, IBAN, …).
Human review for high-stakes use.
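The recall-first PER threshold from the first item can be sketched as follows. This reads "summed PER probability" as P(B-PER) + P(I-PER) after a softmax, which is an interpretation rather than something the source spells out; `recall_first_per` and the rule for choosing between B-PER and I-PER on a flipped token are likewise assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def recall_first_per(logits, id2label, threshold=0.2):
    """Arg-max labels, except that a token becomes PER whenever its summed
    PER probability reaches `threshold`, even if PER is not the arg-max."""
    label2id = {label: idx for idx, label in id2label.items()}
    b_per, i_per = label2id["B-PER"], label2id["I-PER"]
    probs = softmax(np.asarray(logits, dtype=np.float64))
    pred = probs.argmax(axis=-1)
    per_mass = probs[:, b_per] + probs[:, i_per]
    already_per = (pred == b_per) | (pred == i_per)
    flip = (per_mass >= threshold) & ~already_per
    # Assumption: a flipped token takes whichever PER tag is more probable.
    pred[flip] = np.where(probs[flip, b_per] >= probs[flip, i_per], b_per, i_per)
    return [id2label[int(i)] for i in pred]

# Hypothetical 3-label subset; the logits are logs of hand-picked probabilities.
demo_id2label = {0: "O", 1: "B-PER", 2: "I-PER"}
demo_probs = np.array([[0.90, 0.05, 0.05],   # PER mass 0.10 -> stays O
                       [0.70, 0.25, 0.05],   # PER mass 0.30 -> flipped to PER
                       [0.10, 0.20, 0.70]])  # already PER by arg-max
demo_labels = recall_first_per(np.log(demo_probs), demo_id2label)
print(demo_labels)
```

Lowering the threshold trades precision for recall on person names, which is why the deterministic post-pass and human review still follow this step.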

Evaluation

Identity-level leak rate: the share of persons with at least one missed mention (a person counts as "leaked" if any mention of them is missed). Measured on an internal set of 50 real Polish legal documents (179 persons) with the recall-first threshold of 0.2.
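To make the metric concrete, here is a minimal sketch of how an identity-level leak rate can be computed from per-mention detection results. The function name, the input layout and the toy data are hypothetical; this is not the evaluation script behind the numbers reported here.

```python
def identity_leak_rate(persons):
    """Share of persons with at least one missed mention.

    `persons` maps a person id to a list of booleans, one per mention
    of that person in the document set (True = mention was detected).
    """
    if not persons:
        raise ValueError("need at least one person")
    leaked = sum(1 for mentions in persons.values() if not all(mentions))
    return leaked / len(persons)

# Toy data: four persons, six mentions in total.
demo_persons = {
    "person_a": [True, True],
    "person_b": [True, False],   # one missed mention leaks the identity
    "person_c": [True],
    "person_d": [False],
}
demo_rate = identity_leak_rate(demo_persons)
missed = sum(not hit for mentions in demo_persons.values() for hit in mentions)
total = sum(len(mentions) for mentions in demo_persons.values())
print(demo_rate, missed / total)
```

The toy data shows why this metric is stricter than a mention-level miss rate: two of six mentions are missed, yet half of the persons are leaked. For scale, a 10.1% rate over 179 persons corresponds to about 18 leaked identities.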

Document type	metric	this model	OCR-robust variant
Clean / digital	leak (threshold)	10.1%	11.7%
Clean / digital	leak (+ post-pass)	7.3%	7.3%
Scanned / OCR'd	leak (threshold)	31.8%	20.7%
Scanned / OCR'd	leak (+ post-pass)	25.7%	15.1%

Token-level test F1 ≈ 0.94.

Takeaway: this is the strongest variant on clean digital text (lowest leak at the threshold stage, tied after the post-pass, and the best precision). On scanned / OCR'd text it is weaker, and the OCR-robust variant wins there. If you process both, route by document type.

Training data

Fine-tuned on Polish legal and administrative documents: a mix of document templates, programmatically generated labelled examples (synthetic identifiers with valid checksums, rule-based Polish name declension), and real-world legal-document samples.

No raw personal data is distributed with this model. Because this is a token-classification model (it outputs a label per input token and cannot generate text), the weights do not reproduce or expose any training document.

Limitations

Concrete, reproducible failure cases: see examples/KNOWN_LIMITATIONS.md (synthetic inputs). Heavily OCR-garbled names are the main weakness — for scanned documents prefer the OCR-robust variant.

Polish only.
Not a guarantee. A residual leak rate remains (≈7–10 % identity-level on clean text); always combine with the deterministic post-pass + checksum regex and human review for high-stakes use.
Scanned / OCR'd text is the weak spot of this variant (heavily garbled names can be missed) — route those to the OCR-robust variant.
Small evaluation set; numbers are indicative, not a benchmark.
Not legal advice; not a substitute for a privacy/compliance review.
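The checksum-validated regex layer recommended above can be illustrated for PESEL, whose eleventh digit is a check digit computed from the first ten with the weights 1, 3, 7, 9, 1, 3, 7, 9, 1, 3. The sketch below is a minimal example, not the project's own rule set: `find_pesels` and `pesel_checksum_ok` are hypothetical names, and the encoded birth date is not validated.

```python
import re

# Exactly 11 ASCII digits, not embedded in a longer digit run.
PESEL_RE = re.compile(r"(?<![0-9])[0-9]{11}(?![0-9])")
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

def pesel_checksum_ok(candidate):
    """True if `candidate` is 11 digits whose last digit matches the checksum."""
    if not re.fullmatch(r"[0-9]{11}", candidate):
        return False
    total = sum(w * int(d) for w, d in zip(PESEL_WEIGHTS, candidate))
    return (10 - total % 10) % 10 == int(candidate[10])

def find_pesels(text):
    """Return (start, end, value) for every checksum-valid PESEL in `text`."""
    return [(m.start(), m.end(), m.group())
            for m in PESEL_RE.finditer(text)
            if pesel_checksum_ok(m.group())]

demo_hits = find_pesels("Pozwany Jan Kowalski, PESEL 02070803628.")
print(demo_hits)
```

The checksum filters out most random 11-digit strings, so this layer can run alongside the model and catch structured identifiers that the model misses. NIP, REGON and IBAN use different weights and lengths and would need their own validators.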

License — CC BY 4.0 (attribution required, commercial use allowed)

Released under Creative Commons Attribution 4.0. You may use, modify and redistribute this model — including in commercial products — provided you give appropriate credit. Attribution is required in any use, commercial or not; no other restrictions are added.

Suggested attribution:

Polish legal NER / anonymisation model by lexedit (https://lexedit.ai), licensed CC BY 4.0, fine-tuned from HerBERT (allegro/herbert-base-cased).

This model is a derivative of HerBERT (Allegro), which is itself CC BY 4.0 — please retain attribution to the base model as well.
