HerBERT Polish Legal NER — PII / Anonymization (OCR-robust)

A Polish token-classification (NER) model for detecting personally identifiable information (PII) in legal and administrative text, fine-tuned from allegro/herbert-base-cased.

This is the OCR-robust variant: it adds a render → OCR augmentation step so the model keeps recognising person names even when they are corrupted by optical character recognition (scanned / photographed documents — missing Polish diacritics, confusable characters, fragmented surnames).

For clean digital documents, the general-purpose sibling lexedit/herbert-polish-legal-ner is slightly more precise. Use this OCR-robust variant for scanned / photographed text.

Intended use is defensive: flagging PII so it can be masked / anonymised before a document is shared or processed. It is not a guarantee of complete anonymisation — see Limitations.

▶ Try it in your browser: lexedit.ai/lexedit-privacy — an interactive, fully client-side Polish legal-document anonymisation demo (same anonymiser family; the text never leaves your browser).

📂 Example cases: examples/STRONG_CASES.md (hard inputs it handles — OCR garble, declension, ALL-CAPS) · examples/KNOWN_LIMITATIONS.md (where it still fails).

Labels (29, BIO scheme)

PER (person) · ORG (organisation) · LOC (private address/location) · LOC_PUB (public place: city, country) · DATE · MONEY · EMAIL · PHONE · ID (national id / case / document number) · IBAN · DIAGNOSIS · HEALTH_FACILITY · MEDICAL_ID · WATERMARK, each as B-… / I-…, plus O.

The model distinguishes private locations (LOC, masked) from public ones (LOC_PUB, usually kept), and treats DATE / MONEY as non-anonymised by default.

Usage (ONNX, no PyTorch required)

pip install onnxruntime transformers numpy
python examples/inference_onnx.py

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("lexedit/herbert-polish-legal-ner-ocr")
# Relative path: assumes the working directory is a local copy of the model repo.
sess = ort.InferenceSession("onnx/model_quantized.onnx")
enc = tok("Pozwany Jan Kowalski, PESEL 02070803628.",
          return_offsets_mapping=True, return_tensors="np")
# Feed only the model inputs; offset_mapping stays on the Python side.
feeds = {"input_ids": enc["input_ids"].astype(np.int64),
         "attention_mask": enc["attention_mask"].astype(np.int64)}
logits = sess.run(None, feeds)[0]          # shape (1, seq_len, 29)
label_ids = logits[0].argmax(axis=-1)      # one label id per token
# Map ids via config.id2label, then group B-/I- runs using enc["offset_mapping"].
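
The grouping step named in the last comment can be sketched as follows. This is a minimal illustration, not the code in examples/inference_onnx.py: the helper name `decode_bio` is hypothetical, and it assumes that special tokens carry an empty `(0, 0)` offset, as fast tokenizers normally report them.

```python
def decode_bio(text, label_ids, offsets, id2label):
    """Group per-token BIO labels into (label, start, end, surface) spans."""
    spans, current = [], None          # current = [kind, start, end]
    for lab_id, (start, end) in zip(label_ids, offsets):
        label = id2label[int(lab_id)]
        if start == end or label == "O":
            # Special token or "outside": close any open span.
            if current:
                spans.append(current)
                current = None
            continue
        prefix, kind = label.split("-", 1)
        if prefix == "I" and current and current[0] == kind:
            current[2] = int(end)      # extend the open span
        else:
            # B- tag, or an I- tag with no matching open span: start a new one.
            if current:
                spans.append(current)
            current = [kind, int(start), int(end)]
    if current:
        spans.append(current)
    return [(k, s, e, text[s:e]) for k, s, e in spans]

# With the objects from the snippet above (assumed names):
#   label_ids = logits[0].argmax(axis=-1)
#   offsets   = enc["offset_mapping"][0]
#   id2label  = AutoConfig.from_pretrained(<model id>).id2label
```

An orphan `I-` tag is treated as the start of a new span here, which favours recall over strict BIO validity.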

Usage (PyTorch / transformers pipeline)

from transformers import pipeline
ner = pipeline("token-classification",
               model="lexedit/herbert-polish-legal-ner-ocr",
               aggregation_strategy="first")
ner("Pozwany Jan Kowalski, zam. ul. Słoneczna 5 w Krakowie.")
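
To turn the pipeline output into masked text, something like the sketch below may be enough. It relies on the `entity_group`, `start` and `end` keys that the transformers pipeline returns when an aggregation strategy is set. The helper name `mask_entities`, the `[LABEL]` placeholder format and the exact keep-set are assumptions chosen to mirror the defaults described under Labels (LOC_PUB, DATE and MONEY are left in place).

```python
KEEP_BY_DEFAULT = frozenset({"LOC_PUB", "DATE", "MONEY"})

def mask_entities(text, entities, keep=KEEP_BY_DEFAULT):
    """Replace each detected span with a [LABEL] placeholder, left to right."""
    out, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Skip labels that are kept by default and spans overlapping a masked one.
        if ent["entity_group"] in keep or ent["start"] < cursor:
            continue
        out.append(text[cursor:ent["start"]])
        out.append(f"[{ent['entity_group']}]")
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)

# Usage with the pipeline above:
#   text = "Pozwany Jan Kowalski, zam. ul. Słoneczna 5 w Krakowie."
#   print(mask_entities(text, ner(text)))
```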

Runs in the browser (the intended setup)

This model is designed to run entirely client-side. The quantized ONNX (~125 MB, int8) loads in the browser via onnxruntime-web or transformers.js (WASM), so the document never leaves the user's device — which is the whole point for sensitive legal / medical text. It also runs anywhere ONNX Runtime does (Python, Node, server, mobile). The interactive demo at lexedit.ai/lexedit-privacy is exactly this: in-browser, nothing uploaded.

Speed (rough)

Quantized ONNX, measured on a laptop CPU (Apple Silicon):

| Input | 1 core | All cores |
|---|---|---|
| Short sentence (~30 tokens) | ~18 ms | ~13 ms |
| Full chunk (~500 tokens) | ~0.2 s | ~0.1 s |

Natively, that is about 4–5 full chunks per second on a single thread. In the browser (WASM, single-threaded) it is slower but still practical: short text stays interactive (well under a second), a 1–2 page document takes a few seconds, and a large scanned document (tens of thousands of characters) can take a minute or two. Enabling cross-origin isolation (COOP/COEP → SharedArrayBuffer → multi-threaded WASM) speeds the browser path up considerably.
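
For local testing, cross-origin isolation can be switched on by serving the page with the two response headers below. This is a sketch using only Python's standard library; the handler name and port are illustrative, and a real deployment would set the same headers in its own web server or CDN configuration. Under `require-corp`, any cross-origin subresource must itself opt in via CORS or a `Cross-Origin-Resource-Policy` header.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Header pair that makes a page cross-origin isolated, which in turn
# makes SharedArrayBuffer (and so multi-threaded WASM) available.
COI_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
}

class IsolatedHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the isolation headers to every response."""

    def end_headers(self):
        for name, value in COI_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()

def serve(port=8000):
    """Serve the current directory on localhost (blocks until interrupted)."""
    ThreadingHTTPServer(("127.0.0.1", port), IsolatedHandler).serve_forever()
```

Call `serve()` from the directory that holds the page, then check `self.crossOriginIsolated` in the browser console to confirm the headers took effect.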

Recommended production setup

This model is recall-first and is meant to be one layer of a pipeline, not the whole solution. For best anonymisation results, pair it with:

  1. A recall-first threshold for PER (flip a token to PER when the summed PER probability ≥ ~0.2, even if it is not the arg-max) — anonymisation wants recall ≫ precision.
  2. A deterministic document-safety post-pass — snap predicted spans to whole words, merge hyphenated surnames, and propagate a detected surname to its other inflected / OCR-variant mentions across the document.
  3. Checksum-validated regular expressions for structured PII (PESEL, NIP, REGON, IBAN, passport / ID numbers) — these are better handled deterministically than by the model.
  4. Human review for anything high-stakes.
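
Items 1 and 3 above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's actual post-processing: the function names are hypothetical, `id2label` is assumed to map integer ids to labels such as `B-PER` / `I-PER`, and the PESEL check covers only the standard weighted checksum (weights 1, 3, 7, 9 repeating), not the plausibility of the encoded birth date.

```python
import re
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def recall_first_per(logits, id2label, threshold=0.2):
    """Boolean mask over tokens: True where P(B-PER) + P(I-PER) >= threshold,
    even if another label is the arg-max. logits has shape (seq_len, n_labels)."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    per_ids = [i for i, lab in id2label.items() if lab in ("B-PER", "I-PER")]
    return probs[:, per_ids].sum(axis=-1) >= threshold

# Exactly 11 digits, not embedded in a longer digit run.
PESEL_RE = re.compile(r"(?<![0-9])[0-9]{11}(?![0-9])")
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

def pesel_is_valid(pesel):
    """Check the PESEL control digit (11th digit) against the first ten."""
    if not re.fullmatch(r"[0-9]{11}", pesel):
        return False
    digits = [int(c) for c in pesel]
    total = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits[:10]))
    return (10 - total % 10) % 10 == digits[10]

def find_pesels(text):
    """Return (start, end) offsets of checksum-valid PESEL candidates."""
    return [(m.start(), m.end()) for m in PESEL_RE.finditer(text)
            if pesel_is_valid(m.group())]
```

The mask from `recall_first_per` only says which tokens to treat as PER; deciding where spans begin and end is left to the post-pass in item 2. NIP, REGON and IBAN have their own checksum rules and would need analogous validators.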

Evaluation

Identity-level leak rate: the share of persons for whom at least one mention is missed. A person counts as "leaked" if any mention of them is missed, which is the metric that matters for anonymisation. Measured on a small internal set of 50 real Polish legal documents (179 persons), with the recall-first threshold set to 0.2.

| Document type | Metric | This model | Non-OCR baseline |
|---|---|---|---|
| Clean / digital | leak (threshold) | 11.7% | 10.1% |
| Clean / digital | leak (+ post-pass) | 7.3% | 7.3% |
| Scanned / OCR'd | leak (threshold) | 20.7% | 31.8% |
| Scanned / OCR'd | leak (+ post-pass) | 15.1% | 25.7% |

Token-level test F1 ≈ 0.94.

Takeaway: on scanned / OCR'd documents this variant cuts the person-name leak by ~35–41 % relative vs a model trained without OCR augmentation, at the cost of a small (~2–3 pt) precision drop on clean digital text. If you process both clean and scanned documents, consider routing scanned/photographed uploads to this model and clean digital documents to a non-OCR-augmented variant.
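
For readers who want to reproduce the metric on their own data, the identity-level leak rate and the relative reductions quoted above can be computed as in this minimal sketch. The function names and the input layout (one list of per-mention detection flags per person) are assumptions made for illustration.

```python
def identity_leak_rate(persons):
    """persons: one list per person, holding one boolean per mention
    (True = mention detected). A person is leaked if any mention is missed."""
    leaked = sum(1 for mentions in persons if not all(mentions))
    return leaked / len(persons)

def relative_reduction(baseline, model):
    """Fraction by which `model` lowers the leak rate compared with `baseline`."""
    return (baseline - model) / baseline

# Scanned / OCR'd rows of the table above (values in percent):
print(round(relative_reduction(31.8, 20.7), 3))   # threshold only
print(round(relative_reduction(25.7, 15.1), 3))   # threshold + post-pass
```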

Training data

Fine-tuned on Polish legal and administrative documents: a mix of document templates, programmatically generated labelled examples (with valid, checksum-correct synthetic identifiers and rule-based Polish name declension), and real-world legal-document samples. A render → OCR augmentation pass renders person names with varied fonts and scan-style degradation, runs them through OCR engines, and injects the resulting realistic OCR-corrupted name surfaces (with exact labels) as a small (~3 %) fraction of the training set.

No raw personal data is distributed with this model. Because this is a token-classification model (it outputs a label per input token and cannot generate text), the weights do not reproduce or expose any training document.

Limitations

Concrete, reproducible failure cases: see examples/KNOWN_LIMITATIONS.md (synthetic inputs) — heavily OCR-garbled names can be missed entirely or only partially caught.

  • Polish only.
  • Not a guarantee. A residual leak rate remains (≈7–15 % identity-level on the internal eval); always combine with the deterministic post-pass + checksum regex above and human review for high-stakes use.
  • OCR-robust trade-off: slightly lower precision on clean digital text than a non-OCR-augmented model — prefer document-type routing if both kinds occur.
  • Small evaluation set; numbers are indicative, not a benchmark.
  • Not legal advice; not a substitute for a privacy/compliance review.

License — CC BY 4.0 (attribution required, commercial use allowed)

Released under Creative Commons Attribution 4.0. You may use, modify and redistribute this model, including in commercial products, provided you give appropriate credit. Attribution is required in any use, commercial or not; no other restrictions are added.

Suggested attribution:

Polish legal NER / anonymisation model by lexedit (https://lexedit.ai), licensed CC BY 4.0, fine-tuned from HerBERT (allegro/herbert-base-cased).

This model is a derivative of HerBERT (Allegro), which is itself CC BY 4.0 — please retain attribution to the base model as well.
