How to use lexedit/herbert-polish-legal-ner with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="lexedit/herbert-polish-legal-ner")
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("lexedit/herbert-polish-legal-ner")
model = AutoModelForTokenClassification.from_pretrained("lexedit/herbert-polish-legal-ner")
HerBERT Polish Legal NER — PII / Anonymization
A Polish token-classification (NER) model for detecting personally identifiable
information (PII) in legal and administrative text, fine-tuned from
allegro/herbert-base-cased.
This is the general-purpose variant — the best overall accuracy on clean, digital Polish legal documents.
For scanned / photographed (OCR'd) documents, use the OCR-robust sibling
lexedit/herbert-polish-legal-ner-ocr, which cuts the person-name leak rate on scanned text by roughly 35–41 % in relative terms (at a small precision cost on clean text).
Intended use is defensive: flagging PII so it can be masked / anonymised before a document is shared or processed. It is not a guarantee of complete anonymisation — see Limitations.
▶ Try it in your browser: lexedit.ai/lexedit-privacy — an interactive, fully client-side Polish legal-document anonymisation demo (the text never leaves your browser).
📂 Example cases: examples/STRONG_CASES.md (hard
inputs it handles) · examples/KNOWN_LIMITATIONS.md
(where it still fails).
Labels (29, BIO scheme)
PER (person) · ORG (organisation) · LOC (private address/location) ·
LOC_PUB (public place: city, country) · DATE · MONEY · EMAIL · PHONE ·
ID (national id / case / document number) · IBAN · DIAGNOSIS ·
HEALTH_FACILITY · MEDICAL_ID · WATERMARK, each as B-… / I-…, plus O.
The model distinguishes private locations (LOC, masked) from public
ones (LOC_PUB, usually kept), and treats DATE / MONEY as non-anonymised by
default.
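As a minimal sketch of the label scheme and the default handling described above: 14 entity types expanded into B-/I- tags, plus O, give 29 labels. The label ordering and the `should_mask` helper are illustrative assumptions, not part of the model; the authoritative mapping is the model's `config.id2label`, and masking every type other than LOC_PUB, DATE and MONEY is only one reading of the defaults stated here.

```python
# Entity types as listed in this model card.
ENTITY_TYPES = [
    "PER", "ORG", "LOC", "LOC_PUB", "DATE", "MONEY", "EMAIL", "PHONE",
    "ID", "IBAN", "DIAGNOSIS", "HEALTH_FACILITY", "MEDICAL_ID", "WATERMARK",
]

# BIO expansion: a B- and an I- tag per type, plus the outside tag O.
# NOTE: this ordering is illustrative only; read the real id -> label
# mapping from the model's config.id2label.
LABELS = ["O"] + [f"{prefix}-{etype}"
                  for etype in ENTITY_TYPES
                  for prefix in ("B", "I")]

# Assumed default policy: public places, dates and amounts are kept,
# every other detected type is masked.
KEPT_BY_DEFAULT = frozenset({"LOC_PUB", "DATE", "MONEY"})


def should_mask(entity_type, keep=KEPT_BY_DEFAULT):
    """Return True when a detected entity of this type should be masked."""
    return entity_type in ENTITY_TYPES and entity_type not in keep
```

A stricter deployment could pass `keep=frozenset()` to mask dates and amounts as well.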
Usage (ONNX, no PyTorch required)
# Shell: install the dependencies and run the bundled example
pip install onnxruntime transformers numpy
python examples/inference_onnx.py
# Python: minimal inference (assumes onnx/model_quantized.onnx is available locally)
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("lexedit/herbert-polish-legal-ner")
sess = ort.InferenceSession("onnx/model_quantized.onnx")
enc = tok("Pozwany Jan Kowalski, PESEL 02070803628.",
          return_offsets_mapping=True, return_tensors="np")
feeds = {"input_ids": enc["input_ids"].astype(np.int64),
         "attention_mask": enc["attention_mask"].astype(np.int64)}
logits = sess.run(None, feeds)[0]  # shape: (1, seq, 29)
# argmax per token -> map ids via config.id2label -> group B-/I- with offset_mapping
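The decoding step summarised in that last comment could look roughly like the sketch below. `decode_bio` is a hypothetical helper, not the code in examples/inference_onnx.py. It assumes `logits` has shape (seq, num_labels) for one sequence, `offsets` holds one (start, end) character pair per token, special tokens carry empty (0, 0) offsets, and an I- tag with no matching open span leniently starts a new one.

```python
import numpy as np


def decode_bio(logits, offsets, id2label, text):
    """Group per-token BIO predictions into character-level entity spans."""
    pred_ids = np.argmax(logits, axis=-1)  # one label id per token
    spans = []
    current = None  # the span currently being extended, if any
    for label_id, (start, end) in zip(pred_ids, offsets):
        start, end = int(start), int(end)
        if start == end:
            # Empty offsets: assumed to be special tokens, skip them.
            continue
        label = id2label[int(label_id)]
        if label == "O":
            current = None
            continue
        prefix, etype = label.split("-", 1)
        if prefix == "I" and current is not None and current["type"] == etype:
            current["end"] = end  # continue the open span
        else:
            # B- tag, or an I- tag with nothing to continue: open a new span.
            current = {"type": etype, "start": start, "end": end}
            spans.append(current)
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans
```

With the variables from the snippet above, a call would be along the lines of `decode_bio(logits[0], enc["offset_mapping"][0], id2label, text)`, where `id2label` comes from the model config (for example via `AutoConfig.from_pretrained(...).id2label`).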
Usage (PyTorch / transformers pipeline)
from transformers import pipeline
ner = pipeline("token-classification",
               model="lexedit/herbert-polish-legal-ner",
               aggregation_strategy="first")
ner("Pozwany Jan Kowalski, zam. ul. Słoneczna 5 w Krakowie.")
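With an aggregation strategy set, the pipeline returns a list of dicts containing `entity_group`, `start` and `end` character offsets (alongside `score` and `word`). A minimal sketch of turning that output into masked text follows. `redact` and the `[TYPE]` placeholder format are illustrative choices, and the sketch assumes the spans do not overlap.

```python
def redact(text, entities, keep=frozenset()):
    """Replace each detected span with a [TYPE] placeholder.

    `entities` is a list of dicts with "entity_group", "start" and "end",
    as produced by an aggregated token-classification pipeline.
    Entity types listed in `keep` are left untouched.
    """
    out = text
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in keep:
            continue
        placeholder = f"[{ent['entity_group']}]"
        out = out[:ent["start"]] + placeholder + out[ent["end"]:]
    return out
```

For example, `redact(text, ner(text), keep={"LOC_PUB", "DATE", "MONEY"})` would leave public places, dates and amounts readable while masking the rest.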
Runs in the browser (the intended setup)
This model is designed to run entirely client-side. The quantized ONNX (~125 MB, int8) loads in the browser via onnxruntime-web or transformers.js (WASM), so the document never leaves the user's device — which is the whole point for sensitive legal / medical text. It also runs anywhere ONNX Runtime does (Python, Node, server, mobile). The demo at lexedit.ai/lexedit-privacy is exactly this.
Speed (rough)
Quantized ONNX, measured on a laptop CPU (Apple Silicon):
| input | 1 core | all cores |
|---|---|---|
| short sentence (~30 tokens) | ~18 ms | ~13 ms |
| full chunk (~500 tokens) | ~0.2 s | ~0.1 s |
That is roughly 4–5 full chunks per second single-threaded when run natively. In the browser (WASM, single-threaded) it is slower but practical: short text stays interactive, a 1–2 page document takes a few seconds, and a large scanned document can take a minute or two. Enabling cross-origin isolation (COOP/COEP headers), which allows multi-threaded WASM, speeds it up.
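For a back-of-the-envelope estimate of native processing time, the table figures can be plugged into a small helper. This is a rough sketch only: `estimate_seconds` is a hypothetical name, it assumes non-overlapping ~500-token chunks, and it ignores tokenisation, model loading and any post-processing.

```python
import math

# Rough per-chunk timings taken from the table above (quantized ONNX, laptop CPU).
SECONDS_PER_CHUNK_ONE_CORE = 0.2
SECONDS_PER_CHUNK_ALL_CORES = 0.1
CHUNK_TOKENS = 500  # approximate size of a full chunk


def estimate_seconds(n_tokens, seconds_per_chunk=SECONDS_PER_CHUNK_ONE_CORE,
                     chunk_tokens=CHUNK_TOKENS):
    """Estimate inference time, assuming non-overlapping fixed-size chunks."""
    n_chunks = math.ceil(n_tokens / chunk_tokens)
    return n_chunks * seconds_per_chunk
```

Under these assumptions a 5,000-token document is 10 chunks, or about 2 seconds on one core.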
Recommended production setup
This model is recall-first and is one layer of a pipeline, not the whole solution. For best anonymisation, pair it with:
- A recall-first threshold for PER (flip a token to PER when the summed PER probability ≥ ~0.2, even if it is not the arg-max).
- A deterministic document-safety post-pass: snap spans to whole words, merge hyphenated surnames, and propagate a detected surname to its other inflected / OCR-variant mentions across the document.
- Checksum-validated regex for structured PII (PESEL, NIP, REGON, IBAN, …).
- Human review for high-stakes use.
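The recall-first PER threshold from the first item can be sketched as follows. This is an illustration under assumptions, not the author's implementation: "summed PER probability" is read as P(B-PER) + P(I-PER), and a flipped token gets whichever of the two PER tags is more probable. `recall_first_per` and its parameters are hypothetical names.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def recall_first_per(logits, label2id, threshold=0.2):
    """Return label ids, flipping tokens to PER when the PER mass is high enough.

    logits: array of shape (seq, num_labels) for one sequence.
    """
    probs = softmax(logits)
    pred = probs.argmax(axis=-1)
    b_per, i_per = label2id["B-PER"], label2id["I-PER"]
    per_mass = probs[:, b_per] + probs[:, i_per]  # assumed meaning of "summed"
    already_per = (pred == b_per) | (pred == i_per)
    flip = (per_mass >= threshold) & ~already_per
    # Assumption: use the more probable of the two PER tags for flipped tokens.
    per_choice = np.where(probs[:, b_per] >= probs[:, i_per], b_per, i_per)
    return np.where(flip, per_choice, pred)
```

Lowering `threshold` trades precision for recall, which matches the recall-first stance of this setup.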
Evaluation
Identity-level leak rate: a person counts as "leaked" if any mention of them is missed. Results are from an internal set of 50 real Polish legal documents (179 persons), using the recall-first threshold of 0.2.
| Document type | metric | this model | OCR-robust variant |
|---|---|---|---|
| Clean / digital | leak (threshold) | 10.1% | 11.7% |
| Clean / digital | leak (+ post-pass) | 7.3% | 7.3% |
| Scanned / OCR'd | leak (threshold) | 31.8% | 20.7% |
| Scanned / OCR'd | leak (+ post-pass) | 25.7% | 15.1% |
Token-level test F1 ≈ 0.94.
Takeaway: this is the strongest variant on clean digital text (lowest leak, best precision). On scanned / OCR'd text it is weaker — there the OCR-robust variant wins. If you process both, route by document type.
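The identity-level leak rate defined at the start of this section can be computed as in the sketch below. The input format (a dict mapping each person to one boolean per mention) and the name `identity_leak_rate` are assumptions for illustration; the internal evaluation code is not published here.

```python
def identity_leak_rate(mention_hits):
    """Fraction of persons with at least one missed mention.

    mention_hits maps a person id to a list of booleans, one per mention
    of that person, True when the mention was detected.
    """
    if not mention_hits:
        return 0.0
    leaked = sum(1 for hits in mention_hits.values() if not all(hits))
    return leaked / len(mention_hits)
```

This metric is stricter than mention-level recall: one missed mention out of ten still counts the whole person as leaked. For scale, 7.3 % of 179 persons is about 13 people.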
Training data
Fine-tuned on Polish legal and administrative documents — a mix of document templates, programmatically-generated labelled examples (valid checksum-correct synthetic identifiers, rule-based Polish name declension), and real-world legal-document samples.
No raw personal data is distributed with this model. Because this is a token-classification model (it outputs a label per input token and cannot generate text), the weights do not reproduce or expose any training document.
Limitations
Concrete, reproducible failure cases: see examples/KNOWN_LIMITATIONS.md (synthetic inputs). Heavily OCR-garbled names are the main weakness; for scanned documents prefer the OCR-robust variant.
- Polish only.
- Not a guarantee. A residual leak rate remains (≈7–10 % identity-level on clean text); always combine with the deterministic post-pass + checksum regex and human review for high-stakes use.
- Scanned / OCR'd text is the weak spot of this variant (heavily garbled names can be missed) — route those to the OCR-robust variant.
- Small evaluation set; numbers are indicative, not a benchmark.
- Not legal advice; not a substitute for a privacy/compliance review.
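The checksum-validated regex layer recommended above for structured identifiers can be sketched for PESEL as follows. It uses the standard PESEL check-digit weights (1, 3, 7, 9 repeated over the first ten digits). The helper names are hypothetical, the sketch does not validate the encoded birth date, and NIP, REGON and IBAN would each need their own rule.

```python
import re

# Eleven digits not embedded in a longer digit run.
PESEL_RE = re.compile(r"(?<![0-9])[0-9]{11}(?![0-9])")
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def is_valid_pesel(candidate):
    """Check the PESEL control digit (the birth-date part is not validated)."""
    if not re.fullmatch(r"[0-9]{11}", candidate):
        return False
    digits = [int(ch) for ch in candidate]
    total = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits))
    return (10 - total % 10) % 10 == digits[10]


def find_pesels(text):
    """Return (start, end) offsets of checksum-valid PESEL numbers in text."""
    return [(m.start(), m.end())
            for m in PESEL_RE.finditer(text)
            if is_valid_pesel(m.group())]
```

Validating the checksum keeps this layer deterministic and cuts false positives from arbitrary 11-digit strings, so it complements the statistical model rather than replacing it.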
License — CC BY 4.0 (attribution required, commercial use allowed)
Released under Creative Commons Attribution 4.0. You may use, modify and redistribute this model — including in commercial products — provided you give appropriate credit. Attribution is required in any use, commercial or not; no other restrictions are added.
Suggested attribution:
Polish legal NER / anonymisation model by lexedit (https://lexedit.ai), licensed CC BY 4.0, fine-tuned from HerBERT (allegro/herbert-base-cased).
This model is a derivative of HerBERT (Allegro), which is itself CC BY 4.0 — please retain attribution to the base model as well.