HerBERT Polish Legal NER — PII / Anonymization (OCR-robust)
A Polish token-classification (NER) model for detecting personally identifiable
information (PII) in legal and administrative text, fine-tuned from
allegro/herbert-base-cased.
This is the OCR-robust variant: it adds a render → OCR augmentation step so the model keeps recognising person names even when they are corrupted by optical character recognition (scanned / photographed documents — missing Polish diacritics, confusable characters, fragmented surnames).
For clean digital documents, the general-purpose sibling
lexedit/herbert-polish-legal-ner is slightly more precise. Use this OCR-robust variant for scanned / photographed text.
Intended use is defensive: flagging PII so it can be masked / anonymised before a document is shared or processed. It is not a guarantee of complete anonymisation — see Limitations.
▶ Try it in your browser: lexedit.ai/lexedit-privacy — an interactive, fully client-side Polish legal-document anonymisation demo (same anonymiser family; the text never leaves your browser).
📂 Example cases: examples/STRONG_CASES.md (hard
inputs it handles — OCR garble, declension, ALL-CAPS) ·
examples/KNOWN_LIMITATIONS.md (where it still fails).
Labels (29, BIO scheme)
PER (person) · ORG (organisation) · LOC (private address/location) ·
LOC_PUB (public place: city, country) · DATE · MONEY · EMAIL · PHONE ·
ID (national id / case / document number) · IBAN · DIAGNOSIS ·
HEALTH_FACILITY · MEDICAL_ID · WATERMARK, each as B-… / I-…, plus O.
The model distinguishes private locations (LOC, masked) from public
ones (LOC_PUB, usually kept), and treats DATE / MONEY as non-anonymised by
default.
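As a rough illustration of the label inventory and the default policy described above, the sketch below rebuilds the 29 BIO tags from the 14 entity types and separates the types kept by default from the rest. The tag order is arbitrary here (the authoritative mapping is the `id2label` in the model config), and `KEPT_BY_DEFAULT` and `should_mask` are hypothetical names; treating every other type as masked is an assumption.

```python
# The 14 entity types listed in the Labels section.
ENTITY_TYPES = [
    "PER", "ORG", "LOC", "LOC_PUB", "DATE", "MONEY", "EMAIL", "PHONE",
    "ID", "IBAN", "DIAGNOSIS", "HEALTH_FACILITY", "MEDICAL_ID", "WATERMARK",
]

# BIO scheme: one B- and one I- tag per type, plus the outside tag O.
# NOTE: this order is illustrative; read the real ids from config.id2label.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

# Types the card describes as usually kept / non-anonymised by default.
KEPT_BY_DEFAULT = {"LOC_PUB", "DATE", "MONEY"}


def should_mask(entity_type: str) -> bool:
    """Assumed policy: mask everything that is not explicitly kept."""
    return entity_type not in KEPT_BY_DEFAULT


print(len(LABELS))  # 29
```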
Usage (ONNX, no PyTorch required)
pip install onnxruntime transformers numpy
python examples/inference_onnx.py
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("lexedit/herbert-polish-legal-ner-ocr")
sess = ort.InferenceSession("onnx/model_quantized.onnx")
enc = tok("Pozwany Jan Kowalski, PESEL 02070803628.",
          return_offsets_mapping=True, return_tensors="np")
feeds = {"input_ids": enc["input_ids"].astype(np.int64),
         "attention_mask": enc["attention_mask"].astype(np.int64)}
logits = sess.run(None, feeds)[0]  # shape: (1, seq, 29)
# argmax per token -> map ids via config.id2label -> group B-/I- with offset_mapping
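The final step named in the comment above (argmax, id-to-label mapping, B-/I- grouping) could look roughly like the following. This is a minimal sketch, not the logic of `examples/inference_onnx.py`; `decode_bio` is a hypothetical helper, and it assumes special tokens carry empty `(0, 0)` offsets, as fast tokenizers normally report them.

```python
def decode_bio(label_ids, offsets, id2label):
    """Group per-token BIO predictions into (entity_type, start, end) character spans."""
    spans = []
    current = None  # [entity_type, start, end] of the span being built

    for label_id, (start, end) in zip(label_ids, offsets):
        label = id2label[int(label_id)]
        # Special tokens (empty offsets) and O tokens close any open span.
        if start == end or label == "O":
            if current:
                spans.append(tuple(current))
                current = None
            continue
        prefix, etype = label.split("-", 1)
        if prefix == "I" and current and current[0] == etype:
            current[2] = end  # extend the open span
        else:
            # B- tag, or an I- tag without a matching open span: start a new one.
            if current:
                spans.append(tuple(current))
            current = [etype, int(start), int(end)]

    if current:
        spans.append(tuple(current))
    return spans
```

With the variables from the ONNX snippet, a call might be `decode_bio(logits[0].argmax(-1), enc["offset_mapping"][0], id2label)`, where `id2label` is read from the model config (for example via `AutoConfig.from_pretrained(...).id2label`).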
Usage (PyTorch / transformers pipeline)
from transformers import pipeline

ner = pipeline("token-classification",
               model="lexedit/herbert-polish-legal-ner-ocr",
               aggregation_strategy="first")
ner("Pozwany Jan Kowalski, zam. ul. Słoneczna 5 w Krakowie.")
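With an aggregation strategy set, the pipeline returns a list of dicts carrying `entity_group`, `start` and `end` character offsets, which is enough to mask the text. The sketch below shows one way to do that; `redact` is a hypothetical helper, the placeholder format is an assumption, and the `entities` list is a hand-written stand-in rather than actual model output.

```python
def redact(text, entities, keep=("LOC_PUB", "DATE", "MONEY")):
    """Replace each detected span with a [TYPE] placeholder, skipping kept types."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in keep:
            continue
        out = out[:ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out


text = "Pozwany Jan Kowalski, zam. ul. Słoneczna 5 w Krakowie."
# Hand-written stand-in for pipeline output (NOT actual model predictions).
entities = [
    {"entity_group": "PER", "start": text.index("Jan"), "end": text.index(",")},
    {"entity_group": "LOC", "start": text.index("ul."), "end": text.index(" w ")},
    {"entity_group": "LOC_PUB", "start": text.index("Krakowie"), "end": len(text) - 1},
]
print(redact(text, entities))  # Pozwany [PER], zam. [LOC] w Krakowie.
```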
Runs in the browser (the intended setup)
This model is designed to run entirely client-side. The quantized ONNX (~125 MB, int8) loads in the browser via onnxruntime-web or transformers.js (WASM), so the document never leaves the user's device — which is the whole point for sensitive legal / medical text. It also runs anywhere ONNX Runtime does (Python, Node, server, mobile). The interactive demo at lexedit.ai/lexedit-privacy is exactly this: in-browser, nothing uploaded.
Speed (rough)
Quantized ONNX, measured on a laptop CPU (Apple Silicon):
| input | 1 core | all cores |
|---|---|---|
| short sentence (~30 tokens) | ~18 ms | ~13 ms |
| full chunk (~500 tokens) | ~0.2 s | ~0.1 s |
≈ 4–5 chunks/second single-threaded natively. In the browser (WASM,
single-threaded) it is slower but still practical: short text stays interactive
(well under a second), a 1–2 page document takes a few seconds, and a large
scanned document (tens of thousands of characters) can take a minute or two.
Enabling cross-origin isolation (COOP/COEP → SharedArrayBuffer → multi-threaded
WASM) speeds the browser path up considerably.
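Cross-origin isolation is switched on by two response headers on the page that loads the model. As a minimal sketch for local testing only, the standard-library server below adds them to every response; `IsolatedHandler` and `serve` are hypothetical names, and a production deployment would normally set the same headers in its web server or CDN configuration instead.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Headers that make a page cross-origin isolated, which browsers require
# before exposing SharedArrayBuffer (needed for multi-threaded WASM).
ISOLATION_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
}


class IsolatedHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the isolation headers to every response."""

    def end_headers(self):
        for name, value in ISOLATION_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()


def serve(port: int = 8000) -> None:
    """Serve the current directory on localhost; call serve() to start (blocks)."""
    ThreadingHTTPServer(("127.0.0.1", port), IsolatedHandler).serve_forever()
```

In the browser, `window.crossOriginIsolated` reports whether isolation took effect.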
Recommended production setup
This model is recall-first and is meant to be one layer of a pipeline, not the whole solution. For best anonymisation results, pair it with:
- A recall-first threshold for PER (flip a token to PER when the summed PER probability is ≥ ~0.2, even if it is not the arg-max), because anonymisation wants recall ≫ precision.
- A deterministic document-safety post-pass: snap predicted spans to whole words, merge hyphenated surnames, and propagate a detected surname to its other inflected / OCR-variant mentions across the document.
- Checksum-validated regular expressions for structured PII (PESEL, NIP, REGON, IBAN, passport / ID numbers) — these are better handled deterministically than by the model.
- Human review for anything high-stakes.
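Two of the layers above are small enough to sketch: the recall-first PER threshold and a checksum-validated PESEL matcher. This is an illustration under stated assumptions, not the project's own post-processing. All function names are hypothetical, choosing the likelier of B-PER / I-PER when flipping is an assumption, and the PESEL check uses the standard published weights (1, 3, 7, 9, 1, 3, 7, 9, 1, 3).

```python
import math
import re


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recall_first_label(logits, labels, per_threshold=0.2):
    """Return the arg-max label, unless the summed PER mass reaches the threshold."""
    probs = softmax(logits)
    per_idx = [i for i, label in enumerate(labels) if label.endswith("-PER")]
    best = max(range(len(probs)), key=probs.__getitem__)
    if best not in per_idx and sum(probs[i] for i in per_idx) >= per_threshold:
        # Assumption: flip to whichever PER tag (B- or I-) is more probable.
        best = max(per_idx, key=probs.__getitem__)
    return labels[best]


PESEL_RE = re.compile(r"(?<!\d)\d{11}(?!\d)")  # exactly 11 digits, not part of a longer run
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def is_valid_pesel(candidate):
    """Check the PESEL control digit (the 11th digit) against the first ten."""
    if not re.fullmatch(r"\d{11}", candidate):
        return False
    digits = [int(c) for c in candidate]
    weighted = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits))
    return (10 - weighted % 10) % 10 == digits[10]


def find_pesels(text):
    """Return only those 11-digit runs whose checksum is valid."""
    return [m.group() for m in PESEL_RE.finditer(text) if is_valid_pesel(m.group())]


print(find_pesels("Pozwany Jan Kowalski, PESEL 02070803628."))  # ['02070803628']
```

Validating the checksum keeps the regex from flagging arbitrary 11-digit numbers, which is the reason such identifiers are better handled deterministically.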
Evaluation
Identity-level leak rate is the share of persons who are "leaked", where a person counts as leaked if any mention of them is missed (the metric that matters for anonymisation). Measured on a small internal set of 50 real Polish legal documents (179 persons), with the recall-first threshold at 0.2.
| Document type | metric | this model | non-OCR baseline |
|---|---|---|---|
| Clean / digital | leak (threshold) | 11.7% | 10.1% |
| Clean / digital | leak (+ post-pass) | 7.3% | 7.3% |
| Scanned / OCR'd | leak (threshold) | 20.7% | 31.8% |
| Scanned / OCR'd | leak (+ post-pass) | 15.1% | 25.7% |
Token-level test F1 ≈ 0.94.
Takeaway: on scanned / OCR'd documents this variant cuts the person-name leak by ~35–41 % relative vs a model trained without OCR augmentation, at the cost of a small (~2–3 pt) precision drop on clean digital text. If you process both clean and scanned documents, consider routing scanned/photographed uploads to this model and clean digital documents to a non-OCR-augmented variant.
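The metric and the relative figures in the takeaway can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `identity_leak_rate` and `relative_reduction` are hypothetical helper names, and the input format (one boolean per mention, per person) is an assumption about how the evaluation data might be laid out.

```python
def identity_leak_rate(mentions_found):
    """Share of persons with at least one missed mention.

    mentions_found maps a person id to a list of booleans,
    one per mention (True = the mention was detected).
    """
    leaked = sum(1 for found in mentions_found.values() if not all(found))
    return leaked / len(mentions_found)


def relative_reduction(baseline, model):
    """Relative drop in leak rate versus the baseline."""
    return (baseline - model) / baseline


# Scanned / OCR'd rows of the table above (leak rates in percent).
print(round(100 * relative_reduction(31.8, 20.7), 1))  # 34.9 (threshold only)
print(round(100 * relative_reduction(25.7, 15.1), 1))  # 41.2 (+ post-pass)
```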
Training data
Fine-tuned on Polish legal and administrative documents: a mix of document templates, programmatically generated labelled examples (with valid, checksum-correct synthetic identifiers and rule-based Polish name declension), and real-world legal-document samples. A render → OCR augmentation pass renders person names with varied fonts and scan-style degradation, runs them through OCR engines, and injects the resulting realistic OCR-corrupted name surfaces (with exact labels) as a small (~3%) fraction of the training set.
No raw personal data is distributed with this model. Because this is a token-classification model (it outputs a label per input token and cannot generate text), the weights do not reproduce or expose any training document.
Limitations
Concrete, reproducible failure cases: see examples/KNOWN_LIMITATIONS.md (synthetic inputs). Heavily OCR-garbled names can be missed entirely or only partially caught.
- Polish only.
- Not a guarantee. A residual leak rate remains (≈7–15 % identity-level on the internal eval); always combine with the deterministic post-pass + checksum regex above and human review for high-stakes use.
- OCR-robust trade-off: slightly lower precision on clean digital text than a non-OCR-augmented model — prefer document-type routing if both kinds occur.
- Small evaluation set; numbers are indicative, not a benchmark.
- Not legal advice; not a substitute for a privacy/compliance review.
License — CC BY 4.0 (attribution required, commercial use allowed)
Released under Creative Commons Attribution 4.0. You may use, modify and redistribute this model — including in commercial products — provided you give appropriate credit. Attribution is required in any use, commercial or not; no other restrictions are added.
Suggested attribution:
Polish legal NER / anonymisation model by lexedit (https://lexedit.ai), licensed CC BY 4.0, fine-tuned from HerBERT (allegro/herbert-base-cased).
This model is a derivative of HerBERT (Allegro), which is itself CC BY 4.0 — please retain attribution to the base model as well.