Instructions to use ZenSystemAI/ossredact-pii-large with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ZenSystemAI/ossredact-pii-large with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("token-classification", model="ZenSystemAI/ossredact-pii-large")# Load model directly from transformers import AutoTokenizer, AutoModelForTokenClassification tokenizer = AutoTokenizer.from_pretrained("ZenSystemAI/ossredact-pii-large") model = AutoModelForTokenClassification.from_pretrained("ZenSystemAI/ossredact-pii-large") - Notebooks
- Google Colab
- Kaggle
OSSRedact PII NER -- xlm-roberta-large (GPU tier)
A bilingual (French-Québec + English) PII / secrets token-classifier -- the high-capacity detection model behind OSSRedact, a local privacy gateway that redacts private data before it reaches a cloud LLM and rehydrates it on the reply.
Shipping revision: v11r9c. The smaller always-on tier is
ZenSystemAI/ossredact-pii-base (dynamic-INT8 ONNX, also the
in-browser tier).
Top: the v11r9c gain on the synthetic held-out corpus. Bottom: a historical (v6/v7) recall comparison vs Microsoft Presidio on Québec FR/EN PII -- OSSRedact wins recall by 17-23 points with far fewer false positives.
What it is
A xlm-roberta-large token classifier fine-tuned to tag 20 PII / secret entity types in realistic French-
Québec and English documents. It is the GPU/large detection tier of OSSRedact; in production it runs inside
the OSSRedact gateway, which pairs it with a deterministic Tier-0 floor (regex + Luhn + entropy) and AES-GCM
session rehydration. The model runs locally on-device -- no detection call leaves the machine.
The bilingual Québec-French focus is the differentiator: general English-first PII detectors miss FR structure (NEQ, RAMQ, SIN, FR letterhead, accented ALL-CAPS names).
Labels (20)
account_number, address, card_cvv, card_expiry, date_of_birth, email, file_path, government_id,
iban, ip_address, organization, password, payment_card, person, phone_number, postal_code,
secret, sensitive_account_id, tax_id, username (41 BIO label ids).
Intended use
- Primary: the detection tier inside the OSSRedact gateway (redact-on-egress / rehydrate-on-response for cloud LLM traffic), or any local PII-redaction pipeline.
- Also: document de-identification, DLP, privacy review of FR/EN text.
Use it with a deterministic floor. As a standalone NER model, recall is below 100% and
organizationandaddresshave no fallback. OSSRedact gets its hard guarantee from a Tier-0 floor (secrets, payment cards via Luhn, IBANs, government IDs, emails, IPs, file paths) that runs independently of this model. Do not rely on the model alone for the catastrophic categories.
Quick start
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
tok = AutoTokenizer.from_pretrained("ZenSystemAI/ossredact-pii-large", revision="v11r9c")
model = AutoModelForTokenClassification.from_pretrained("ZenSystemAI/ossredact-pii-large", revision="v11r9c").eval()
text = "Contactez Marie-Eve Tremblay au 514-555-0188; NAS 046 454 286."
enc = tok(text, return_offsets_mapping=True, return_tensors="pt")
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
pred = model(**enc).logits[0].argmax(-1).tolist()
for (s, e), p in zip(offsets, pred):
lab = model.config.id2label[p]
if s != e and lab != "O":
print(f"{text[s:e]!r:24} -> {lab}")
Training data
A 100% synthetic French-Québec + English corpus (bank statements, financing forms, email threads, CSV
exports, .env files, code, KYC/tax/SAAQ/RAMQ documents). Every name, SIN, account, card, and secret is
fabricated, so the corpus can be regenerated and re-run anywhere with no real-data exposure. It deliberately
includes adversarial cases (ALL-CAPS, NBSP-separated IDs, mixed FR/EN, long unbroken lines, look-alike decoys,
names glued into code identifiers). The corpus is cumulative across the v11 error-mining rounds (base + the
structural-name and organization/address augmentations).
Evaluation
Measured on a synthetic held-out corpus (7,498 rows, 0 train overlap, unseen document structures). The privacy
metric is full-stack catastrophic DETECTION recall -- any detected span is redacted regardless of which
label it gets, so an intra-catastrophic mislabel is a redaction, not a leak. clean_fp is over-redaction count
on no-PII rows.
| tier | catastrophic full-stack DETECTION | all-label recall | precision | clean_fp |
|---|---|---|---|---|
| GPU / large (v11r9c) | 0.9954 | 0.9882 | 0.9615 | 34 / 7498 |
All-label F1 0.9742. Of the 13 catastrophic categories, email / iban / secret / password / file_path / tax_id /
card_expiry / card_cvv / government_id / postal_code / date_of_birth / ip_address / payment_card all detect at
1.000; person 0.9946 (precision 0.9999), sensitive_account_id 0.9993. Organization 1.00, address
0.95 (v11r9c closed the structural-form leak the prior revision had: organization ~0.10 -> 1.00, address
~0.60 -> 0.95). FR is not weaker than EN. The cost of the org/address fix is more over-redaction on digit-ID-
shaped tokens (clean_fp 12 -> 34) -- the safe failure direction (over-redaction never leaks).
Training recipe: batch size 8, learning rate 2e-5, max length 512, 3 epochs, metric_for_best_model=cat_f1
(checkpoint maximizes recall on the catastrophic-leak labels). 559 M params.
Limitations
- Trained and validated entirely on synthetic Québec data; broader real-world domains are future work.
- French and English only by design.
organizationandaddresshave no deterministic floor -- they rely entirely on this model (well-covered on the synthetic corpus, but model-dependent, not a hard guarantee).- Identifier coverage targets Canadian / Québec formats (SIN, RAMQ, NEQ, postal codes). Foreign formats (US ZIP, Brazilian CPF) are not specifically targeted.
- Full names glued into code identifiers (camelCase / snake_case) are under-detected.
- Recall is below 100%; use within OSSRedact's deterministic floor for the catastrophic categories.
License & links
MIT. Part of OSSRedact by ZenSystemAI. The version label v11rN
is the weight revision (an HF revision tag), not part of the repo id.
- Downloads last month
- -
Model tree for ZenSystemAI/ossredact-pii-large
Base model
FacebookAI/xlm-roberta-large

