pii-ner-model

Dynamic-INT8 ONNX export of akdeniz27/bert-base-turkish-cased-ner (BERTurk, MIT). It detects free-text PII — names and addresses — that a deterministic regex masker can't catch, and runs in-process via onnxruntime (no torch).
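
As a rough illustration of the in-process path, the sketch below loads the three runtime artifacts with `tokenizers` and `onnxruntime` and returns one label per token. It assumes the exported graph takes the usual BERT inputs (`input_ids`, `attention_mask`, `token_type_ids`) and returns logits as its first output. The names `build_feeds` and `predict_labels` are illustrative and are not taken from src/privacy/ner.py.

```python
import json
from pathlib import Path


def build_feeds(encoding_fields, input_names):
    """Keep only the tensors the ONNX graph declares, each as a batch of one."""
    return {name: [encoding_fields[name]] for name in input_names if name in encoding_fields}


def predict_labels(model_dir, text):
    """Return (token, label) pairs for `text`, special tokens included.

    Assumption: BERT-style input names and logits as the first graph output.
    """
    # Imported lazily so build_feeds stays usable without the runtime deps.
    import numpy as np
    import onnxruntime as ort
    from tokenizers import Tokenizer

    model_dir = Path(model_dir)
    tokenizer = Tokenizer.from_file(str(model_dir / "tokenizer.json"))
    session = ort.InferenceSession(
        str(model_dir / "model.int8.onnx"), providers=["CPUExecutionProvider"]
    )
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    id2label = config["id2label"]  # JSON object, so the keys are strings

    enc = tokenizer.encode(text)
    fields = {
        "input_ids": enc.ids,
        "attention_mask": enc.attention_mask,
        "token_type_ids": enc.type_ids,
    }
    input_names = [node.name for node in session.get_inputs()]
    feeds = {
        name: np.asarray(batch, dtype=np.int64)
        for name, batch in build_feeds(fields, input_names).items()
    }

    logits = session.run(None, feeds)[0]  # assumed shape: (1, seq_len, num_labels)
    label_ids = logits[0].argmax(axis=-1)
    return [(tok, id2label[str(int(i))]) for tok, i in zip(enc.tokens, label_ids)]
```

Filtering the feeds by `session.get_inputs()` keeps the sketch working if the export omits `token_type_ids`.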

Freya's voice agent loads it for freeform-PII redaction (src/privacy/ner.py, LocalPiiDetector); the agent image fetches this repo at build time into PII_NER_MODEL_DIR. NER is optional and fail-open, and is gated per agent by privacy_config.mask_pii.
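
One way to read "optional and fail-open": when masking is disabled, or when detection raises, the text passes through without NER redaction instead of failing the call. The sketch below follows that reading. `redact_freeform`, the `detect` callable, and the `(start, end, label)` span format are hypothetical, and the real LocalPiiDetector interface may differ.

```python
import logging

logger = logging.getLogger("privacy.ner")


def redact_freeform(text, detect, mask_pii=True):
    """Replace detected spans with [LABEL]; return the input unchanged on failure.

    `detect` is a hypothetical callable returning (start, end, label) character spans.
    """
    if not mask_pii:  # per-agent gate, mirroring privacy_config.mask_pii
        return text
    try:
        spans = detect(text)
    except Exception:  # fail-open: an NER problem must not break the call
        logger.warning("NER detection failed; skipping freeform redaction", exc_info=True)
        return text
    # Apply right-to-left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text
```

Under this design a broken or missing model degrades to whatever the deterministic masker already covers, which is the trade-off that fail-open implies.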

Files

File	Description
`model.int8.onnx`	dynamic-INT8-quantized BERTurk token-classification model (~106 MB)
`tokenizer.json`	Rust-tokenizer config for the `onnxruntime` path
`config.json`	`id2label` map for decode
`export_model.py`	the offline recipe that produced the artifacts (not used at runtime)

Labels

7-class BIO: O, B-PER/I-PER, B-ORG/I-ORG, B-LOC/I-LOC. Downstream mapping: PER -> NAME, LOC -> ADDRESS; ORG is dropped.
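
The label scheme and downstream mapping can be sketched as a small decoder that merges per-token BIO tags into character spans, renames PER and LOC, and drops ORG. `bio_to_spans` and `ENTITY_MAP` are illustrative names. The handling of a stray `I-` tag (treated as the start of a new span) is an assumption, not something this repo specifies.

```python
# Downstream mapping from the text above: PER -> NAME, LOC -> ADDRESS, ORG dropped.
ENTITY_MAP = {"PER": "NAME", "LOC": "ADDRESS"}


def bio_to_spans(labels, offsets):
    """Merge per-token BIO labels into (start, end, mapped_label) character spans.

    `offsets` holds one (start, end) character pair per token.
    """
    spans = []
    current = None  # [start, end, entity_type] of the span being built
    for label, (start, end) in zip(labels, offsets):
        prefix, _, etype = label.partition("-")
        if prefix == "I" and current and current[2] == etype:
            current[1] = end  # extend the open span
            continue
        if current:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):  # assumption: a stray I- tag opens a new span
            current = [start, end, etype]
    if current:
        spans.append(tuple(current))
    # Keep only mapped entity types; ORG falls out here.
    return [(s, e, ENTITY_MAP[t]) for s, e, t in spans if t in ENTITY_MAP]
```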

Quality

Validated on Turkish text: name F1 is ~1.00 on cased input and ~0.93–0.95 on ASR-style lowercase input. INT8 is effectively lossless vs fp32 on cased text. Address (LOC) detection is weaker on lowercase ASR text.
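
For readers who want to reproduce a score of this kind, here is a minimal sketch of exact-match, entity-level F1 over `(start, end, label)` spans. Whether the figures above were computed at entity level or token level is not stated here, so treat the matching rule as an assumption. `entity_f1` is an illustrative name.

```python
def entity_f1(gold, predicted):
    """Exact-match F1 over collections of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    if not gold and not predicted:
        return 1.0  # nothing to find and nothing predicted
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```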

Regenerating

Regenerating the artifacts needs torch and optimum[onnxruntime], which are not runtime dependencies:

```shell
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install "optimum[onnxruntime]" transformers
python export_model.py --model akdeniz27/bert-base-turkish-cased-ner --out /tmp/pii-ner
# then copy model_quantized.onnx -> model.int8.onnx, plus tokenizer.json + config.json
```
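
The final copy step can also be scripted. The sketch below assumes the export directory contains `model_quantized.onnx`, `tokenizer.json`, and `config.json`, as the last comment in the recipe implies. `collect_artifacts` is a hypothetical helper and is not part of export_model.py.

```python
import shutil
from pathlib import Path

# Source name in the export directory -> name expected in this repo.
ARTIFACTS = {
    "model_quantized.onnx": "model.int8.onnx",
    "tokenizer.json": "tokenizer.json",
    "config.json": "config.json",
}


def collect_artifacts(export_dir, repo_dir):
    """Copy the exported files into repo_dir, renaming the quantized model."""
    export_dir, repo_dir = Path(export_dir), Path(repo_dir)
    missing = [src for src in ARTIFACTS if not (export_dir / src).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {export_dir}: {', '.join(missing)}")
    repo_dir.mkdir(parents=True, exist_ok=True)
    return [
        Path(shutil.copy2(export_dir / src, repo_dir / dst))
        for src, dst in ARTIFACTS.items()
    ]
```

For example, `collect_artifacts("/tmp/pii-ner", ".")` would place the three runtime files in the current directory, failing early if the export is incomplete.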

License

MIT — same as the base model. See LICENSE. Base model: akdeniz27/bert-base-turkish-cased-ner.
