kp-deid-mdeberta-280m

A KlusAI Privacy (KP) de-identification model — a multilingual PII/PHI token classifier emitting the harmonized KP BIOES taxonomy. Part of the EuroPriv-Bench program. First model of the kp-deid xlmr-ner family.

Status: full multilingual run (KLU-44). This checkpoint is the full-data LoRA fine-tune on all three live general-text datasets (RO + EN + PL, 150k examples) for 3 epochs on the Mac Studio GPU (Metal/MPS, KLU-45), with a small held-out hyperparameter sweep. It supersedes the earlier bounded 4k-example CPU smoke checkpoint. Scores are reported only as an open head-to-head delta on the contamination-free RO real-skeleton track, never as "SOTA". The RO track keeps contamination status clean_held_out (no model on the board was trained on it) and config status dev until the KLU-27 native-speaker / IAA sign-off.

Model Details

Property	Value
Task	Token classification (PII/PHI detection), BIOES
Base model	`microsoft/mdeberta-v3-base` (280M)
Method	LoRA (`r=16`, `lora_alpha=32`, `target_modules=query_proj/key_proj/value_proj`, `TaskType.TOKEN_CLS`), merged into the base
Languages	Romanian (ro), English (en), Polish (pl)
Domain	general / legal / clinical / admin (multilingual mix)
Taxonomy	Harmonized KP (GDPR-aligned crosswalk), `europriv_bench.taxonomy.bioes_labels()`
Device / backend	transformers + peft on the Mac GPU (Metal/MPS, KLU-45); CPU is the guaranteed fallback. MLX is N/A for this family (KLU-11) — no `-mlx` variant
Training data	`klusai/ds-kp-general-{ro,en,pl}-50k` (150,000 examples; 145,500 train / 4,500 held-out eval)
Epochs	3
Chosen hyperparameters	lr=3e-4, LoRA r=16 (selected via the sweep below; see the KLU-54 caveat — eval-loss is not a quality signal)
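
As an illustration of what the BIOES output format implies downstream, the sketch below turns a per-token tag sequence into entity spans. It assumes tags are written as `B-`/`I-`/`E-`/`S-` plus a label, and `O` for non-entities. The label `PERSON` is illustrative only; the authoritative label set is whatever `europriv_bench.taxonomy.bioes_labels()` returns. The function name `bioes_to_spans` is hypothetical and not part of the KP SDK.

```python
from typing import List, Optional, Tuple

# (start_token, end_token_exclusive, entity_label)
Span = Tuple[int, int, str]


def bioes_to_spans(tags: List[str]) -> List[Span]:
    """Strictly decode BIOES tags; ill-formed or unfinished spans are dropped."""
    spans: List[Span] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S" and label:
            # Single-token entity.
            spans.append((i, i + 1, label))
            start, current = None, None
        elif prefix == "B" and label:
            # Open a new multi-token entity.
            start, current = i, label
        elif prefix == "I" and start is not None and label == current:
            # Continue the open entity.
            continue
        elif prefix == "E" and start is not None and label == current:
            # Close the open entity.
            spans.append((start, i + 1, label))
            start, current = None, None
        else:
            # "O" or an inconsistent tag: discard any open span.
            start, current = None, None
    return spans


tags = ["O", "B-PERSON", "I-PERSON", "E-PERSON", "O", "S-ACCOUNT_ID", "B-PERSON", "O"]
print(bioes_to_spans(tags))  # [(1, 4, 'PERSON'), (5, 6, 'ACCOUNT_ID')]
```

Strict decoding is a design choice in this sketch: a lenient decoder that repairs broken sequences would recover more spans at the cost of more boundary errors.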

Hyperparameter sweep

A small sweep over learning rate × LoRA rank was run on a fixed 30k-example multilingual subset, 2 epochs per configuration, with the winner picked by eval-loss on 4,500 examples:

lr	LoRA r	eval_loss
3e-4	16	0.000020 (best)
2e-4	16	0.000037
2e-4	32	0.000029

The best config (lr=3e-4, r=16) was then retrained on the full 145,500-example training split for 3 epochs. Total wall-clock on MPS: ~50 min (sweep + final run).
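
For a rough sense of what the r=16 versus r=32 choice means in adapter size, the arithmetic below counts LoRA parameters for the three targeted projections. It assumes the usual `mdeberta-v3-base` shape (12 encoder layers, hidden size 768, square projection matrices) and ignores the token-classification head, which is trained separately from the adapters. These shape values are assumptions of this sketch, not figures taken from the training logs.

```python
def lora_param_count(r: int, d_in: int, d_out: int, n_modules: int) -> int:
    """Each adapted linear layer adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out) * n_modules


HIDDEN = 768           # assumed hidden size of mdeberta-v3-base
LAYERS = 12            # assumed number of encoder layers
TARGETS_PER_LAYER = 3  # query_proj, key_proj, value_proj

n_modules = LAYERS * TARGETS_PER_LAYER
for r in (16, 32):
    print(r, lora_param_count(r, HIDDEN, HIDDEN, n_modules))
# 16 884736
# 32 1769472
```

Under these assumptions the chosen r=16 adapters add under a million parameters, well below 1% of the 280M base.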

⚠️ These eval-loss numbers are NOT a quality signal (KLU-54). This run used a leaky eval split: eval was a shuffled head of the same generator corpus and shared all 6 sentence templates with train, so eval-loss measured memorization rather than generalization (hence the implausibly low final_eval_loss of ~7.2e-10). The training pipeline now uses a template-disjoint held-out split (template_disjoint_split; see docs/klu-54-eval-split.md), under which eval-loss lands in a plausible band (0.23 on a matched short run). Model quality is measured only by the EuroPriv-Bench harness scores below, which are unaffected by the leak. Re-running this published checkpoint under the corrected split is a follow-up.
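
The idea behind a template-disjoint split can be sketched as follows. This is not the project's `template_disjoint_split` implementation; the function name `split_by_template`, the `template_id` field, and the default fraction are all hypothetical. The point is only that whole templates, not individual examples, are assigned to one side of the split.

```python
import random
from typing import Dict, List, Tuple

Example = Dict[str, object]


def split_by_template(
    examples: List[Example],
    eval_fraction: float = 0.2,
    seed: int = 0,
    key: str = "template_id",  # hypothetical field naming the source template
) -> Tuple[List[Example], List[Example]]:
    """Assign whole templates to train or held-out so no template is shared."""
    templates = sorted({ex[key] for ex in examples})
    if len(templates) < 2:
        raise ValueError("need at least two templates for a disjoint split")
    rng = random.Random(seed)
    rng.shuffle(templates)
    # Hold out at least one template, but always leave one for training.
    n_eval = min(len(templates) - 1, max(1, round(len(templates) * eval_fraction)))
    eval_templates = set(templates[:n_eval])
    train = [ex for ex in examples if ex[key] not in eval_templates]
    held_out = [ex for ex in examples if ex[key] in eval_templates]
    return train, held_out


# Toy corpus: 6 templates with 10 examples each.
corpus = [{"template_id": t, "text": f"example {i} of template {t}"}
          for t in range(6) for i in range(10)]
train, held_out = split_by_template(corpus)
print(len(train), len(held_out))  # 50 10
```

With only 6 templates the held-out side covers a single template at this fraction, so the resulting eval-loss would still be a coarse signal.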

Evaluation

Scored on EuroPriv-Bench ro-realskeleton-v1 (the citable, contamination-free Romanian real-structure track) via the harness kp-model adapter — entity F1 / recall-weighted F2 plus CNP re-identification leakage with 95% Wilson confidence intervals. Numbers are filled into the program leaderboard (baselines/leaderboard-kp-realskeleton.json) with full provenance (harness + taxonomy + dataset revisions).
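
For readers unfamiliar with the Wilson interval used for the leak-rate, here is a minimal sketch of the standard Wilson score formula. It is applied to the 0 leaks out of 1123 valid CNPs reported in the results table. It is a generic implementation and not necessarily the harness code; the function name `wilson_interval` is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(events: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(events=0, n=1123)
print(f"{low:.4f} - {high:.4f}")  # 0.0000 - 0.0034
```

Unlike the naive normal approximation, the Wilson interval gives a non-zero upper bound even when zero leaks are observed, which is why a 0.000 leak-rate still carries an upper bound of about 0.34%.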

Results on ro-realskeleton-v1 (n=1500; contamination=clean_held_out, config_status=dev; europriv-bench 0.2.0 / taxonomy 0.2.0):

Metric	Full multilingual run (this model)	4k-RO CPU smoke baseline
Entity F1 (P / R)	0.741 (0.686 / 0.805)	0.683 (0.642 / 0.730)
Entity F2 (recall-weighted)	0.778	0.710
CNP leak-rate (95% Wilson CI)	0.000 (0.000–0.0034); 1123/1123 detected	0.000 (0.000–0.0034); 1123/1123

The full multilingual run lifts entity F1 by +5.8 points over the smoke checkpoint (driven by +7.5 points of recall and +4.4 points of precision) while holding CNP re-identification leakage at 0.0% (all 1123 valid CNPs redacted). As above, this is framed as an open head-to-head delta on the contamination-free RO real-skeleton, never "SOTA".
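
The F1 and recall-weighted F2 figures follow from precision and recall through the standard F-beta formula. The sketch below recomputes this model's scores from the rounded precision and recall in the table. Because the inputs are rounded, recomputed values for other rows may differ from the published ones in the last digit.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.686, 0.805  # this model, from the table above
print(round(f_beta(precision, recall), 3))            # 0.741
print(round(f_beta(precision, recall, beta=2.0), 3))  # 0.778
```

F2 is the natural headline for de-identification, since a missed identifier (lost recall) is usually costlier than an over-redaction (lost precision).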

Intended Use & Limitations

Research de-identification for Romanian / English / Polish general / legal / clinical / administrative text. Trained only on synthetic-PII general text; do not deploy as-is. Long alphanumeric IDs (IBAN-style ACCOUNT_ID) can still over-fragment at the span boundary — the main F1 limiter. Always use behind a governance layer (human review / deterministic pre-filters such as CNP/IBAN validators). Not a substitute for legal compliance review.
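
As one example of a deterministic pre-filter, the sketch below checks the control digit of a Romanian CNP using the publicly documented scheme: weights 279146358279 over the first 12 digits, sum modulo 11, with a remainder of 10 mapped to 1. The function names are hypothetical and not part of the `klusai-privacy` SDK. The sketch checks only the checksum, not the encoded date or county code, and the sample number is synthetic.

```python
CNP_WEIGHTS = (2, 7, 9, 1, 4, 6, 3, 5, 8, 2, 7, 9)


def cnp_check_digit(first12: str) -> int:
    """Compute the CNP control digit from the first 12 digits."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    total = sum(int(d) * w for d, w in zip(first12, CNP_WEIGHTS))
    remainder = total % 11
    return 1 if remainder == 10 else remainder


def is_valid_cnp(candidate: str) -> bool:
    """True if the candidate is 13 digits and its control digit matches."""
    if len(candidate) != 13 or not (candidate.isascii() and candidate.isdigit()):
        return False
    return cnp_check_digit(candidate[:12]) == int(candidate[12])


print(is_valid_cnp("1800101221144"))  # True  (synthetic example)
print(is_valid_cnp("1800101221145"))  # False (control digit altered)
```

A validator like this can run alongside the model: any 13-digit string that passes the checksum is redacted regardless of what the classifier predicts.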

Citation

@misc{klusai_europriv_2026,
  title  = {EuroPriv-Bench: A Unified Pan-European De-identification Benchmark},
  author = {KlusAI},
  year   = {2026}
}

Related Artifacts

Artifact	HF ID
Benchmark	`klusai/europriv-bench`
Training data	`klusai/ds-kp-general-{ro,en,pl}-50k`
SDK	`klusai-privacy` (extract_pii / deidentify / pseudonymize)
