qwen-2.5-1.5b-de-pii-redactor

A small, fast, GDPR-compliant (DSGVO-konform) PII redactor for German business documents: emails, support tickets, CRM notes, contracts, incident reports. It emits structured JSON (not raw spans) with the redacted text, a list of detected entities, a risk level, and a human-review flag. The entity catalog is built around German-specific identifiers (Steuer-ID, USt-IdNr, IBAN, Sozialversicherungsnummer, Handelsregisternummer, Kfz-Kennzeichen, Krankenversichertennummer) in a business context. It is deliberately not clinical and not span-only.

Why another PII model?

Generic multilingual PII detectors (GLiNER-PII, OpenMed-PII) are strong but either (a) span-only, (b) English-centric with German as an afterthought, or (c) clinical-only. This adapter fills the gap for German business documents with three specific design choices:

  1. German-specific identifier coverage. USt-IdNr, Steuer-ID, Sozialversicherungsnummer, Krankenversichertennummer, Handelsregisternummer, Kfz-Kennzeichen, IBAN: the IDs a German GDPR (DSGVO) audit actually asks about.
  2. Structured output contract. Pydantic-validated RedactionResult with redacted text, entity list, risk_level and needs_human_review — the shape downstream pipelines want to consume. No post-hoc span-reconstruction.
  3. Small enough to self-host. Qwen2.5-1.5B + LoRA adapter fits on a consumer 24 GB card in 4-bit; on an A40 the bf16 path is fast. This is the model you run on-prem when your compliance team says "no customer PII leaves the building".

Entity catalog

Type                 What
PERSON               First and last name of a natural person
ADDRESS              Postal address (street, postal code, city)
EMAIL                Email address
PHONE                Phone number (mobile/landline, international/national)
DOB                  Date of birth
IBAN                 IBAN (DE + international)
BIC                  Bank Identifier Code
TAX_ID               Steuer-Identifikationsnummer, the tax identification number (11 digits, §139b AO)
VAT_ID               USt-IdNr, the VAT identification number (DE + 9 digits)
SSN_DE               Sozialversicherungsnummer, the social insurance number (12 characters)
HEALTH_INSURANCE     Krankenversichertennummer, the health insurance number (10 characters)
ID_CARD              Personalausweis (national ID card) or passport number
LICENSE_PLATE        German vehicle licence plate (Kfz-Kennzeichen)
COMMERCIAL_REGISTER  Handelsregisternummer, the commercial register number (HRA/HRB)
IP_ADDRESS           IPv4 / IPv6
CUSTOMER_ID          Internal customer, order, or ticket ID
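
Several of the identifiers in the catalog have a fixed shape or a checksum, which makes a cheap deterministic cross-check possible after the model has run. The sketch below is not part of the released adapter; the function names are hypothetical. It shows the standard mod-97 check for IBANs and a shape check for the German USt-IdNr ("DE" followed by 9 digits, as listed in the table).

```python
import re


def iban_checksum_ok(iban: str) -> bool:
    """Return True if the string passes the IBAN mod-97 check."""
    compact = re.sub(r"\s+", "", iban).upper()
    # Country code, two check digits, then 11 to 30 alphanumeric characters.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    # Move the first four characters to the end, then map A=10 ... Z=35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def vat_id_shape_ok(vat_id: str) -> bool:
    """Shape check only: 'DE' followed by exactly 9 digits."""
    compact = re.sub(r"\s+", "", vat_id).upper()
    return re.fullmatch(r"DE\d{9}", compact) is not None


if __name__ == "__main__":
    print(iban_checksum_ok("DE89 3704 0044 0532 0130 00"))
    print(vat_id_shape_ok("DE123456789"))
```

A failed check does not prove the model was wrong (synthetic or mistyped identifiers are still PII worth redacting), so one reasonable use is to feed failures into needs_human_review rather than to drop the entity.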

Output schema

from pydantic import BaseModel
from typing import Literal

EntityType = Literal[
    "PERSON", "ADDRESS", "EMAIL", "PHONE", "DOB",
    "IBAN", "BIC", "TAX_ID", "VAT_ID", "SSN_DE",
    "HEALTH_INSURANCE", "ID_CARD", "LICENSE_PLATE",
    "COMMERCIAL_REGISTER", "IP_ADDRESS", "CUSTOMER_ID",
]

class PiiEntity(BaseModel):
    type: EntityType
    value: str            # original text
    replacement: str      # e.g. "[PERSON_1]"

class RedactionResult(BaseModel):
    redacted_text: str
    entities: list[PiiEntity]
    risk_level: Literal["low", "medium", "high"]
    needs_human_review: bool

Training data

  • N: 75 train + 16 eval synthetic German business documents
  • Generated by: Claude Opus, sampled across 6 document types (support email, CRM note, HR email, contract snippet, ops incident, legal intake) × 16 entity-combination templates
  • Validated by: RedactionResult.model_validate() plus span-level sanity checks (every entity.value must occur in the input, every entity.replacement must occur in redacted_text); failures dropped
  • Open-source by design: all data is synthetic with fictional identifiers so the full corpus + training harness can be published without exposing real PII
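
The span-level sanity checks described above can be sketched in a few lines. This is an illustration under my own naming (span_sanity_errors is hypothetical and works on plain dicts), not the code from the training harness.

```python
def span_sanity_errors(input_text: str, result: dict) -> list[str]:
    """Collect violations of the two span-level rules for one example.

    Rule 1: every entity value must occur verbatim in the input text.
    Rule 2: every entity replacement must occur in redacted_text.
    """
    errors = []
    redacted = result.get("redacted_text", "")
    for ent in result.get("entities", []):
        if ent["value"] not in input_text:
            errors.append(f"value not in input: {ent['value']!r}")
        if ent["replacement"] not in redacted:
            errors.append(f"replacement not in redacted_text: {ent['replacement']!r}")
    return errors


if __name__ == "__main__":
    text = "Kundennummer: K-884421."
    example = {
        "redacted_text": "Kundennummer: [CUSTOMER_ID_1].",
        "entities": [
            {"type": "CUSTOMER_ID", "value": "K-884421",
             "replacement": "[CUSTOMER_ID_1]"},
        ],
    }
    # An example is kept only if the error list is empty.
    print(span_sanity_errors(text, example))
```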

Training setup

  • Base: Qwen/Qwen2.5-1.5B-Instruct (Apache 2.0, no gating)
  • Method: LoRA (PEFT) via TRL SFTTrainer, conversational chat format
  • LoRA config: r=32, alpha=64, dropout=0.05, target_modules = attention + MLP projections (q/k/v/o, gate/up/down)
  • Optimiser: AdamW (torch), cosine schedule, warmup ratio 0.03, learning rate 4e-4
  • Batch: 4 per device × 4 gradient-accumulation steps (effective batch 16), bf16, gradient checkpointing
  • Epochs: 8
  • Max seq len: 2048
  • Hardware: single NVIDIA A40 (48 GB) on RunPod
  • Wall time: 7 minutes
  • GPU cost: about USD 0.05 for the training run
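
To make the scale of this run concrete, here is a back-of-the-envelope calculation from the settings above. The step count is an upper bound, since trainer versions differ in how they treat the last partial accumulation window. The Qwen2.5-1.5B layer dimensions are my assumption (they are not stated in this card; check them against the base model's config.json), as is the assumption that the adapter is stored in fp32.

```python
import math

# Values stated in the training setup above.
train_examples = 75
per_device_batch = 4
grad_accum_steps = 4
epochs = 8
warmup_ratio = 0.03
lora_rank = 32

effective_batch = per_device_batch * grad_accum_steps
# Upper bound: some trainer versions drop the last partial accumulation window.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(warmup_ratio * total_steps)

# ASSUMPTION: Qwen2.5-1.5B dimensions, not taken from this card.
hidden, kv_dim, intermediate, layers = 1536, 256, 8960, 28


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA pair adds an (r x d_in) and a (d_out x r) matrix."""
    return r * (d_in + d_out)


per_layer = (
    2 * lora_params(hidden, hidden, lora_rank)          # q_proj, o_proj
    + 2 * lora_params(hidden, kv_dim, lora_rank)        # k_proj, v_proj
    + 2 * lora_params(hidden, intermediate, lora_rank)  # gate_proj, up_proj
    + lora_params(intermediate, hidden, lora_rank)      # down_proj
)
adapter_params = per_layer * layers

print(f"effective batch: {effective_batch}")
print(f"optimizer steps: about {total_steps} ({steps_per_epoch} per epoch)")
print(f"warmup steps:    about {warmup_steps}")
print(f"LoRA parameters: {adapter_params:,}")
print(f"fp32 size:       {adapter_params * 4 / 2**20:.1f} MiB")
```

Under those assumptions the run is only a few dozen optimizer steps, which is why it finishes in minutes, and the parameter count lands in the same range as the adapter size quoted in the deployment notes.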

Evaluation

Computed on a held-out eval split via scripts/eval.py:

Metric                           Base Qwen2.5-1.5B   + LoRA adapter
Schema-valid JSON output         81.2%               100.0%
Entity micro-F1 (type + value)   0.38                0.92
Risk-level exact match           50.0%               93.8%
Needs-review exact match         68.8%               68.8%

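At this small data scale, the fine-tune locks down the schema layer (100% schema-valid output) and the German-specific identifier recognition (entity micro-F1 0.38 → 0.92). Needs-review accuracy did not change (68.8% before and after), so the adapter does not yet improve that flag. The eval split has only 16 documents, so a single document moves a percentage metric by 6.25 points. The free-form redacted text is usable at this stage and should improve with real domain data: on a client engagement the same recipe runs against 2,000-10,000 of your actual documents, which targets the long tail of ambiguous names and rare identifier formats.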
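
For readers who want to reproduce the entity metric, one plausible reading of "micro-F1 (type + value)" is sketched below: an entity counts as correct only if both its type and its value match, and counts are pooled over all documents before computing precision and recall. scripts/eval.py is not reproduced here, so treat this definition and the function name as my assumption rather than the exact scoring code.

```python
from collections import Counter


def micro_f1(gold_docs: list[list[dict]], pred_docs: list[list[dict]]) -> float:
    """Micro-averaged F1 over (type, value) pairs, pooled across documents."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        g = Counter((e["type"], e["value"]) for e in gold)
        p = Counter((e["type"], e["value"]) for e in pred)
        overlap = sum((g & p).values())  # multiset intersection
        tp += overlap
        fp += sum(p.values()) - overlap
        fn += sum(g.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [[{"type": "PERSON", "value": "Julia Weber"},
             {"type": "EMAIL", "value": "j.weber@example.de"}]]
    pred = [[{"type": "PERSON", "value": "Julia Weber"},
             {"type": "PHONE", "value": "j.weber@example.de"}]]
    # One hit, one false positive (wrong type), one miss.
    print(micro_f1(gold, pred))
```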

How to use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-1.5B-Instruct"
adapter = "renezander030/qwen-2.5-1.5b-de-pii-redactor"

tok = AutoTokenizer.from_pretrained(base)
mdl = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
mdl = PeftModel.from_pretrained(mdl, adapter).merge_and_unload()

# Same system prompt the adapter was trained on.
# See schema.py (render_system) for the exact text.
SYSTEM = "<redactor system prompt>"
USER = """Sehr geehrte Frau Schmidt,
anbei die Überweisung Ihrer Erstattung auf
IBAN DE89 3704 0044 0532 0130 00, Betrag 249,90 EUR.
Kundennummer: K-884421.
Bei Rückfragen erreichen Sie mich unter +49 30 12345678
oder j.weber@example.de.
Beste Grüße, Julia Weber"""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": USER},
]
prompt = tok.apply_chat_template(messages, tokenize=False,
                                 add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(mdl.device)
out = mdl.generate(**inputs, max_new_tokens=1200, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:],
                 skip_special_tokens=True))

Expected output shape:

{
  "redacted_text": "Sehr geehrte Frau [PERSON_1], ... IBAN [IBAN_1] ... Kundennummer: [CUSTOMER_ID_1] ... +49 [PHONE_1] oder [EMAIL_1]. Beste Grüße, [PERSON_2]",
  "entities": [
    {"type": "PERSON", "value": "Schmidt", "replacement": "[PERSON_1]"},
    {"type": "IBAN", "value": "DE89 3704 0044 0532 0130 00", "replacement": "[IBAN_1]"},
    {"type": "CUSTOMER_ID", "value": "K-884421", "replacement": "[CUSTOMER_ID_1]"},
    {"type": "PHONE", "value": "30 12345678", "replacement": "[PHONE_1]"},
    {"type": "EMAIL", "value": "j.weber@example.de", "replacement": "[EMAIL_1]"},
    {"type": "PERSON", "value": "Julia Weber", "replacement": "[PERSON_2]"}
  ],
  "risk_level": "high",
  "needs_human_review": false
}

Deployment notes

  • Footprint: ~140 MB safetensors adapter + ~11 MB tokenizer; ships as a one-folder plugin on top of any Qwen2.5-1.5B host.
  • 4-bit (bitsandbytes nf4) brings the merged inference footprint to ~1 GB on a 24 GB consumer GPU; on A40 / A100 the bf16 path is faster.
  • Batch throughput: swap the model.generate() call for vLLM with the LoRA adapter loaded; expect 5-10× throughput for incoming ticket queues.
  • Pydantic validation at the boundary makes downstream pipelines fail-fast on schema drift. Pair with a simple re-ask on validation failure.
  • On-prem: no data leaves your infrastructure. This is the point of running a small redactor instead of routing everything through a frontier API.
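
The "re-ask on validation failure" pattern from the notes above could look roughly like this. It is a stdlib-only sketch: parse_result stands in for the Pydantic validation (in a real pipeline you would call RedactionResult.model_validate_json at that point), and generate is a hypothetical callable that wraps whatever inference path you use.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"redacted_text", "entities", "risk_level", "needs_human_review"}
RISK_LEVELS = {"low", "medium", "high"}


def parse_result(raw: str) -> dict:
    """Minimal stand-in for schema validation; raises ValueError on failure."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["risk_level"] not in RISK_LEVELS:
        raise ValueError(f"unexpected risk_level: {data['risk_level']!r}")
    return data


def redact_with_reask(
    generate: Callable[[str], str],
    document: str,
    max_attempts: int = 2,
) -> Optional[dict]:
    """Call the model up to max_attempts times; return None if all fail."""
    for _ in range(max_attempts):
        try:
            return parse_result(generate(document))
        except ValueError:
            continue  # re-ask with the same input
    return None
```

With greedy decoding (do_sample=False) a plain retry returns the same text, so in practice the second attempt needs something to change, for example enabling sampling or appending the validation error to the prompt. A None result is a natural point to route the document to human review.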

Limitations

  • Small synthetic training set — schema layer and German-specific identifier recognition lock in at this size, but long-tail names, compound German surnames, and rare format variants need real data.
  • Synthetic share is 100% in this open-source release; real business documents will expose failure modes around homonymy (Person vs. product name) and institutional identifiers that share formats with PII (e.g. contract numbers that look like customer IDs).
  • Redaction is type + value + replacement, not character offsets. If your pipeline requires offsets, reconstruct them from value occurrences in the input text.
  • The model sees 2048 tokens at a time. For long contracts, chunk with a small overlap and merge entity lists.
  • Not a compliance certification. This adapter helps your pipeline redact consistently and fast; your DPO still owns the legal call.
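
Two of the workarounds above, rebuilding character offsets from values and chunking long inputs with overlap, can be sketched as follows. All three helper names are hypothetical. The chunker works on any token list (use the tokenizer's ids in practice), and the merge step renumbers placeholders because each chunk would otherwise restart at [PERSON_1].

```python
def find_offsets(text: str, value: str) -> list[tuple[int, int]]:
    """All non-overlapping (start, end) character offsets of value in text."""
    spans: list[tuple[int, int]] = []
    if not value:
        return spans
    start = 0
    while (idx := text.find(value, start)) != -1:
        spans.append((idx, idx + len(value)))
        start = idx + len(value)
    return spans


def chunk_with_overlap(tokens: list, size: int, overlap: int) -> list[list]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    if not tokens:
        return []
    step = size - overlap
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, stop, step)]


def merge_entities(chunk_entities: list[list[dict]]) -> list[dict]:
    """Deduplicate by (type, value) and assign document-wide placeholders."""
    merged: dict[tuple[str, str], dict] = {}
    counters: dict[str, int] = {}
    for entities in chunk_entities:
        for ent in entities:
            key = (ent["type"], ent["value"])
            if key in merged:
                continue
            counters[ent["type"]] = counters.get(ent["type"], 0) + 1
            merged[key] = {
                "type": ent["type"],
                "value": ent["value"],
                "replacement": f"[{ent['type']}_{counters[ent['type']]}]",
            }
    return list(merged.values())
```

After merging, the per-chunk redacted_text no longer matches the new placeholder numbers, so the simplest route is to re-apply the merged replacements to the original text using the offsets from find_offsets. An entity split across a chunk boundary can still be missed if the overlap is shorter than the entity.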

Work with me

This adapter is a public reference of the recipe I deliver to freelance clients: small, fast, GDPR-clean, structured-output LLMs trained on the domain data you already have.

If you need one of these, I can build it:

  • a PII redactor trained on your own German documents (support tickets, CRM, contracts, medical, legal) for higher recall on your actual terminology
  • a private LLM deployment on your infrastructure, or a dedicated cloud GPU endpoint
  • a structured-output agent pipeline (LangGraph, Pydantic-validated, human-in-the-loop routing)
  • an evaluation harness that tells you when the model is actually good enough to ship to production

License

Adapter weights: Apache 2.0 (matches the Qwen2.5 base). Training scripts in the companion repo: MIT.

Citation

@misc{zander2026qwen15bdepiiredactor,
  author       = {Zander, Rene},
  title        = {qwen-2.5-1.5b-de-pii-redactor: a LoRA adapter
                  for DSGVO-konform PII redaction of German
                  business documents with structured JSON output},
  year         = {2026},
  howpublished = {HuggingFace Hub},
  url          = {https://huggingface.co/renezander030/qwen-2.5-1.5b-de-pii-redactor},
}