gheim-ch-560m

A multilingual token-classification model for detecting personally identifiable information (PII) in the four official Swiss languages (de_CH, fr_CH, it_CH, rm) and English. The model is a fine-tune of FacebookAI/xlm-roberta-large on joelbarmettler/gheim-ch-pii-171k. The output schema is a 33-class BIOES tag set (B, I, E, and S tags for each of 8 PII categories, plus the outside class O), with category names aligned to those used by openai/privacy-filter.

Parameters            560M
Languages             de_CH, fr_CH, it_CH, rm, en
Categories            account_number, private_address, private_date, private_email, private_person, private_phone, private_url, secret
Tag scheme            BIOES (33 classes)
Max sequence length   512
License               Apache 2.0
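
The class count follows from the tag scheme: four positional tags (B, I, E, S) for each of the 8 categories, plus one outside tag. The sketch below enumerates such a label set. The label string format (`B-private_person`) and the ordering are assumptions made for illustration; the authoritative mapping is whatever the checkpoint's own configuration declares.

```python
# Category names are taken from the table above.
CATEGORIES = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
]

def build_bioes_labels(categories):
    """Enumerate a BIOES label set: 'O' plus B/I/E/S for each category.

    The 'PREFIX-category' spelling and the ordering are assumptions,
    not read from the published checkpoint.
    """
    labels = ["O"]
    for category in categories:
        for prefix in ("B", "I", "E", "S"):
            labels.append(f"{prefix}-{category}")
    return labels

LABELS = build_bioes_labels(CATEGORIES)
print(len(LABELS))  # 1 + 8 * 4 = 33
```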

Full report: the curation pipeline, training procedure, comparison against seven other PII / NER systems, cross-domain results on four external benchmarks, methodology validation, and an extended related-work section are documented in paper/paper.pdf (arXiv preprint forthcoming). This card is the deployment-facing summary.

Intended use

The model classifies character-level spans of PII so that text can be redacted prior to transmission to systems where personal data should not appear (for example, third-party LLM APIs hosted outside the data subject's jurisdiction). Output spans are intended for substitution or masking, not for entity linking or re-identification. The training data follows a recall-oriented labelling policy under which publicly-listed institutional information (e.g. court switchboard numbers, parliament email addresses, public-official names) is flagged as PII. Applications requiring stricter precision should pair model output with downstream filtering.
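
As an illustration of the masking step, the following minimal sketch replaces detected character spans with category placeholders. It assumes spans arrive as dictionaries with `start`, `end`, and `entity_group` keys, the shape produced by the Hugging Face token-classification pipeline. The `redact` helper and the placeholder format are hypothetical and are not part of the gheim packages.

```python
def redact(text, spans, template="<{label}>"):
    """Replace each character span with a placeholder built from its category.

    Hypothetical helper: spans are dicts with 'start', 'end' (exclusive)
    and 'entity_group'. Overlapping spans are skipped after the first.
    """
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["start"] < cursor:
            continue  # overlaps a span that was already masked
        pieces.append(text[cursor:span["start"]])
        pieces.append(template.format(label=span["entity_group"].upper()))
        cursor = span["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)

# Hand-written spans stand in for model output here.
demo_text = "Call Joel at +41 44 268 12 34."
person_at = demo_text.index("Joel")
phone_at = demo_text.index("+41")
demo_spans = [
    {"start": phone_at, "end": phone_at + len("+41 44 268 12 34"),
     "entity_group": "private_phone"},
    {"start": person_at, "end": person_at + len("Joel"),
     "entity_group": "private_person"},
]
print(redact(demo_text, demo_spans))
```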

Usage

Recommended: gheim SDKs (round-trip with sentinel restoration)

For the typical use case — anonymise text, send to an LLM, restore the originals on the way back — install the gheim Python or gheim npm package. This model is the default detector in both, and the wrappers handle sentinel allocation, streaming-aware decode, multi-turn coherent sessions, and a drop-in OpenAI client.

pip install "gheim[local,openai]"        # Python
npm install gheim openai @huggingface/transformers   # JS / TS

# Python: drop-in OpenAI client. Defaults to gheim-ch-560m.
from gheim.openai import OpenAI

client = OpenAI()  # accepts the same kwargs as openai.OpenAI
r = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Hi, my name is Joel. My phone is +41 44 268 12 34."}],
)
# r.choices[0].message.content has the original PII restored.
# OpenAI only ever saw "<PERSON_1>" and "<PHONE_1>".

// JS / TS: same idea.
import { OpenAI } from "gheim/openai";

const client = new OpenAI();  // accepts the same opts as openai's OpenAI
const r = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user",
               content: "Hi, my name is Joel. My phone is +41 44 268 12 34." }],
});

Streaming, async, tool calls, and 9 other text-carrying endpoints (responses, embeddings, moderations, audio.*, images.*) are wrapped automatically. Full surface in the package READMEs: Python · JS.
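
To make the round-trip idea concrete, here is a conceptual sketch of sentinel allocation and restoration. It is not the gheim implementation. The function names, the rule for deriving sentinel names (strip a `private_` prefix, uppercase, append a counter), and the span dictionary shape are all assumptions chosen so that the output resembles the `<PERSON_1>` and `<PHONE_1>` sentinels mentioned above.

```python
def anonymise(text, spans):
    """Swap each span for a numbered sentinel; return (masked_text, mapping).

    Hypothetical sketch: a repeated surface string within one category
    reuses the same sentinel, so references stay coherent.
    """
    counters, by_value, mapping = {}, {}, {}
    pieces, cursor = [], 0
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["start"] < cursor:
            continue  # skip overlapping spans
        surface = text[span["start"]:span["end"]]
        label = span["entity_group"]
        if label.startswith("private_"):
            label = label[len("private_"):]
        label = label.upper()
        key = (label, surface)
        if key not in by_value:
            counters[label] = counters.get(label, 0) + 1
            sentinel = f"<{label}_{counters[label]}>"
            by_value[key] = sentinel
            mapping[sentinel] = surface
        pieces.append(text[cursor:span["start"]])
        pieces.append(by_value[key])
        cursor = span["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), mapping

def restore(text, mapping):
    """Put the original strings back wherever a sentinel appears."""
    for sentinel, surface in mapping.items():
        text = text.replace(sentinel, surface)
    return text

# Hand-written spans stand in for detector output.
message = "Hi, my name is Joel. My phone is +41 44 268 12 34."
name_at, phone_at = message.index("Joel"), message.index("+41")
detected = [
    {"start": name_at, "end": name_at + 4, "entity_group": "private_person"},
    {"start": phone_at, "end": phone_at + 16, "entity_group": "private_phone"},
]
masked, sentinel_map = anonymise(message, detected)
print(masked)
print(restore("Hello <PERSON_1>, I will call <PHONE_1>.", sentinel_map))
```

A plain `str.replace` restore like this one does not handle streamed responses in which a sentinel is split across chunks; the packages are described above as handling that case.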

Alternative: raw transformers / transformers.js

If you only need a token classifier (no sentinel round-trip), use the HuggingFace pipelines directly.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo = "joelbarmettler/gheim-ch-560m"
tok = AutoTokenizer.from_pretrained(repo)
mdl = AutoModelForTokenClassification.from_pretrained(repo)
ner = pipeline("token-classification", model=mdl, tokenizer=tok,
               aggregation_strategy="simple")

text = ("Bitte überweisen Sie an Müller AG, IBAN CH9300762011623852957, "
        "Werdstrasse 36, 8004 Zürich.")
for span in ner(text):
    print(f"{span['entity_group']:<18} {span['score']:.2f} {span['word']!r}")

// Node / Bun / browser via @huggingface/transformers (transformers.js).
// The Hub repo only ships the int8 ONNX, so pass dtype: "q8".
import { pipeline } from "@huggingface/transformers";

const ner = await pipeline("token-classification",
  "joelbarmettler/gheim-ch-560m",
  { aggregation_strategy: "simple", dtype: "q8" });
const out = await ner("Email me at alice@example.ch, phone +41 44 268 12 34.");
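
If you work with raw per-token tags instead of aggregated pipeline output, the BIOES sequence has to be collapsed into spans yourself. The sketch below shows one strict way to do that over token indices. The `PREFIX-category` tag spelling is an assumption, and ill-formed sequences (for example an I tag without a preceding B) are simply dropped, which is a design choice of this sketch and may differ from what a given pipeline does.

```python
def bioes_to_spans(tags):
    """Collapse a BIOES tag sequence into (start, end_exclusive, category) spans.

    Indices refer to token positions. Only well-formed S or B..I*..E runs
    with a consistent category are emitted.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, category))
            start, current = None, None
        elif prefix == "B":
            start, current = i, category
        elif prefix == "I" and current == category:
            continue  # still inside the open span
        elif prefix == "E" and current == category:
            spans.append((start, i + 1, category))
            start, current = None, None
        else:
            # "O", or a tag that does not continue the open span
            start, current = None, None
    return spans

example_tags = [
    "O", "B-private_person", "E-private_person", "O", "S-private_email",
    "B-private_phone", "I-private_phone", "E-private_phone",
]
print(bioes_to_spans(example_tags))
```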

Performance

Strict-span F1 (seqeval) on the held-out test split of joelbarmettler/gheim-ch-pii-171k (15,861 chunks, document-isolated from the training split for the real-text portion). The test set was scored once.

Metric      Test    Validation
F1          0.916   0.918
Precision   0.907   0.908
Recall      0.926   0.926
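
Under strict-span scoring, a prediction counts only if both its boundaries and its category match a gold span exactly. The sketch below computes precision, recall, and F1 that way from sets of `(start, end, category)` tuples. It illustrates the metric's logic and is not the seqeval code path used for the numbers above.

```python
def strict_span_prf(gold_spans, pred_spans):
    """Micro precision, recall and F1 with exact boundary + category match."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold_example = {(0, 4, "private_person"), (10, 20, "private_phone")}
pred_example = {
    (0, 4, "private_person"),   # exact match
    (10, 19, "private_phone"),  # boundary off by one: counts as an error
    (30, 35, "private_url"),    # spurious prediction
}
p, r, f1 = strict_span_prf(gold_example, pred_example)
print(round(p, 3), round(r, 3), round(f1, 3))
```

The off-by-one phone span earns no credit here, which is why boundary errors weigh heavily on strict-span scores.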

Per-language × per-category char-level F1 on the same test split. Body cells are char-level F1 for each (language, category) pair; the right-most column gives the per-category average over languages (gold-weighted), and the bottom row gives the per-language average over categories. The bottom-right cell is the overall char F1.

Category          de_CH   fr_CH   it_CH   rm      en      Avg.
account_number    0.932   0.717   0.660   0.169   0.994   0.765
private_address   0.890   0.870   0.915   0.825   0.973   0.889
private_date      0.943   0.934   0.952   0.888   0.909   0.939
private_email     0.988   0.994   0.999   0.992   0.999   0.994
private_person    0.938   0.948   0.962   0.897   0.951   0.944
private_phone     0.989   0.985   0.993   0.995   0.997   0.990
private_url       0.992   0.994   0.994   0.958   0.980   0.992
secret            0.999   1.000   0.999   1.000   1.000   1.000
Avg.              0.954   0.953   0.972   0.923   0.981   0.958
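
Char-level scoring gives partial credit for partially overlapping spans, which is why these figures can sit above strict-span figures for the same predictions. The sketch below shows one plausible definition: per category, compare the sets of character positions covered by gold and predicted spans. This definition is an assumption for illustration; the exact procedure behind the table is the one described in the paper.

```python
def char_f1(gold_spans, pred_spans, category):
    """F1 over character positions covered by spans of one category.

    Spans are (start, end_exclusive, category) tuples over the same text.
    """
    def covered(spans):
        return {
            i
            for start, end, cat in spans
            if cat == category
            for i in range(start, end)
        }

    gold, pred = covered(gold_spans), covered(pred_spans)
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# A prediction that misses the last character of a 10-character phone number.
gold_chars = [(10, 20, "private_phone")]
pred_chars = [(10, 19, "private_phone")]
score = char_f1(gold_chars, pred_chars, "private_phone")
print(round(score, 3))
```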

For the full comparison against seven other open PII / NER systems on the same Swiss test set, cross-domain evaluation on four external benchmarks (AI4Privacy openpii-1m, ZurichNLP swissner, CoNLL-2003, Babelscape WikiNeural), and the methodology-validation reproductions of each baseline's published numbers, see paper/paper.pdf §4 and the machine-readable matrix at eval/positioning_matrix.json.

Deployment formats

The model is published in two formats:

  • model.safetensors (root): fp32 PyTorch checkpoint, 2.2 GB, intended for server-side inference via transformers.
  • onnx/model_quantized.onnx: int8 dynamic-quantised ONNX export, 552 MB, intended for in-browser inference via @huggingface/transformers on WebGPU or WebAssembly. Selected with dtype: "q8".

Format                Size     Test F1   Δ vs fp32
PyTorch fp32          2.2 GB   0.916     (baseline)
ONNX int8 (dynamic)   552 MB   0.909     -0.7 pp

The int8 export loses 0.7 pp overall F1 relative to the PyTorch checkpoint; the loss is concentrated in private_address (-2.1 pp) and account_number (-5.3 pp), the two categories that were already weakest under fp32. All five languages are affected uniformly within ±0.001 pp of the overall drop. Per-category and per-language quantisation deltas are tabulated in paper/paper.pdf §3.3.

Training procedure

The base model was selected in a controlled bake-off between FacebookAI/xlm-roberta-large and ZurichNLP/swissbert (270M dense). Each candidate received an identical 5 × 3 sweep over (learning rate, layer-wise LR decay) at 1 epoch; the winning configuration per base model was then trained for 3 full epochs, and the checkpoint with the best validation F1 was kept. xlm-roberta-large won the bake-off (validation F1 0.918 vs. 0.910 for swissbert).

Selected configuration: AdamW, peak LR 5e-5 with a cosine schedule and 5% warmup, no layer-wise LR decay (LLRD), effective batch size 128 (per-device batch 64 × 2 GPUs, DDP), bf16, 3 epochs, max sequence length 512. The best checkpoint was at epoch 2.50 by validation overall_f1. Wall time was ≈ 52 min training + 4 min evaluation on 2 × RTX 4090. The full procedure, including the hyperparameter sweep results, is in paper/paper.pdf §3.

The training data was the train split of gheim-ch-pii-171k (139,641 chunks) plus an English-anchor and Swiss-region email rescue slice from ai4privacy/pii-masking-openpii-1m (≈ 14,000 chunks); validation and test contain no AI4Privacy data.
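
For orientation, the schedule implied by these numbers can be sketched as follows. The sketch assumes linear warmup followed by cosine decay to zero, a partial final batch that still counts as a step, and the approximate figure of 14,000 extra chunks; none of these details are confirmed by the card, so the step counts are estimates only.

```python
import math

def lr_at(step, total_steps, peak_lr=5e-5, warmup_frac=0.05):
    """Learning rate at a given optimiser step.

    Assumed shape: linear warmup to peak_lr, then cosine decay to zero.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

train_chunks = 139_641 + 14_000   # second term is approximate in the card
effective_batch = 64 * 2          # per-device batch x number of GPUs
steps_per_epoch = math.ceil(train_chunks / effective_batch)
total_steps = 3 * steps_per_epoch

print(steps_per_epoch, total_steps)
print(lr_at(int(total_steps * 0.05), total_steps))  # peak, right after warmup
```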

Limitations

  1. Recall-oriented labelling policy. The model inherits the dataset's policy of flagging publicly-listed institutional contact information. Applications needing stricter precision should apply downstream filtering or a private-vs-public-entity post-classifier.
  2. private_address test F1 is 0.78. Boundary placement on multi-token addresses is the dominant error mode.
  3. account_number test F1 is 0.67. For production use, pair the model with the regex front-end documented in the gheim library, which applies checksum validation (IBAN, AHV, VAT-CHE, Luhn).
  4. Romansh test F1 is 0.85, the weakest of the five languages. The RM training material is dominated by a single literary/journalistic register; performance on dialectal or technical RM text is unmeasured.
  5. Swiss German dialect (GSW) is not measured. The fasttext detector used in data preparation labels GSW as standard German.
  6. Re-identification is not in scope. The model is intended for redaction; it does not return entity-linked identifiers.
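
Regarding item 3, checksum validation is a cheap way to confirm or reject an account_number candidate. The sketch below implements two of the checks named there, the IBAN mod-97 test and the Luhn test, as standalone functions. It is not the gheim regex front-end, the function names are hypothetical, and AHV and VAT-CHE validation are left out.

```python
def iban_is_valid(candidate):
    """IBAN mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test remainder == 1.
    Country-specific length rules are not checked in this sketch.
    """
    iban = "".join(candidate.split()).upper()
    if len(iban) < 5 or not (iban.isascii() and iban.isalnum()):
        return False
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

def luhn_is_valid(candidate):
    """Luhn check: double every second digit from the right,
    subtract 9 from results above 9, and test sum % 10 == 0.
    """
    digits = [int(ch) for ch in candidate if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

# The IBAN from the usage example above passes the check.
print(iban_is_valid("CH9300762011623852957"))
print(luhn_is_valid("79927398713"))
```

How to combine such checks with model output (for instance, treating a checksum-valid regex match as PII even when the model missed it) is an application-level decision.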

License

Apache 2.0, inherited from the base model FacebookAI/xlm-roberta-large. The training data (joelbarmettler/gheim-ch-pii-171k) is released under CC BY 4.0; attribution to its upstream corpora (the swiss-ai/apertus-pretrain-* datasets) is required when reusing the data.

Citation

@misc{barmettler2026gheim_ch_560m,
  title  = {gheim-ch-560m: A multilingual PII detection model for the Swiss market},
  author = {Joel Barmettler},
  year   = {2026},
  url    = {https://huggingface.co/joelbarmettler/gheim-ch-560m}
}

If the model is used in published work, please also cite the dataset:

@misc{barmettler2026gheim_ch_pii,
  title  = {gheim-ch-pii-171k: A Swiss-grounded PII NER dataset with synthetic gap-fill},
  author = {Joel Barmettler},
  year   = {2026},
  url    = {https://huggingface.co/datasets/joelbarmettler/gheim-ch-pii-171k}
}

Maintainer

Joel Barmettler · jbarmettler@proton.me · joelbarmettler.xyz · github.com/joelbarmettlerUZH/gheim

Source code, issue tracker, and the wider gheim ecosystem (Python and Node libraries, redaction server, composite detector) are at github.com/joelbarmettlerUZH/gheim.
