Multilingual DialogPII NER

A fine-tuned jhu-clsp/mmBERT-base model with a CRF layer for Personally Identifiable Information (PII) detection in multilingual dialogues across 11 languages.

Model Description

This model performs token-level Named Entity Recognition (NER) to identify and classify PII entities in dialogue text. It was trained on synthetic multilingual conversational data created for de-identification.

  • Architecture: mmBERT-base (ModernBERT) + CRF head with FLERT context windowing
  • Training: Fine-tuned on all 11 languages jointly (multilingual training) using FLERT-style document context
  • Loss: Cross-Entropy
  • Hyperparameters: lr=2e-05, batch_size=32, max_length=2048, dropout=0.1, epochs=10
  • Context window: 2 sentences left + 2 sentences right, separated by [SEP] markers
  • Decoding: Viterbi decoding via CRF layer
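
The decoding step listed above can be illustrated with a small pure-Python Viterbi sketch. This is not the pytorch-crf implementation: the function name `viterbi`, the three-tag set, and the hand-set transition scores are assumptions made for illustration, whereas the real model learns its transition scores during training. The example shows how a transition penalty lets Viterbi avoid an invalid O -> I sequence that per-token argmax would produce.

```python
def viterbi(emissions, transitions, start, end):
    """Return the highest-scoring tag path.

    emissions:   list of T rows, each with K per-tag scores
    transitions: K x K scores, transitions[i][j] = score of moving from tag i to tag j
    start, end:  K scores for starting / ending in each tag
    """
    num_tags = len(start)
    # score[j] = best score of any path that ends in tag j at the current step
    score = [start[j] + emissions[0][j] for j in range(num_tags)]
    backpointers = []
    for emit in emissions[1:]:
        step_ptrs, new_score = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            step_ptrs.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
        backpointers.append(step_ptrs)
        score = new_score
    # Pick the best final tag, then follow the backpointers to the start
    last = max(range(num_tags), key=lambda j: score[j] + end[j])
    path = [last]
    for step_ptrs in reversed(backpointers):
        path.append(step_ptrs[path[-1]])
    return path[::-1]


# Hypothetical toy setup: tags 0 = O, 1 = B-PERSON, 2 = I-PERSON
emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -100.0],  # O -> I is heavily penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
start = [0.0, 0.0, -100.0]          # a sequence should not start with I
end = [0.0, 0.0, 0.0]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
print(greedy)                                        # [0, 2, 2]  (O I I, invalid)
print(viterbi(emissions, transitions, start, end))   # [0, 1, 2]  (O B I)
```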

Supported Languages

Code Language
AR Arabic
DE German
EN English
FI Finnish
FR French
HI Hindi
IT Italian
PL Polish
PT Portuguese
SP Spanish
TR Turkish

Entity Types

The model recognizes 19 PII entity types using BIO tagging:

Entity Description
PERSON Person names
PERSON_EMAIL Email addresses
PERSON_SOCIAL_RELATION Social relations (e.g., "my wife")
ORG Organizations
LOC_CITY Cities
LOC_COUNTRY Countries
LOC_STREET Street names
LOC_ZIP ZIP/postal codes
LOC_HOUSENUMBER House numbers
LOC_OTHER Other locations
DATETIME Dates and times
DATETIME_AGE Ages
CODE ID numbers, reference codes
CODE_PHONE Phone numbers
CODE_URL URLs
PROFESSION Professions
PRODUCT Product names
QUANTITY Quantities
MISC Miscellaneous PII
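
As a minimal sketch of how BIO tags map back to entities, the helper below groups word-level labels into (type, start, end) spans. It assumes labels are written as `B-<TYPE>` / `I-<TYPE>` / `O` (for example `B-PERSON`), which is the usual BIO convention but is not spelled out above; the function name `bio_to_spans` is hypothetical.

```python
def bio_to_spans(labels):
    """Group BIO labels into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, ent_type = None, None
    # The trailing "O" is a sentinel that flushes a span ending at the last token
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, cur_type = label.partition("-")
        continues = prefix == "I" and cur_type == ent_type
        if start is not None and not continues:
            spans.append((ent_type, start, i))
            start, ent_type = None, None
        # A stray I- tag without a preceding B- is treated leniently as a span start
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, cur_type
    return spans


# Illustrative labels, not actual model output
labels = ["O", "B-PERSON", "I-PERSON", "O", "B-LOC_CITY"]
print(bio_to_spans(labels))  # [('PERSON', 1, 3), ('LOC_CITY', 4, 5)]
```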

Performance

Evaluated on held-out test sets per language (type-aware micro scores):

Language  Len P  Len R  Len F1  Len F2  Ex P   Ex R   Ex F1  Ex F2
AR        87.87  73.15  79.84   75.69   84.45  70.30  76.73  72.74
DE        94.12  90.66  92.36   91.33   93.33  89.90  91.58  90.56
EN        94.93  93.45  94.18   93.74   92.41  90.97  91.69  91.25
FI        91.36  88.46  89.89   89.03   89.93  87.07  88.48  87.63
FR        90.91  88.09  89.48   88.64   87.66  84.94  86.28  85.47
HI        87.55  82.33  84.86   83.33   83.37  78.40  80.81  79.35
IT        93.57  87.81  90.60   88.90   90.72  85.13  87.84  86.19
PL        90.11  90.31  90.21   90.27   87.41  87.61  87.51  87.57
PT        91.10  90.69  90.90   90.77   89.28  88.88  89.08  88.96
SP        93.06  91.47  92.26   91.79   91.30  89.74  90.51  90.05
TR        89.13  86.53  87.81   87.04   85.79  83.29  84.52  83.78
AVG       91.25  87.54  89.31   88.23   88.70  85.11  86.82  85.78
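
The F1 and F2 columns follow from precision and recall through the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 2 weights recall more heavily. The sketch below recomputes the Arabic `Len` scores from the rounded table values; the helper name `f_beta` is hypothetical. The published scores were presumably computed from raw counts, so other rows may differ in the last digit.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Arabic row: Len P = 87.87, Len R = 73.15
p, r = 87.87, 73.15
print(round(f_beta(p, r, beta=1), 2))  # 79.84
print(round(f_beta(p, r, beta=2), 2))  # 75.69
```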

Usage

This model uses a custom CRF architecture with FLERT-style context windowing and cannot be loaded directly with AutoModelForTokenClassification. You need to use the custom ModernBertCRF class.

Note: The config.json in this repo exists solely for Hugging Face download tracking. For model loading, use crf_config.json and flert_config.json instead.

Setup

import torch
import json
import re
import torch.nn as nn
import spacy
from transformers import AutoModel, AutoTokenizer
from torchcrf import CRF  # provided by the pytorch-crf package
from huggingface_hub import snapshot_download

class ModernBertCRF(nn.Module):
    def __init__(self, base_model_name, num_labels, id2label, label2id):
        super().__init__()
        self.num_labels = num_labels
        self.id2label = id2label
        self.label2id = label2id
        self.transformer = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.transformer.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        kwargs.pop("token_type_ids", None)
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs.last_hidden_state)
        emissions = self.classifier(sequence_output)
        return {"logits": emissions}

    def decode(self, emissions, mask):
        return self.crf.decode(emissions, mask=mask)

# Load model
model_dir = snapshot_download("DFKI-SLT/multilingual_DialogPII_NER")

with open(f"{model_dir}/crf_config.json") as f:
    config = json.load(f)

with open(f"{model_dir}/flert_config.json") as f:
    flert_config = json.load(f)

model = ModernBertCRF(
    base_model_name=config["base_model_name"],
    num_labels=config["num_labels"],
    id2label=config["id2label"],
    label2id=config["label2id"],
)
model.load_state_dict(torch.load(f"{model_dir}/pytorch_model.bin", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_dir)
id2label = {int(k): v for k, v in config["id2label"].items()}

context_window = flert_config["context_window"]     # 2
use_sep_marker = flert_config["context_sep_marker"]  # True

Preprocessing: Sentence Splitting

The model was trained using FLERT-style context windowing over sentence-level input. Each sentence is predicted with surrounding context sentences. For best results, split your input into sentences using spaCy before inference.

nlp = spacy.blank("en")          # use "de" for German, "xx" for multilingual
nlp.add_pipe("sentencizer")

def split_dialogue(text, nlp):
    sentences = []
    for line in text.strip().splitlines():
        m = re.match(r"^(SPEAKER_\d+)\s*:\s*(.*)", line.strip())
        if m:
            speaker, rest = m.group(1), m.group(2)
            sentences.append([speaker, ":"])
            line = rest
        if not line:
            continue
        doc = nlp(line)
        for sent in doc.sents:
            tokens = [tok.text for tok in sent if not tok.is_space]
            if tokens:
                sentences.append(tokens)
    return sentences

# Example
raw = """SPEAKER_00: Hello, my name is Peter.
SPEAKER_01: Hello, my name is Peter as well. Okay, and where do you come from? I come from Chicago."""

sentences = split_dialogue(raw, nlp)

Inference with FLERT Context Windowing

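The key difference from standard token classification is that each sentence is predicted together with up to context_window sentences on either side. When use_sep_marker is set, one [SEP] token separates the left context from the target sentence and another separates the target from the right context. Only the labels for the target sentence are kept.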

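def predict_dialogue(sentences, model, tokenizer, id2label,
                     context_window=2, use_sep_marker=True, device="cpu"):
    """Label each tokenized sentence, using up to `context_window` neighbors per side."""
    model.to(device)  # keep the model on the same device as the encoded inputs
    sep = tokenizer.sep_token
    all_labels = []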
    for i, target_tokens in enumerate(sentences):
        left  = sentences[max(0, i - context_window):i]
        right = sentences[i + 1:i + 1 + context_window]

        flat_tokens = []
        for s in left:
            flat_tokens.extend(s)
        if use_sep_marker and left:
            flat_tokens.append(sep)

        tgt_start = len(flat_tokens)
        flat_tokens.extend(target_tokens)
        tgt_end = len(flat_tokens)

        if use_sep_marker and right:
            flat_tokens.append(sep)
        for s in right:
            flat_tokens.extend(s)

        enc = tokenizer(flat_tokens, is_split_into_words=True,
                        return_tensors="pt", truncation=False).to(device)
        word_ids = enc.word_ids(batch_index=0)

        with torch.no_grad():
            emissions = model(**enc)["logits"]
            mask = enc["attention_mask"].bool()
            preds = model.decode(emissions, mask)[0]

        word_labels = ["O"] * len(target_tokens)
        seen = set()
        for idx, wid in enumerate(word_ids):
            if wid is None or wid in seen:
                continue
            seen.add(wid)
            if tgt_start <= wid < tgt_end:
                word_labels[wid - tgt_start] = id2label[preds[idx]]

        all_labels.append(word_labels)
    return all_labels


# Run prediction
results = predict_dialogue(sentences, model, tokenizer, id2label,
                           context_window=context_window,
                           use_sep_marker=use_sep_marker)

for sent_tokens, sent_labels in zip(sentences, results):
    for token, label in zip(sent_tokens, sent_labels):
        if label != "O":
            print(f"{token:20s} -> {label}")
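
Once word-level labels are available, a de-identification step can replace each detected entity with a placeholder. The helper below is a minimal sketch of that idea and is not part of the released code: the name `redact`, the `[TYPE]` placeholder format, and the `B-<TYPE>` / `I-<TYPE>` label spelling are assumptions, and the labels in the example are written by hand rather than produced by the model.

```python
def redact(tokens, labels):
    """Replace each labeled entity with a single [TYPE] placeholder."""
    out = []
    prev_type = None
    for token, label in zip(tokens, labels):
        if label == "O":
            out.append(token)
            prev_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        # A B- tag, or an I- tag that does not continue the previous type,
        # opens a new placeholder; continuation tokens are dropped
        if prefix == "B" or ent_type != prev_type:
            out.append(f"[{ent_type}]")
        prev_type = ent_type
    return out


# Hand-written labels for illustration only
tokens = ["My", "name", "is", "John", "Smith", "and", "I", "live", "in", "Berlin", "."]
labels = ["O", "O", "O", "B-PERSON", "I-PERSON", "O", "O", "O", "O", "B-LOC_CITY", "O"]
print(" ".join(redact(tokens, labels)))
# My name is [PERSON] and I live in [LOC_CITY] .
```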

Single-sentence inference

For isolated sentences without dialogue context, pass them with context_window=0:

tokens = ["My", "name", "is", "John", "Smith", "and", "I", "live", "in", "Berlin", "."]

results = predict_dialogue([tokens], model, tokenizer, id2label,
                           context_window=0, use_sep_marker=False)

for token, label in zip(tokens, results[0]):
    if label != "O":
        print(f"{token:20s} -> {label}")

Training Data

The model was trained on synthetic multilingual dialogue data covering various domains (medical anamnesis, customer support, police reports, therapy sessions, etc.). The data was generated and annotated as part of a thesis project on multilingual PII de-identification.

Limitations

  • Trained on synthetic dialogue data; performance on real-world data may vary
  • Optimized for dialogue/conversational text; may underperform on formal documents
  • Arabic (Ex F1 76.73) and Hindi (Ex F1 80.81) score below the other nine languages (Ex F1 84.52 to 91.69)
  • Requires the pytorch-crf package (imported as torchcrf) for inference

Citation

If you use this model, please cite:

@misc{roller2026multilingual,
  title={DialogPII: A multilingual dataset of synthetic dialog transcripts to detect personal information},
  author={Roland Roller and Vera Czehmann and Derya Erman and Luke Flanagan and Ibrahim Baroud and Fr{\'e}d{\'e}ric Blain and Viviana Cotik and Eletta Giusto and Akhil Juneja and Mariana Neves and Maria S{\l}owi{\'n}ska and Christine Hovhannisyan and Aaron Louis Eidt and Lisa Raithel and Sebastian M{\"o}ller and Maija Poikela},
  year={2026},
  institution={DFKI SLT}
}