Multilingual DialogPII NER

A fine-tuned jhu-clsp/mmBERT-base model with a CRF layer for Personally Identifiable Information (PII) detection in multilingual dialogues across 11 languages.

Model Description

This model performs token-level Named Entity Recognition (NER) to identify and classify PII entities in dialogue text. It was trained on synthetic multilingual conversational data created for de-identification.

Architecture: mmBERT-base (ModernBERT) + CRF head with FLERT context windowing
Training: Fine-tuned on all 11 languages jointly (multilingual training) using FLERT-style document context
Loss: Cross-Entropy
Hyperparameters: lr=2e-05, batch_size=32, max_length=2048, dropout=0.1, epochs=10
Context window: 2 sentences left + 2 sentences right, separated by [SEP] markers
Decoding: Viterbi decoding via CRF layer
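
To make the decoding step concrete, here is a minimal pure-Python sketch of Viterbi decoding for a linear-chain CRF. It is an illustration only, not the pytorch-crf implementation used by this model. The function name, the parameter names, and all toy scores are assumptions made for the example.

```python
def viterbi_decode(emissions, transitions, start, end):
    """Return the highest-scoring tag sequence for one input.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    start[k], end[k]  : scores for starting / ending in tag k
    """
    num_tags = len(start)
    score = [start[k] + emissions[0][k] for k in range(num_tags)]
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(num_tags):
            # Best previous tag for reaching tag k at position t
            best_prev = max(range(num_tags),
                            key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k]
                             + emissions[t][k])
        score = new_score
        backpointers.append(pointers)
    # Pick the best final tag, then follow the backpointers to the start
    last = max(range(num_tags), key=lambda k: score[k] + end[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy setup (made-up scores): tags 0 = O, 1 = B, 2 = I
emissions = [[1.0, 0.8, 0.0], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -100.0],   # O -> I is strongly penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
start = [0.0, 0.0, -100.0]           # a sequence should not start with I
end = [0.0, 0.0, 0.0]
path = viterbi_decode(emissions, transitions, start, end)
print(path)  # [1, 2], i.e. B followed by I
```

A per-token argmax over these emissions would pick O then I, which is not a valid BIO sequence. The transition scores steer the decoder to B then I instead, which is the practical benefit of decoding through a CRF layer.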

Supported Languages

Code	Language
AR	Arabic
DE	German
EN	English
FI	Finnish
FR	French
HI	Hindi
IT	Italian
PL	Polish
PT	Portuguese
SP	Spanish
TR	Turkish

Entity Types

The model recognizes 19 PII entity types using BIO tagging:

Entity	Description
`PERSON`	Person names
`PERSON_EMAIL`	Email addresses
`PERSON_SOCIAL_RELATION`	Social relations (e.g., "my wife")
`ORG`	Organizations
`LOC_CITY`	Cities
`LOC_COUNTRY`	Countries
`LOC_STREET`	Street names
`LOC_ZIP`	ZIP/postal codes
`LOC_HOUSENUMBER`	House numbers
`LOC_OTHER`	Other locations
`DATETIME`	Dates and times
`DATETIME_AGE`	Ages
`CODE`	ID numbers, reference codes
`CODE_PHONE`	Phone numbers
`CODE_URL`	URLs
`PROFESSION`	Professions
`PRODUCT`	Product names
`QUANTITY`	Quantities
`MISC`	Miscellaneous PII
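
As an illustration of how BIO tags map back to entities, the following sketch groups a label sequence into typed spans. It assumes the labels are strings of the form `B-TYPE`, `I-TYPE`, and `O`. The helper name `bio_to_spans` is hypothetical and not part of this repository, and the example labels are written by hand rather than produced by the model.

```python
def bio_to_spans(labels):
    """Group BIO labels into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        prefix, _, ent_type = label.partition("-")
        continues = prefix == "I" and ent_type == current
        if not continues:
            # Close the span that was open, if any
            if current is not None:
                spans.append((current, start, i))
            # A B- tag (or a stray I- tag) opens a new span
            if prefix in ("B", "I"):
                start, current = i, ent_type
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


tokens = ["My", "wife", "is", "John", "Smith", "from", "Berlin"]
labels = ["O", "O", "O", "B-PERSON", "I-PERSON", "O", "B-LOC_CITY"]
spans = bio_to_spans(labels)
for ent_type, start, end in spans:
    print(ent_type, tokens[start:end])
```

Treating a stray `I-` tag as the start of a new span is a design choice of this sketch; a stricter variant could discard such tags instead.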

Performance

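Evaluated on held-out test sets per language (type-aware micro scores, in percent). P, R, F1, and F2 are precision, recall, and the F-measures with β = 1 and β = 2; Len and Ex presumably denote lenient and exact span matching: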

Language	Len P	Len R	Len F1	Len F2	Ex P	Ex R	Ex F1	Ex F2
AR	87.87	73.15	79.84	75.69	84.45	70.30	76.73	72.74
DE	94.12	90.66	92.36	91.33	93.33	89.90	91.58	90.56
EN	94.93	93.45	94.18	93.74	92.41	90.97	91.69	91.25
FI	91.36	88.46	89.89	89.03	89.93	87.07	88.48	87.63
FR	90.91	88.09	89.48	88.64	87.66	84.94	86.28	85.47
HI	87.55	82.33	84.86	83.33	83.37	78.40	80.81	79.35
IT	93.57	87.81	90.60	88.90	90.72	85.13	87.84	86.19
PL	90.11	90.31	90.21	90.27	87.41	87.61	87.51	87.57
PT	91.10	90.69	90.90	90.77	89.28	88.88	89.08	88.96
SP	93.06	91.47	92.26	91.79	91.30	89.74	90.51	90.05
TR	89.13	86.53	87.81	87.04	85.79	83.29	84.52	83.78
AVG	91.25	87.54	89.31	88.23	88.70	85.11	86.82	85.78
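
The F1 and F2 columns can be recomputed from the P and R columns with the standard F-beta formula. The sketch below does this for the English row; the function name `f_beta` is chosen here for illustration. Because the table reports rounded precision and recall, recomputed values for other rows may differ from the table in the last digit.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# English row, Len columns: P = 94.93, R = 93.45
print(round(f_beta(94.93, 93.45, beta=1), 2))  # 94.18, matches Len F1
print(round(f_beta(94.93, 93.45, beta=2), 2))  # 93.74, matches Len F2
```

F2 weights recall more heavily than precision, which is a common choice when a missed entity is considered more costly than a false alarm.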

Usage

This model uses a custom CRF architecture with FLERT-style context windowing and cannot be loaded directly with AutoModelForTokenClassification. You need to use the custom ModernBertCRF class.

Note: The config.json in this repo exists solely for Hugging Face download tracking. For model loading, use crf_config.json and flert_config.json instead.

Setup

import torch
import json
import re
import torch.nn as nn
import spacy
from transformers import AutoModel, AutoTokenizer
from torchcrf import CRF
from huggingface_hub import snapshot_download

class ModernBertCRF(nn.Module):
    def __init__(self, base_model_name, num_labels, id2label, label2id):
        super().__init__()
        self.num_labels = num_labels
        self.id2label = id2label
        self.label2id = label2id
        self.transformer = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.transformer.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        kwargs.pop("token_type_ids", None)
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs.last_hidden_state)
        emissions = self.classifier(sequence_output)
        return {"logits": emissions}

    def decode(self, emissions, mask):
        return self.crf.decode(emissions, mask=mask)

# Load model
model_dir = snapshot_download("DFKI-SLT/multilingual_DialogPII_NER")

with open(f"{model_dir}/crf_config.json") as f:
    config = json.load(f)

with open(f"{model_dir}/flert_config.json") as f:
    flert_config = json.load(f)

model = ModernBertCRF(
    base_model_name=config["base_model_name"],
    num_labels=config["num_labels"],
    id2label=config["id2label"],
    label2id=config["label2id"],
)
model.load_state_dict(torch.load(f"{model_dir}/pytorch_model.bin", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_dir)
id2label = {int(k): v for k, v in config["id2label"].items()}

context_window = flert_config["context_window"]     # 2
use_sep_marker = flert_config["context_sep_marker"]  # True

Preprocessing: Sentence Splitting

The model was trained using FLERT-style context windowing over sentence-level input. Each sentence is predicted with surrounding context sentences. For best results, split your input into sentences using spaCy before inference.

nlp = spacy.blank("en")          # use "de" for German, "xx" for multilingual
nlp.add_pipe("sentencizer")

def split_dialogue(text, nlp):
    sentences = []
    for line in text.strip().splitlines():
        m = re.match(r"^(SPEAKER_\d+)\s*:\s*(.*)", line.strip())
        if m:
            speaker, rest = m.group(1), m.group(2)
            sentences.append([speaker, ":"])
            line = rest
        if not line:
            continue
        doc = nlp(line)
        for sent in doc.sents:
            tokens = [tok.text for tok in sent if not tok.is_space]
            if tokens:
                sentences.append(tokens)
    return sentences

# Example
raw = """SPEAKER_00: Hello, my name is Peter.
SPEAKER_01: Hello, my name is Peter as well. Okay, and where do you come from? I come from Chicago."""

sentences = split_dialogue(raw, nlp)

Inference with FLERT Context Windowing

The key difference from standard token classification: each sentence is predicted within a window of surrounding context sentences, joined by [SEP] tokens. Only labels for the target sentence are extracted.

def predict_dialogue(sentences, model, tokenizer, id2label,
                     context_window=2, use_sep_marker=True, device="cpu"):
    sep = tokenizer.sep_token
    model = model.to(device)  # keep the model on the same device as the inputs
    all_labels = []
    for i, target_tokens in enumerate(sentences):
        left  = sentences[max(0, i - context_window):i]
        right = sentences[i + 1:i + 1 + context_window]

        flat_tokens = []
        for s in left:
            flat_tokens.extend(s)
        if use_sep_marker and left:
            flat_tokens.append(sep)

        tgt_start = len(flat_tokens)
        flat_tokens.extend(target_tokens)
        tgt_end = len(flat_tokens)

        if use_sep_marker and right:
            flat_tokens.append(sep)
        for s in right:
            flat_tokens.extend(s)

        enc = tokenizer(flat_tokens, is_split_into_words=True,
                        return_tensors="pt", truncation=False).to(device)
        word_ids = enc.word_ids(batch_index=0)

        with torch.no_grad():
            emissions = model(**enc)["logits"]
            mask = enc["attention_mask"].bool()
            preds = model.decode(emissions, mask)[0]

        word_labels = ["O"] * len(target_tokens)
        seen = set()
        for idx, wid in enumerate(word_ids):
            if wid is None or wid in seen:
                continue
            seen.add(wid)
            if tgt_start <= wid < tgt_end:
                word_labels[wid - tgt_start] = id2label[preds[idx]]

        all_labels.append(word_labels)
    return all_labels


# Run prediction
results = predict_dialogue(sentences, model, tokenizer, id2label,
                           context_window=context_window,
                           use_sep_marker=use_sep_marker)

for sent_tokens, sent_labels in zip(sentences, results):
    for token, label in zip(sent_tokens, sent_labels):
        if label != "O":
            print(f"{token:20s} -> {label}")
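
For de-identification, the predicted labels can be used to replace PII tokens with placeholders. The following is a minimal sketch under the assumption that labels have the form `B-TYPE`, `I-TYPE`, and `O`. The helper `mask_pii` is hypothetical and not part of this repository, and the labels in the example are written by hand rather than taken from the model.

```python
def mask_pii(tokens, labels):
    """Replace each tagged entity with a single [TYPE] placeholder."""
    masked = []
    prev_type = None
    for token, label in zip(tokens, labels):
        if label == "O":
            masked.append(token)
            prev_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        # Emit one placeholder per entity: at a B- tag or when the type changes.
        # I- tags that continue the same entity are dropped.
        if prefix == "B" or ent_type != prev_type:
            masked.append(f"[{ent_type}]")
        prev_type = ent_type
    return masked


tokens = ["My", "name", "is", "John", "Smith", "and", "I", "live", "in", "Berlin", "."]
labels = ["O", "O", "O", "B-PERSON", "I-PERSON", "O", "O", "O", "O", "B-LOC_CITY", "O"]
masked = mask_pii(tokens, labels)
print(" ".join(masked))  # My name is [PERSON] and I live in [LOC_CITY] .
```

To mask a whole dialogue, the same function could be applied to each pair of sentence tokens and predicted labels.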

Single-sentence inference

For isolated sentences without dialogue context, pass them with context_window=0:

tokens = ["My", "name", "is", "John", "Smith", "and", "I", "live", "in", "Berlin", "."]

results = predict_dialogue([tokens], model, tokenizer, id2label,
                           context_window=0, use_sep_marker=False)

for token, label in zip(tokens, results[0]):
    if label != "O":
        print(f"{token:20s} -> {label}")

Training Data

The model was trained on synthetic multilingual dialogue data covering various domains (medical anamnesis, customer support, police reports, therapy sessions, etc.). The data was generated and annotated as part of a thesis project on multilingual PII de-identification.

Limitations

Trained on synthetic dialogue data; performance on real-world data may vary
Optimized for dialogue/conversational text; may underperform on formal documents
Arabic and Hindi score lower (Ex F1 of 76.73 and 80.81) than the other nine languages (Ex F1 between 84.52 and 91.69)
Requires pytorch-crf package for inference

Citation

If you use this model, please cite:

@misc{roller2026multilingual,
  title={DialogPII: A multilingual dataset of synthetic dialog transcripts to detect personal information},
  author={Roland Roller and Vera Czehmann and Derya Erman and Luke Flanagan and Ibrahim Baroud and Fr{\'e}d{\'e}ric Blain and Viviana Cotik and Eletta Giusto and Akhil Juneja and Mariana Neves and Maria S{\l}owi{\'n}ska and Christine Hovhannisyan and Aaron Louis Eidt and Lisa Raithel and Sebastian M{\"o}ller and Maija Poikela},
  year={2026},
  institution={DFKI SLT}
}
