XLMR-Council-Anonymizer: Personal Data Identification for Portuguese Municipal Meeting Minutes

This model is a fine-tuned XLM-RoBERTa Large that extracts and identifies sensitive personal data in the minutes of Portuguese municipal meetings.

Model Description

The XLMR-Council-Anonymizer builds on the multilingual contextual representations of FacebookAI's XLM-RoBERTa and is fine-tuned for the language and formal structure of administrative minutes in Portugal. Unlike generic NER models, it was trained with a weighted cross-entropy loss to counter class imbalance, so that entity types with few occurrences can still be detected.
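The class-weighting idea can be sketched in plain Python. The inverse-frequency weighting scheme, the toy label counts, and the helper names below are illustrative assumptions; the source does not state how the weights were actually derived. The loss is the usual weighted mean of per-token negative log-likelihoods, which is also what `torch.nn.CrossEntropyLoss` computes when given a `weight` tensor.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme (not from the source): weight_c = N / (K * count_c)."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def weighted_cross_entropy(logits, targets, class_names, weights):
    """Weighted mean of per-token negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for token_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class logits.
        peak = max(token_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        nll = log_norm - token_logits[class_names.index(target)]
        weighted_sum += weights[target] * nll
        weight_total += weights[target]
    return weighted_sum / weight_total


# Hypothetical toy data: the "O" label dominates, rare entity labels are scarce.
class_names = ["O", "B-PERSONAL-NAME", "B-PERSONAL-FAMILY"]
train_labels = ["O"] * 8 + ["B-PERSONAL-NAME"] * 3 + ["B-PERSONAL-FAMILY"]
weights = inverse_frequency_weights(train_labels)

logits = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [1.0, 0.2, 0.4]]
targets = ["O", "B-PERSONAL-NAME", "B-PERSONAL-FAMILY"]
loss = weighted_cross_entropy(logits, targets, class_names, weights)

print(weights)
print(round(loss, 4))
```

With these weights, a mistake on the rare `B-PERSONAL-FAMILY` token costs eight times as much as the same mistake on an `O` token, which is the effect class weighting is meant to have.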

Key Features

🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes.
🛡️ Privacy-Focused NER: Identifies and classifies sensitive entities (PII) to support automatic anonymization processes.
⚙️ Transformer-based Architecture: Uses XLM-RoBERTa to capture the grammatical and formal context of administrative documents.

Model Details

Base Model: XLM-RoBERTa Large
Architecture: Token Classification (NER) with Weighted Cross-Entropy Loss
Parameters: ~560M
Max Sequence Length: 512 tokens
Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities)
Evaluation Metrics: F1-Score, Recall and Precision
Training Framework: PyTorch + Transformers + Seqeval

Entity Types

The model recognizes 19 entity types in BIO format (49 labels total):

Entity Type	Description	Example
`PERSONAL-NAME`	Proper names of individuals	João Silva
`PERSONAL-ADMIN`	Administrative identifiers and case/process numbers	5597/2023
`PERSONAL-POSITION`	Professional roles, political positions, or technical functions	Diretor do Departamento dos Recursos Humanos
`PERSONAL-ADDRESS`	Addresses, street names, and door/plot numbers	Rua das Flores n.º 10, Avenida Central
`PERSONAL-DATE`	Dates of events, decisions, or time periods	20/05/2023
`PERSONAL-LOCATION`	Cities, parishes, districts, or geographic locations	Freguesia do Porto
`PERSONAL-OTHER`	Generic personal information and miscellaneous contact data	Referências de contacto, dados diversos
`PERSONAL-INFO`	Biographical data or sensitive personal information	11490753
`PERSONAL-COMPANY`	Companies or private legal entities	Construções & Filho, Lda
`PERSONAL-ARTISTIC`	Stage names, pseudonyms	Pintura
`PERSONAL-DEGREE`	Academic titles or professional degrees	Licenciatura de Psicologia
`PERSONAL-TIME`	References to specific times	14:30h
`PERSONAL-LICENSE`	License plates or registration numbers	48-RF-99
`PERSONAL-JOB`	Person's profession or occupation	Professor
`PERSONAL-VEHICLE`	Vehicle identification and models	Mercedes-Benz Classe S
`PERSONAL-FACULTY`	Higher education institutions or university faculties	Faculdade de Economia da Universidade do Porto
`PERSONAL-FAMILY`	Mentions of kinship, family relationships, or heirs	Marido
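To make the BIO scheme concrete, here is a minimal sketch of how per-token tags can be grouped into entity spans. The function name `bio_to_spans` and the example tokens and tags are illustrative assumptions, not output captured from the model.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # "B-PERSONAL-NAME" -> ("B", "PERSONAL-NAME"); "O" -> ("O", "").
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated as a span start.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical tagging of a short fragment.
tokens = ["João", "Silva", "submeteu", "o", "processo", "5597/2023"]
tags = ["B-PERSONAL-NAME", "I-PERSONAL-NAME", "O", "O", "O", "B-PERSONAL-ADMIN"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Each entity type contributes a `B-` (begin) and an `I-` (inside) label, and `O` marks tokens outside any entity.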

How It Works

The model performs token-level classification: each token is labeled according to its surrounding context, using the entity labels listed above. The labeled tokens mark the sensitive spans in the text, which can then be replaced with placeholders to anonymize the data automatically.

Input:

O interessado João Silva submeteu o processo administrativo 5597/2023 no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.

Output:

O interessado <NAME> submeteu o processo administrativo <ADMIN> no dia <DATE>, relativo ao imóvel localizado na <ADDRESS>.
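The replacement step in this example can be sketched as follows. This is a minimal illustration, assuming entity spans arrive as dicts with `start`, `end`, and `entity_group` keys (the shape the Transformers NER pipeline returns when aggregation is enabled). The helper names `anonymize` and `span`, the hand-built spans, and the choice to drop the `PERSONAL-` prefix in the placeholder are assumptions made for illustration, not the authors' documented procedure.

```python
def anonymize(text, entities, prefix="PERSONAL-"):
    """Replace each detected span with a <TYPE> placeholder."""
    result = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["entity_group"]
        if label.startswith(prefix):
            label = label[len(prefix):]
        result = result[:ent["start"]] + f"<{label}>" + result[ent["end"]:]
    return result


def span(text, surface, entity_group):
    """Example helper: build a span dict from a known substring."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface), "entity_group": entity_group}


text = (
    "O interessado João Silva submeteu o processo administrativo 5597/2023 "
    "no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10."
)
# Hand-written spans standing in for model predictions.
entities = [
    span(text, "João Silva", "PERSONAL-NAME"),
    span(text, "5597/2023", "PERSONAL-ADMIN"),
    span(text, "20/05/2023", "PERSONAL-DATE"),
    span(text, "Rua das Flores n.º 10", "PERSONAL-ADDRESS"),
]
anonymized = anonymize(text, entities)
print(anonymized)
```

Replacing from the end of the string backwards avoids having to recompute character offsets after each substitution.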

Results

Overall Performance (Test Set)

Metric	Macro (%)	Micro (%)
F1 Score	72.67	83.51
Precision	71.21	79.77
Recall	74.87	87.85
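The gap between the macro and micro columns comes from how the two averages are formed. Macro averaging gives every entity type equal weight, while micro averaging pools all predictions, so frequent types dominate. The sketch below uses made-up counts (not the model's actual confusion counts) purely to show the arithmetic.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: one frequent, well-detected type and one rare, weak type.
counts = {
    "PERSONAL-NAME": {"tp": 90, "fp": 5, "fn": 10},
    "PERSONAL-FAMILY": {"tp": 1, "fp": 3, "fn": 3},
}

# Macro: average the per-type F1 scores, each type counting equally.
per_type_f1 = {name: prf(**c)[2] for name, c in counts.items()}
macro_f1 = sum(per_type_f1.values()) / len(per_type_f1)

# Micro: pool the counts across types, then compute a single F1.
pooled = {key: sum(c[key] for c in counts.values()) for key in ("tp", "fp", "fn")}
micro_f1 = prf(**pooled)[2]

print(f"macro F1 = {macro_f1:.4f}")
print(f"micro F1 = {micro_f1:.4f}")
```

In this toy case the rare type drags the macro score well below the micro score, the same direction as the gap reported in the table above.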

Per-Entity Performance

Entity Type	Precision (%)	Recall (%)	F1 Score (%)	Support
`PERSONAL-NAME`	96.43	95.45	95.94	208
`PERSONAL-ADMIN`	85.96	90.00	87.93	169
`PERSONAL-POSITION`	60.65	80.34	69.12	130
`PERSONAL-ADDRESS`	69.57	77.42	73.28	62
`PERSONAL-DATE`	90.20	93.88	92.00	48
`PERSONAL-LOCATION`	71.43	64.52	67.80	31
`PERSONAL-OTHER`	34.78	44.44	39.02	18
`PERSONAL-COMPANY`	83.33	62.50	71.43	8
`PERSONAL-TIME`	75.00	100.00	85.71	6
`PERSONAL-FAMILY`	0.00	0.00	0.00	4
`PERSONAL-DEGREE`	100.00	100.00	100.00	2
`PERSONAL-FACULTY`	50.00	50.00	50.00	2
`PERSONAL-INFO`	100.00	100.00	100.00	1
`PERSONAL-ARTISTIC`	0.00	0.00	0.00	0
`PERSONAL-LICENSE`	0.00	0.00	0.00	0
`PERSONAL-JOB`	0.00	0.00	0.00	0
`PERSONAL-VEHICLE`	0.00	0.00	0.00	0

Usage

Quick Start

The simplest way to use the model:

from transformers import pipeline

model_name = "inesctec/CitiLink-XLMR-Anonymization-pt"

# aggregation_strategy="simple" groups adjacent tokens of the same entity type into one span.
nlp = pipeline("ner", model=model_name, tokenizer=model_name, aggregation_strategy="simple")

text = "A reunião foi presidida por Manuel Brito no concelho de Alandroal."

results = nlp(text)

# Each result is a dict holding the span text, its entity group, and a confidence score.
for entity in results:
    print(f"Entity: {entity['word']} | Category: {entity['entity_group']} | Score: {entity['score']:.4f}")

Limitations

Domain specificity: Trained specifically on Portuguese municipal meeting minutes; performs best on administrative/governmental meeting minutes and may not generalize well to other document types
Sequence length: Limited to 512 tokens per window
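Documents longer than one window have to be split before inference. The source does not describe how windows are built, so the following is only one possible sketch: overlapping windows over a list of token ids, with a hypothetical `sliding_windows` helper. The default of 510 content tokens assumes two positions are reserved for the `<s>` and `</s>` special tokens, and the overlap of 128 tokens is an arbitrary illustrative choice.

```python
def sliding_windows(token_ids, max_len=510, stride=128):
    """Return (start_offset, window) pairs that cover token_ids with overlap.

    max_len: content tokens per window (assumed: 512 minus 2 special tokens).
    stride:  number of tokens shared between consecutive windows.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return windows


# Stand-in for a tokenized document of 1000 tokens.
document = list(range(1000))
windows = sliding_windows(document)
for offset, window in windows:
    print(offset, len(window))
```

The overlap gives entities that straddle a window boundary a chance to appear whole in at least one window; predictions from overlapping regions then need to be merged, for example by keeping the higher-scoring label.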

Version: 1.0
Last Updated: 2026-06-25

License

This project uses a custom dual-license based on AGPL v3.

See the full license terms here: LICENSE
