XLMR-Council-Anonymizer: Personal Data Identification for Portuguese Municipal Meeting Minutes

This model is a fine-tuned XLM-RoBERTa Large for extracting and identifying sensitive personal data in the minutes of Portuguese municipal meetings.

Model Description

The XLMR-Council-Anonymizer builds on the multilingual contextual representations of FacebookAI's XLM-RoBERTa and is fine-tuned for the linguistic and formal structure of administrative minutes in Portugal. Unlike generic NER models, it was trained with a weighted cross-entropy loss to handle class imbalance, which is intended to improve detection of entity types with few occurrences.
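As a rough illustration of how a weighted cross-entropy loss counteracts class imbalance, the sketch below computes per-class weights and the resulting loss in plain Python. The inverse-frequency weighting scheme, the function names, and the toy data are assumptions made for this example; the source does not state how the class weights were actually derived. In PyTorch, the same idea is usually expressed by passing a `weight` tensor to `torch.nn.CrossEntropyLoss`.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Assumed scheme: weight = total / (num_classes * count); unseen classes get 0."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-token negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class scores of this token.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_norm - row[target]
        weighted_sum += weights[target] * nll
        weight_total += weights[target]
    return weighted_sum / weight_total


# Toy label distribution: class 0 (e.g. "O") dominates, class 2 is rare.
train_labels = [0] * 8 + [1] * 3 + [2] * 1
weights = inverse_frequency_weights(train_labels, num_classes=3)

# Two toy tokens: one from the frequent class, one from the rare class.
logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 0.1]]
targets = [0, 2]
loss = weighted_cross_entropy(logits, targets, weights)
```

With these toy counts the rare class receives a weight eight times larger than the dominant class, so a mistake on a rare entity contributes proportionally more to the loss.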

Key Features

  • 🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes.
  • 🛡️ Privacy-Focused NER: Identifies and classifies sensitive entities (PII) to support automatic anonymization processes.
  • ⚙️ Transformer-based Architecture: Uses XLM-RoBERTa to capture the grammatical and formal context of administrative documents.

Model Details

  • Base Model: XLM-RoBERTa Large
  • Architecture: Token Classification (NER) with Weighted Cross-Entropy Loss
  • Parameters: ~560M
  • Max Sequence Length: 512 tokens
  • Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities)
  • Evaluation Metrics: F1-Score, Recall and Precision
  • Training Framework: PyTorch + Transformers + Seqeval

Entity Types

The model recognizes 19 entity types in BIO format (49 labels total):

| Entity Type | Description | Example |
|---|---|---|
| PERSONAL-NAME | Proper names of individuals | João Silva |
| PERSONAL-ADMIN | Administrative identifiers and case/process numbers | 5597/2023 |
| PERSONAL-POSITION | Professional roles, political positions, or technical functions | Diretor do Departamento dos Recursos Humanos |
| PERSONAL-ADDRESS | Addresses, street names, and door/plot numbers | Rua das Flores n.º 10, Avenida Central |
| PERSONAL-DATE | Dates of events, decisions, or time periods | 20/05/2023 |
| PERSONAL-LOCATION | Cities, parishes, districts, or geographic locations | Freguesia do Porto |
| PERSONAL-OTHER | Generic personal information and miscellaneous contact data | Referências de contacto, dados diversos |
| PERSONAL-INFO | Biographical data or sensitive personal information | 11490753 |
| PERSONAL-COMPANY | Companies or private legal entities | Construções & Filho, Lda |
| PERSONAL-ARTISTIC | Stage names, pseudonyms | Pintura |
| PERSONAL-DEGREE | Academic titles or professional degrees | Licenciatura de Psicologia |
| PERSONAL-TIME | References to specific times | 14:30h |
| PERSONAL-LICENSE | License plates or registration numbers | 48-RF-99 |
| PERSONAL-JOB | Person's profession or occupation | Professor |
| PERSONAL-VEHICLE | Vehicle identification and models | Mercedes-Benz Classe S |
| PERSONAL-FACULTY | Higher education institutions or university faculties | Faculdade de Economia da Universidade do Porto |
| PERSONAL-FAMILY | Mentions of kinship, family relationships, or heirs | Marido |
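To make the BIO scheme concrete, the following sketch shows how a label set can be derived from a list of entity types and how a sequence of BIO tags can be grouped back into entity spans. The helper names `build_bio_labels` and `bio_to_spans` are hypothetical and not part of the released model; the exact label inventory of the checkpoint should be read from its configuration (`id2label`) rather than from this example.

```python
def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for name in entity_types:
        labels += [f"B-{name}", f"I-{name}"]
    return labels


def bio_to_spans(tags):
    """Group BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current = None  # [entity_type, start_index] of the span being built
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and current and current[0] == name:
            continue  # still inside the current span
        if current:
            spans.append((current[0], current[1], i))
            current = None
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated as a span start.
            current = [name, i]
    if current:
        spans.append((current[0], current[1], len(tags)))
    return spans


labels = build_bio_labels(["PERSONAL-NAME", "PERSONAL-DATE"])

tokens = ["O", "interessado", "João", "Silva", "em", "20/05/2023"]
tags = ["O", "O", "B-PERSONAL-NAME", "I-PERSONAL-NAME", "O", "B-PERSONAL-DATE"]
spans = bio_to_spans(tags)
```

Here `spans` holds one `PERSONAL-NAME` span covering tokens 2 to 3 and one `PERSONAL-DATE` span covering token 5.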

How It Works

The model performs token-level classification: each token is assigned a label based on its surrounding context. Contiguous tokens that share an entity type are then grouped into spans, and each span can be replaced with a placeholder for its type, which supports automatic anonymization of the document, as in the example below.

Input:

O interessado João Silva submeteu o processo administrativo 5597/2023 no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.

Output:

O interessado <NAME> submeteu o processo administrativo <ADMIN> no dia <DATE>, relativo ao imóvel localizado na <ADDRESS>.
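The replacement step shown above can be sketched as a small post-processing function over span predictions. This is a minimal sketch, assuming entities arrive as dictionaries with `entity_group`, `start`, and `end` keys (the shape returned by the Transformers `pipeline` with an aggregation strategy) and that labels carry a `PERSONAL-` prefix as in the entity table. The `anonymize` and `find_span` helpers are hypothetical, and the entity list is hand-built here rather than produced by the model.

```python
def anonymize(text, entities, prefix="PERSONAL-"):
    """Replace each detected character span with a <TYPE> placeholder."""
    result = text
    # Replace right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["entity_group"]
        if label.startswith(prefix):
            label = label[len(prefix):]
        result = result[:ent["start"]] + f"<{label}>" + result[ent["end"]:]
    return result


def find_span(text, surface, group):
    """Demo helper: build a pipeline-style entity dict by substring search."""
    start = text.index(surface)
    return {"entity_group": group, "start": start, "end": start + len(surface)}


text = (
    "O interessado João Silva submeteu o processo administrativo 5597/2023 "
    "no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10."
)

# Hand-built stand-ins for model predictions (not actual model output).
entities = [
    find_span(text, "João Silva", "PERSONAL-NAME"),
    find_span(text, "5597/2023", "PERSONAL-ADMIN"),
    find_span(text, "20/05/2023", "PERSONAL-DATE"),
    find_span(text, "Rua das Flores n.º 10", "PERSONAL-ADDRESS"),
]

anonymized = anonymize(text, entities)
print(anonymized)
```

Working from the last span to the first avoids recomputing offsets after each substitution, since placeholders and the original spans generally differ in length.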

Results

Overall Performance (Test Set)

| Metric | Macro (%) | Micro (%) |
|---|---|---|
| F1 Score | 72.67 | 83.51 |
| Precision | 71.21 | 79.77 |
| Recall | 74.87 | 87.85 |

Per-Entity Performance

| Entity Type | Precision (%) | Recall (%) | F1 Score (%) | Support |
|---|---|---|---|---|
| PERSONAL-NAME | 96.43 | 95.45 | 95.94 | 208 |
| PERSONAL-ADMIN | 85.96 | 90.00 | 87.93 | 169 |
| PERSONAL-POSITION | 60.65 | 80.34 | 69.12 | 130 |
| PERSONAL-ADDRESS | 69.57 | 77.42 | 73.28 | 62 |
| PERSONAL-DATE | 90.20 | 93.88 | 92.00 | 48 |
| PERSONAL-LOCATION | 71.43 | 64.52 | 67.80 | 31 |
| PERSONAL-OTHER | 34.78 | 44.44 | 39.02 | 18 |
| PERSONAL-COMPANY | 83.33 | 62.50 | 71.43 | 8 |
| PERSONAL-TIME | 75.00 | 100.00 | 85.71 | 6 |
| PERSONAL-FAMILY | 0.00 | 0.00 | 0.00 | 4 |
| PERSONAL-DEGREE | 100.00 | 100.00 | 100.00 | 2 |
| PERSONAL-FACULTY | 50.00 | 50.00 | 50.00 | 2 |
| PERSONAL-INFO | 100.00 | 100.00 | 100.00 | 1 |
| PERSONAL-ARTISTIC | 0.00 | 0.00 | 0.00 | 0 |
| PERSONAL-LICENSE | 0.00 | 0.00 | 0.00 | 0 |
| PERSONAL-JOB | 0.00 | 0.00 | 0.00 | 0 |
| PERSONAL-VEHICLE | 0.00 | 0.00 | 0.00 | 0 |
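The gap between the macro and micro scores reported above follows from how the two averages treat rare entity types. The sketch below computes both from per-type true positive, false positive, and false negative counts. The counts are toy values chosen for illustration, not the model's actual confusion statistics, and the source does not specify which entity types were included in its macro average.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, with zero-division guarded."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_macro_f1(counts):
    """counts maps entity type -> (tp, fp, fn). Returns (micro_f1, macro_f1)."""
    # Macro: average the per-type F1 scores, so every type counts equally.
    per_type_f1 = [prf(*c)[2] for c in counts.values()]
    macro = sum(per_type_f1) / len(per_type_f1)
    # Micro: pool the counts first, so frequent types dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)[2]
    return micro, macro


# Hypothetical counts: one frequent, well-detected type and one rare, missed type.
toy_counts = {
    "FREQUENT": (90, 10, 10),
    "RARE": (0, 1, 4),
}
micro, macro = micro_macro_f1(toy_counts)
```

In this toy setting the rare type halves the macro F1 while barely moving the micro F1, which is the same direction of effect visible in the overall results table.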

Usage

Quick Start

The simplest way to use the model:

from transformers import pipeline

model_name = "inesctec/CitiLink-XLMR-Anonymization-pt"

# Token-classification pipeline; "simple" aggregation merges sub-word tokens into entity spans.
nlp = pipeline("ner", model=model_name, tokenizer=model_name, aggregation_strategy="simple")

text = "A reunião foi presidida por Manuel Brito no concelho de Alandroal."

results = nlp(text)

# Each result carries the matched text, the predicted entity group, and a confidence score.
for entity in results:
    print(f"Entity: {entity['word']} | Category: {entity['entity_group']} | Score: {entity['score']:.4f}")

Limitations

  • Domain specificity: Trained specifically on Portuguese municipal meeting minutes; performs best on administrative/governmental meeting minutes and may not generalize well to other document types.
  • Sequence length: Limited to 512 tokens per window.
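Because full minutes are typically longer than one 512-token window, long inputs need to be split before inference. The sketch below shows one common way to do this with overlapping windows over a list of token ids. The `sliding_windows` helper and the overlap of 128 tokens are assumptions for illustration; the source does not describe how windows were built during training or evaluation. In practice, the `<s>` and `</s>` special tokens also count toward the limit, and Hugging Face fast tokenizers can produce overlapping windows directly via `return_overflowing_tokens=True` together with `stride`.

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens. Returns a list of
    (start_offset, window) pairs so predictions can be mapped back.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        end = min(start + max_len, len(token_ids))
        windows.append((start, token_ids[start:end]))
        if end == len(token_ids):
            break
        start += step
    return windows


# Toy input: 1000 placeholder token ids standing in for a tokenized document.
token_ids = list(range(1000))
windows = sliding_windows(token_ids, max_len=512, stride=128)
starts = [start for start, _ in windows]
```

The overlap gives entities near a window boundary a chance to appear with context on both sides in at least one window; predictions in overlapping regions then have to be reconciled, for example by keeping the higher-scoring one.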

Version: 1.0
Last Updated: 2026-06-25


License

This project uses a custom dual-license based on AGPL v3.

See the full license terms here: LICENSE
