XLMR-Council-Anonymizer: Personal Data Identification for Portuguese Municipal Meeting Minutes
This model is a fine-tuned XLM-RoBERTa Large that identifies and extracts sensitive personal data in the minutes of Portuguese municipal meetings.
Model Description
The XLMR-Council-Anonymizer builds on the multilingual contextual representations of FacebookAI's XLM-RoBERTa, fine-tuned for the language and formal structure of administrative minutes in Portugal. Unlike generic NER models, it was trained with a weighted cross-entropy loss to counter class imbalance, which is intended to improve detection of entity types with few occurrences.
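As a rough illustration of the weighting idea, the sketch below computes inverse-frequency class weights and a weighted cross-entropy over a few tokens in plain Python. The weighting scheme, the function names (`inverse_frequency_weights`, `weighted_cross_entropy`) and the toy label counts are assumptions made for illustration; this card does not state how the actual weights were derived.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}

def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-token negative log-likelihoods.

    logits:  list of {label: score} dicts, one per token
    targets: list of gold labels, one per token
    weights: {label: weight}
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for scores, gold in zip(logits, targets):
        # Numerically stable log-sum-exp over all label scores.
        peak = max(scores.values())
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
        nll = log_norm - scores[gold]
        weighted_sum += weights[gold] * nll
        weight_total += weights[gold]
    # Normalize by the summed weights of the gold labels.
    return weighted_sum / weight_total

# Toy label distribution (hypothetical): "O" dominates, FAMILY is rare.
gold = ["O"] * 8 + ["B-PERSONAL-NAME"] * 3 + ["B-PERSONAL-FAMILY"]
weights = inverse_frequency_weights(gold)
```

With these toy counts the rare `B-PERSONAL-FAMILY` label gets a weight of 4.0 and `O` gets 0.5, so a mistake on the rare class contributes eight times as much to the loss. Dividing by the summed weights follows the normalization that `torch.nn.CrossEntropyLoss` applies when given `weight` with `reduction="mean"`.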
Key Features
- 🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes.
- 🛡️ Privacy-Focused NER: Identifies and classifies sensitive entities (PII) to support automatic anonymization processes.
- ⚙️ Transformer-based Architecture: Uses XLM-RoBERTa to capture the grammatical and formal context of administrative documents.
Model Details
- Base Model: XLM-RoBERTa Large
- Architecture: Token Classification (NER) with Weighted Cross-Entropy Loss
- Parameters: ~560M
- Max Sequence Length: 512 tokens
- Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities)
- Evaluation Metrics: F1-Score, Recall and Precision
- Training Framework: PyTorch + Transformers + Seqeval
Entity Types
The model recognizes 19 entity types in BIO format (49 labels total):
| Entity Type | Description | Example |
|---|---|---|
| `PERSONAL-NAME` | Proper names of individuals | João Silva |
| `PERSONAL-ADMIN` | Administrative identifiers and case/process numbers | 5597/2023 |
| `PERSONAL-POSITION` | Professional roles, political positions, or technical functions | Diretor do Departamento dos Recursos Humanos |
| `PERSONAL-ADDRESS` | Addresses, street names, and door/plot numbers | Rua das Flores n.º 10, Avenida Central |
| `PERSONAL-DATE` | Dates of events, decisions, or time periods | 20/05/2023 |
| `PERSONAL-LOCATION` | Cities, parishes, districts, or geographic locations | Freguesia do Porto |
| `PERSONAL-OTHER` | Generic personal information and miscellaneous contact data | Referências de contacto, dados diversos |
| `PERSONAL-INFO` | Biographical data or sensitive personal information | 11490753 |
| `PERSONAL-COMPANY` | Companies or private legal entities | Construções & Filho, Lda |
| `PERSONAL-ARTISTIC` | Stage names, pseudonyms | Pintura |
| `PERSONAL-DEGREE` | Academic titles or professional degrees | Licenciatura de Psicologia |
| `PERSONAL-TIME` | References to specific times | 14:30h |
| `PERSONAL-LICENSE` | License plates or registration numbers | 48-RF-99 |
| `PERSONAL-JOB` | Person's profession or occupation | Professor |
| `PERSONAL-VEHICLE` | Vehicle identification and models | Mercedes-Benz Classe S |
| `PERSONAL-FACULTY` | Higher education institutions or university faculties | Faculdade de Economia da Universidade do Porto |
| `PERSONAL-FAMILY` | Mentions of kinship, family relationships, or heirs | Marido |
How It Works
The model performs token-level classification: it assigns each token one of the BIO labels listed above, based on the token's surrounding context. Consecutive tokens with matching labels form entity spans, which can then be replaced with placeholders to anonymize the text automatically.
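To make the token-level labelling concrete, here is a minimal sketch of how BIO tags can be grouped into entity spans. It is not the model's own post-processing: the helper name `bio_to_spans` is hypothetical, the tags are hand-written rather than predicted, and whitespace-separated words stand in for the subword tokens the model actually sees.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end_exclusive, text) spans."""
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        prefix, _, entity = tag.partition("-")
        continues = prefix == "I" and entity == current_type
        if current_type is not None and not continues:
            # The open span ends just before this token.
            spans.append((current_type, start, i, " ".join(tokens[start:i])))
            current_type = None
        if prefix == "B" or (prefix == "I" and current_type is None):
            # "B-" always opens a span; a stray "I-" is treated leniently as an opener.
            current_type, start = entity, i
    return spans

tokens = ["O", "interessado", "João", "Silva", "submeteu", "o",
          "processo", "administrativo", "5597/2023"]
# Hand-written tags for illustration, not actual model predictions.
tags = ["O", "O", "B-PERSONAL-NAME", "I-PERSONAL-NAME", "O", "O",
        "O", "O", "B-PERSONAL-ADMIN"]
spans = bio_to_spans(tokens, tags)
# spans == [("PERSONAL-NAME", 2, 4, "João Silva"),
#           ("PERSONAL-ADMIN", 8, 9, "5597/2023")]
```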
Input:
O interessado João Silva submeteu o processo administrativo 5597/2023 no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.
Output:
O interessado <NAME> submeteu o processo administrativo <ADMIN> no dia <DATE>, relativo ao imóvel localizado na <ADDRESS>.
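The replacement step in this example can be sketched as follows, assuming entities arrive as dictionaries with `entity_group`, `start` and `end` character offsets, which is the shape a Transformers NER pipeline returns when an aggregation strategy is set. The `anonymize` helper, the assumption that labels carry a `PERSONAL-` prefix, and the hand-written spans are illustrative, not part of the released model.

```python
def anonymize(text, entities, prefix="PERSONAL-"):
    """Replace each detected span with a placeholder such as <NAME>."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip a span that overlaps one already replaced
        label = ent["entity_group"]
        if label.startswith(prefix):
            label = label[len(prefix):]  # PERSONAL-NAME -> NAME (assumed label scheme)
        pieces.append(text[cursor:ent["start"]])
        pieces.append(f"<{label}>")
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)

text = "O interessado João Silva submeteu o processo administrativo 5597/2023."
# Hand-written spans standing in for model output (not actual predictions).
name_start = text.index("João Silva")
admin_start = text.index("5597/2023")
entities = [
    {"entity_group": "PERSONAL-NAME", "start": name_start,
     "end": name_start + len("João Silva")},
    {"entity_group": "PERSONAL-ADMIN", "start": admin_start,
     "end": admin_start + len("5597/2023")},
]
print(anonymize(text, entities))
# O interessado <NAME> submeteu o processo administrativo <ADMIN>.
```

Working from character offsets rather than the `word` field avoids problems when the same string occurs more than once in the text.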
Results
Overall Performance (Test Set)
| Metric | Macro (%) | Micro (%) |
|---|---|---|
| F1 Score | 72.67 | 83.51 |
| Precision | 71.21 | 79.77 |
| Recall | 74.87 | 87.85 |
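The gap between the macro and micro columns follows from how the two averages are defined. The sketch below uses the standard definitions with hypothetical counts (not the model's actual confusion statistics) and illustrative helper names (`prf`, `macro_micro_f1`): macro averages the per-type scores equally, so a rare, poorly detected type pulls it down, while micro pools all counts and is dominated by frequent types.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_micro_f1(counts):
    """counts: {entity_type: (true_positives, false_positives, false_negatives)}"""
    per_type_f1 = [prf(*c)[2] for c in counts.values()]
    macro = sum(per_type_f1) / len(per_type_f1)  # every type counts equally
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)[2]  # pooled counts: frequent types dominate
    return macro, micro

# Hypothetical counts for one frequent and one rare entity type.
counts = {
    "PERSONAL-NAME": (90, 5, 5),
    "PERSONAL-FAMILY": (0, 1, 4),
}
macro, micro = macro_micro_f1(counts)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
# macro F1 = 0.474, micro F1 = 0.923
```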
Per-Entity Performance
| Entity Type | Precision (%) | Recall (%) | F1 Score (%) | Support |
|---|---|---|---|---|
| `PERSONAL-NAME` | 96.43 | 95.45 | 95.94 | 208 |
| `PERSONAL-ADMIN` | 85.96 | 90.00 | 87.93 | 169 |
| `PERSONAL-POSITION` | 60.65 | 80.34 | 69.12 | 130 |
| `PERSONAL-ADDRESS` | 69.57 | 77.42 | 73.28 | 62 |
| `PERSONAL-DATE` | 90.20 | 93.88 | 92.00 | 48 |
| `PERSONAL-LOCATION` | 71.43 | 64.52 | 67.80 | 31 |
| `PERSONAL-OTHER` | 34.78 | 44.44 | 39.02 | 18 |
| `PERSONAL-COMPANY` | 83.33 | 62.50 | 71.43 | 8 |
| `PERSONAL-TIME` | 75.00 | 100.00 | 85.71 | 6 |
| `PERSONAL-FAMILY` | 0.00 | 0.00 | 0.00 | 4 |
| `PERSONAL-DEGREE` | 100.00 | 100.00 | 100.00 | 2 |
| `PERSONAL-FACULTY` | 50.00 | 50.00 | 50.00 | 2 |
| `PERSONAL-INFO` | 100.00 | 100.00 | 100.00 | 1 |
| `PERSONAL-ARTISTIC` | 0.00 | 0.00 | 0.00 | 0 |
| `PERSONAL-LICENSE` | 0.00 | 0.00 | 0.00 | 0 |
| `PERSONAL-JOB` | 0.00 | 0.00 | 0.00 | 0 |
| `PERSONAL-VEHICLE` | 0.00 | 0.00 | 0.00 | 0 |
Usage
Quick Start
The simplest way to use the model:
```python
from transformers import pipeline

model_name = "liaad/CitiLink-XLMR-Anonymization-pt"

# aggregation_strategy="simple" merges subword tokens into whole entity spans.
nlp = pipeline("ner", model=model_name, tokenizer=model_name, aggregation_strategy="simple")

text = "A reunião foi presidida por Manuel Brito no concelho de Alandroal."
results = nlp(text)

# Each result holds the matched text, its entity group, and a confidence score.
for entity in results:
    print(f"Entity: {entity['word']} | Category: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
Limitations
- Domain specificity: Trained specifically on Portuguese municipal meeting minutes, so it performs best on administrative and governmental minutes and may not generalize well to other document types
- Sequence length: Limited to 512 tokens per window
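Minutes longer than one window have to be split before inference. One common approach, sketched below as an assumption rather than the authors' documented procedure, is to cut the token sequence into overlapping windows so that an entity falling on a boundary appears whole in at least one of them. The helper name `sliding_windows`, the default overlap of 128 tokens, and the budget of 510 content tokens (512 minus the `<s>` and `</s>` special tokens XLM-RoBERTa adds) are illustrative choices.

```python
def sliding_windows(token_ids, max_len=510, stride=128):
    """Split a token sequence into overlapping windows of at most max_len tokens.

    stride is the number of tokens shared by consecutive windows.
    Returns a list of (start_offset, window_tokens) pairs.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    windows = []
    start = 0
    while True:
        end = min(start + max_len, len(token_ids))
        windows.append((start, token_ids[start:end]))
        if end == len(token_ids):
            break
        start = end - stride  # step back so the next window overlaps this one
    return windows

# Small illustrative sizes so the overlap is easy to see.
demo = sliding_windows(list(range(10)), max_len=4, stride=1)
# demo == [(0, [0, 1, 2, 3]), (3, [3, 4, 5, 6]), (6, [6, 7, 8, 9])]
```

Predictions from the overlapping regions then need to be reconciled, for example by keeping the higher-scoring label. Transformers tokenizers can produce comparable windows directly via `return_overflowing_tokens=True` together with `stride`.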
Version: 1.0
Last Updated: 2026-06-25
License
This project uses a custom dual-license based on AGPL v3.
See the full license terms here: LICENSE
Model tree for liaad/CitiLink-XLMR-Anonymization-pt
Base model
FacebookAI/xlm-roberta-large