SDVM Multilingual NER — Original

An XLM-RoBERTa-base model fine-tuned for Named Entity Recognition on the original (unrefined) PAN-X.de dataset from the XTREME benchmark.

This model is part of a paired experiment by SDVM to demonstrate the impact of data quality on NER performance. Compare with SDVM/multilingual-ner-refined, which was trained on cleaned data.

Training Details

Base model: xlm-roberta-base
Dataset: SDVM/xtreme-PAN-X.de — tokens and ner_tags columns (original, uncleaned)
Training: 3 epochs, batch size 8, learning rate 2e-5, weight decay 0.01
Task: Token classification with IOB2 tags
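
The hyperparameters above can be collected into keyword arguments for `transformers.TrainingArguments`. This is a minimal sketch, not the training script used for this model: `output_dir` is a hypothetical path, and mapping "batch size 8" to `per_device_train_batch_size` is an assumption, since the card does not say whether the batch size is per device or global.

```python
# Hyperparameters from the Training Details list, keyed by the matching
# transformers.TrainingArguments parameter names.
training_kwargs = {
    "output_dir": "multilingual-ner-original",  # hypothetical output path
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,  # assumption: batch size is per device
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}

# With transformers and torch installed, these could be passed on as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```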

Labels

ID	Tag
0	O
1	B-PER
2	I-PER
3	B-ORG
4	I-ORG
5	B-LOC
6	I-LOC
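
The label table can be written as lookup dictionaries and used to turn a sequence of tag IDs back into entity spans. The sketch below assumes well-formed IOB2 input and treats an `I-` tag that does not continue an open span as outside. The function name `decode_entities` and the hand-written tag IDs are illustrative, not model output.

```python
# Label table from this card, as id <-> tag lookups.
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG",
            4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}
label2id = {tag: idx for idx, tag in id2label.items()}


def decode_entities(tokens, tag_ids):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = id2label[tag_id]
        if tag.startswith("B-"):
            # A B- tag always opens a new span; close any open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span: close and reset.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hand-labeled example, not a prediction from the model.
tokens = ["Angela", "Merkel", "wurde", "in", "Hamburg", "geboren", "."]
tag_ids = [1, 2, 0, 0, 5, 0, 0]
print(decode_entities(tokens, tag_ids))
# → [('PER', 'Angela Merkel'), ('LOC', 'Hamburg')]
```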

Usage

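from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
ner = pipeline("token-classification", model="SDVM/multilingual-ner-original")

# Each result is a dict with the predicted tag, its score, and character offsets.
result = ner("Angela Merkel wurde in Hamburg geboren.")
print(result)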

Context

This model was trained on the original PAN-X.de data, in which roughly 8.5% of tokens are Wikipedia markup noise (bold markers, quotation marks, redirect tags, etc.). These artifacts can confuse the model during both training and inference.
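
To get a rough feel for this kind of noise, one could count and strip markup tokens from rows in the `tokens` / `ner_tags` layout. The sketch below is illustrative only. `NOISE_TOKENS` is a hypothetical set, since this card does not list which tokens were counted as noise, and the example row is invented rather than taken from the dataset.

```python
# Hypothetical set of Wikipedia markup artifacts; the card does not specify
# the exact tokens that were counted as noise.
NOISE_TOKENS = {"'''", "''", "**", '"', "#REDIRECT", "#WEITERLEITUNG"}


def noise_fraction(rows):
    """Share of tokens across all rows that appear in NOISE_TOKENS."""
    total = sum(len(row["tokens"]) for row in rows)
    noisy = sum(tok in NOISE_TOKENS for row in rows for tok in row["tokens"])
    return noisy / total if total else 0.0


def drop_noise(row):
    """Remove noise tokens and their tags, keeping the two lists aligned."""
    kept = [(tok, tag) for tok, tag in zip(row["tokens"], row["ner_tags"])
            if tok not in NOISE_TOKENS]
    return {"tokens": [tok for tok, _ in kept],
            "ner_tags": [tag for _, tag in kept]}


# Invented example row, not taken from the dataset.
row = {"tokens": ["'''", "Hamburg", "'''", "ist", "eine", "Stadt", "."],
       "ner_tags": [0, 5, 0, 0, 0, 0, 0]}
print(round(noise_fraction([row]), 3))
print(drop_noise(row))
```

This sketch does not repair tag sequences: if a dropped token carried a `B-` tag, the following `I-` tag would need to be promoted to `B-` to keep the IOB2 sequence valid.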

For a cleaner alternative, see SDVM/multilingual-ner-refined.

Reference

Based on Chapter 4 of Natural Language Processing with Transformers
Part of the SDVM data quality demonstration series

Safetensors

Model size: 0.3B params
Tensor type: F32

Evaluation results

F1 on PAN-X.de (Original) test set, self-reported: 0.880