SDVM Multilingual NER: Original

An XLM-RoBERTa-base model fine-tuned for Named Entity Recognition on the original (unrefined) PAN-X.de dataset from the XTREME benchmark.

This model is part of a paired experiment by SDVM to demonstrate the impact of data quality on NER performance. Compare with SDVM/multilingual-ner-refined, which was trained on cleaned data.

Training Details

  • Base model: xlm-roberta-base
  • Dataset: SDVM/xtreme-PAN-X.de, tokens and ner_tags columns (original, uncleaned)
  • Training: 3 epochs, batch size 8, learning rate 2e-5, weight decay 0.01
  • Task: Token classification with IOB2 tags
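The settings listed above could be passed to the Hugging Face Trainer roughly as follows. This is a minimal sketch, not the actual training script: it assumes "batch size 8" means the per-device batch size, and the output directory name is hypothetical. All other arguments are left at the library defaults.

```python
# Hyperparameters as listed in Training Details; everything else is an assumption.
HYPERPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,  # assumes "batch size 8" is per device
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}


def build_training_args(output_dir="ner-original-out"):
    """Build TrainingArguments from the listed settings (output_dir is hypothetical)."""
    # Imported lazily so the dictionary above can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

Calling build_training_args() requires transformers and a PyTorch backend; the returned object would then be handed to a Trainer together with the model and the tokenized dataset.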

Labels

ID Tag
0 O
1 B-PER
2 I-PER
3 B-ORG
4 I-ORG
5 B-LOC
6 I-LOC
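
To illustrate how the IDs above map to IOB2 tags and how a tag sequence turns into entities, here is a small sketch in plain Python. The function name decode_entities and the example tag IDs are illustrative assumptions (hand-written, not actual model output).

```python
# ID-to-tag mapping copied from the Labels table.
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_entities(tokens, tag_ids):
    """Group word-level IOB2 tags into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = ID2LABEL[tag_id]
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Angela", "Merkel", "wurde", "in", "Hamburg", "geboren", "."]
tag_ids = [1, 2, 0, 0, 5, 0, 0]  # hand-written example tags
print(decode_entities(tokens, tag_ids))
# [('PER', 'Angela Merkel'), ('LOC', 'Hamburg')]
```

In this sketch, an I- tag that does not continue an open entity of the same type is dropped, which is one common convention for strict IOB2 decoding.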

Usage

from transformers import pipeline

ner = pipeline("token-classification", model="SDVM/multilingual-ner-original")
result = ner("Angela Merkel wurde in Hamburg geboren.")
print(result)
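
By default the pipeline returns one prediction per subword token. If whole entities are more useful, the pipeline's aggregation_strategy option can merge them, as sketched below. This assumes the model config carries the tag names from the Labels table; the helper names tag_sentence and to_pairs are hypothetical.

```python
def tag_sentence(text, model_id="SDVM/multilingual-ner-original"):
    """Run the model and return entities merged across subword tokens."""
    # Imported lazily; calling this downloads the model on first use.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge B-/I- pieces into one entity
    )
    return ner(text)


def to_pairs(entities):
    """Reduce aggregated pipeline output to (entity_group, word) pairs."""
    return [(entity["entity_group"], entity["word"]) for entity in entities]
```

For example, to_pairs(tag_sentence("Angela Merkel wurde in Hamburg geboren.")) would yield a list of (label, text) pairs; the exact predictions depend on the model.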

Context

This model was trained on the original PAN-X.de data, in which roughly 8.5% of tokens are Wikipedia markup noise (bold markers, quote marks, redirect tags, etc.). These artifacts can confuse the model during both training and inference.

For a cleaner alternative, see SDVM/multilingual-ner-refined.
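
To get a rough feel for the markup noise described above, one could count tokens that look like wiki markup. The sketch below is a heuristic under stated assumptions: the pattern (runs of apostrophes, bare double quotes, redirect keywords) is illustrative and is not the procedure used to arrive at the ~8.5% figure.

```python
import re

# Hypothetical noise pattern: ''/''' markers, bare double quotes, redirect keywords.
NOISE_PATTERN = re.compile(r"^(?:'{2,}|\"+|#REDIRECT|#WEITERLEITUNG)$", re.IGNORECASE)


def noise_fraction(tokens):
    """Return the share of tokens that match the noise pattern."""
    if not tokens:
        return 0.0
    noisy = sum(1 for token in tokens if NOISE_PATTERN.match(token))
    return noisy / len(tokens)


sample = ["'''", "Hamburg", "'''", "ist", "eine", "Stadt", "."]
print(f"{noise_fraction(sample):.1%}")  # 2 of 7 tokens match: 28.6%
```

Applied to the tokens column of each dataset split, a count like this gives a quick comparison between the original and refined data.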

Model size: 0.3B params (Safetensors, tensor type F32)