XLM-R — Moroccan Darija NER (Arabizi / Latin script)

A named-entity recogniser for Moroccan Darija written in Arabizi (Latin script with digits standing in for Arabic letters, e.g. 3, 7, 9), covering four entity types (PER, LOC, ORG, MISC). It is XLM-RoBERTa fine-tuned on transliterated Darija NER data: DarNERcorp gold plus LLM-teacher silver, transliterated Arabic→Arabizi token by token so that entity spans transfer unchanged.
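Why token-by-token transliteration lets spans transfer can be shown with a minimal sketch. It assumes BIO-tagged token lists; the character map `TOY_MAP` and the helpers `toy_transliterate` and `transfer_labels` are hypothetical illustrations, not the transliterator actually used to build the training data.

```python
from typing import Callable, List, Tuple

# Toy Arabic -> Arabizi character map, for illustration only (hypothetical).
TOY_MAP = {"ع": "3", "ل": "l", "ي": "i", "ف": "f", "ا": "a", "س": "s"}


def toy_transliterate(token: str) -> str:
    """Map each character through TOY_MAP, leaving unknown characters as-is."""
    return "".join(TOY_MAP.get(ch, ch) for ch in token)


def transfer_labels(
    tokens: List[str],
    labels: List[str],
    transliterate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Transliterate one token at a time and keep each token's BIO label.

    Because the token count and order never change, every entity span
    keeps the same token offsets after transliteration.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [(transliterate(tok), lab) for tok, lab in zip(tokens, labels)]


# Arabic-script input: "Ali in Fes", with PER and LOC entities.
arabic_tokens = ["علي", "ف", "فاس"]
bio_labels = ["B-PER", "O", "B-LOC"]

arabizi = transfer_labels(arabic_tokens, bio_labels, toy_transliterate)
print(arabizi)  # [('3li', 'B-PER'), ('f', 'O'), ('fas', 'B-LOC')]
```

A sentence-level transliterator that merges or splits tokens would break this one-to-one alignment, which is presumably why a token-level mapping was chosen.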

Results — native Arabizi gold (span-level micro-F1)

| Matching | F1 | 95% CI |
|---|---|---|
| exact | 0.512 | [0.461, 0.564] |
| relaxed (IoU ≥ 0.5) | 0.570 | [0.520, 0.619] |

Evaluated on a native-validated Arabizi gold test set (400 sentences, 252 entities).
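The two matching regimes in the results can be illustrated with a short sketch of span-level micro-F1. This is not the evaluation script behind the reported numbers. It assumes spans are `(start, end_exclusive, type)` token offsets, that a relaxed match requires the same entity type, and that matching is greedy and one-to-one; the names `iou`, `count_matches` and `micro_f1` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two token spans (types ignored here)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def count_matches(gold: List[Span], pred: List[Span], min_iou: float) -> int:
    """Greedy one-to-one matching: same type and IoU >= min_iou."""
    used = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i in used or g[2] != p[2]:
                continue
            if iou(p, g) >= min_iou:
                used.add(i)
                tp += 1
                break
    return tp


def micro_f1(gold_sents: List[List[Span]], pred_sents: List[List[Span]],
             min_iou: float = 1.0) -> float:
    """Pool counts over all sentences, then compute F1 once (micro-average).

    min_iou=1.0 gives exact matching; min_iou=0.5 gives the relaxed variant.
    """
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        tp += count_matches(gold, pred, min_iou)
        n_gold += len(gold)
        n_pred += len(pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [[(0, 2, "PER"), (4, 5, "LOC")], [(1, 2, "ORG")]]
pred = [[(0, 1, "PER"), (4, 5, "LOC")], [(1, 2, "MISC")]]

exact_f1 = micro_f1(gold, pred, min_iou=1.0)    # only the LOC span matches
relaxed_f1 = micro_f1(gold, pred, min_iou=0.5)  # the truncated PER span also counts
```

In this toy data the truncated PER prediction has IoU 0.5 with its gold span, so it counts only under relaxed matching, while the ORG/MISC type confusion is an error under both.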

Honest note. An Arabic-script Darija model is unusable on raw Arabizi: exact F1 ≈ 0.02, because of the script gap. A transliterate-then-NER baseline (transliterate the input to Arabic script, then run an Arabic-script model) reaches ≈ 0.50, so this dedicated Arabizi model (exact F1 0.512, 95% CI [0.461, 0.564]) matches transliteration but does not significantly beat it. For Arabizi Darija NER, transliteration is the key lever, and both approaches are bounded by transliteration quality, since no large natural-Arabizi NER training data exists.
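The transliterate-then-NER baseline described above can be sketched as a simple composition of two steps. The function `transliterate_then_ner` and both stubs (`toy_to_arabic`, `toy_arabic_ner`) are hypothetical stand-ins for a real Arabizi→Arabic transliterator and a real Arabic-script NER model. The sketch also assumes token-level transliteration, so labels map straight back onto the original Arabizi tokens.

```python
from typing import Callable, List, Tuple


def transliterate_then_ner(
    arabizi_tokens: List[str],
    to_arabic: Callable[[str], str],
    arabic_ner: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str]]:
    """Transliterate each token to Arabic script, tag, and map labels back."""
    arabic_tokens = [to_arabic(tok) for tok in arabizi_tokens]
    labels = arabic_ner(arabic_tokens)
    if len(labels) != len(arabizi_tokens):
        raise ValueError("NER must return exactly one label per token")
    return list(zip(arabizi_tokens, labels))


# --- Toy stand-ins (hypothetical), for illustration only ---
TOY_LEXICON = {"3li": "علي", "f": "ف", "fas": "فاس"}
TOY_GAZETTEER = {"علي": "B-PER", "فاس": "B-LOC"}


def toy_to_arabic(token: str) -> str:
    """Dictionary lookup; unknown tokens pass through untransliterated."""
    return TOY_LEXICON.get(token, token)


def toy_arabic_ner(tokens: List[str]) -> List[str]:
    """Gazetteer tagger that only recognises Arabic-script entries."""
    return [TOY_GAZETTEER.get(tok, "O") for tok in tokens]


tagged = transliterate_then_ner(["3li", "f", "fas"], toy_to_arabic, toy_arabic_ner)

# An unknown token is not transliterated, so the Arabic-script tagger misses it.
missed = transliterate_then_ner(["Fes"], toy_to_arabic, toy_arabic_ner)
```

The second call illustrates why this approach is bounded by transliteration quality: a token that the transliterator gets wrong never reaches the tagger in a recognisable form.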

Training data

Transliterated DarNERcorp gold (CC BY-4.0) + LLM-teacher silver. Native Arabizi gold derived from DODa (MIT).

Licences / attribution

Base model XLM-RoBERTa (MIT). Cite DarNERcorp (CC BY-4.0), DODa (MIT), and (forthcoming) our paper.

Citation

Paper forthcoming (IANLP / ArabicNLP). Until then, cite this repository.

Safetensors · Model size: 0.3B params · Tensor type: F32

Model repository: AchrafSoltani/MoRoBERTa