XLM-R — Moroccan Darija NER (Arabizi / Latin script)

A named-entity recogniser for Moroccan Darija written in Arabizi (Latin script with digits such as 3, 7 and 9 standing in for Arabic letters), covering four entity types (PER, LOC, ORG, MISC). It is XLM-RoBERTa fine-tuned on transliterated Darija NER data: DarNERcorp gold plus LLM-teacher silver, both transliterated from Arabic script to Arabizi token by token so that entity spans transfer unchanged.
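Token-by-token transliteration keeps the token count fixed, so each BIO label can be copied to the transliterated token at the same position. A minimal sketch of that idea follows. The lexicon, `toy_transliterate` and `project_labels` are hypothetical names made up for illustration; the actual transliterator used to build the training data is not described here.

```python
from typing import Callable, List, Tuple

# Hypothetical toy lexicon (assumption): a real pipeline would use a proper
# Arabic -> Arabizi transliterator rather than a lookup table.
TOY_LEXICON = {
    "محمد": "mohamed",
    "مشى": "mcha",
    "ل": "l",
    "الرباط": "rbat",
}


def toy_transliterate(token: str) -> str:
    """Map one Arabic-script token to Arabizi; unknown tokens pass through."""
    return TOY_LEXICON.get(token, token)


def project_labels(
    tokens: List[str],
    labels: List[str],
    transliterate: Callable[[str], str] = toy_transliterate,
) -> List[Tuple[str, str]]:
    """Transliterate each token on its own and keep its label.

    Because the mapping is one token in, one token out, label i of the
    source sentence is still the label of token i in the output.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [(transliterate(tok), lab) for tok, lab in zip(tokens, labels)]


example = project_labels(
    ["محمد", "مشى", "ل", "الرباط"],
    ["B-PER", "O", "O", "B-LOC"],
)
print(example)
```

The one-to-one constraint is what makes the projection trivial; a transliterator that merged or split tokens would need an explicit alignment step instead.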

Results — native Arabizi gold (span-level micro-F1)

matching	F1	95% CI
exact	0.512	[0.461, 0.564]
relaxed (IoU ≥ 0.5)	0.570	[0.520, 0.619]

Evaluated on a native-validated Arabizi gold test set (400 sentences, 252 entities).
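The difference between exact and relaxed matching can be illustrated with a short sketch of span-level micro-F1. It is not the evaluation script behind the numbers above. It assumes spans are token-index ranges, that a relaxed match also requires the same entity type, and that each gold span is matched at most once (greedy matching); all function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, entity type)


def iou(a: Span, b: Span) -> float:
    """Intersection over union of two token ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def count_matches(gold: List[Span], pred: List[Span], min_iou: float) -> int:
    """Greedy one-to-one matching of predicted spans to gold spans."""
    used = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i in used or p[2] != g[2]:
                continue  # gold span already matched, or type differs
            if iou(p, g) >= min_iou:
                used.add(i)
                tp += 1
                break
    return tp


def micro_f1(
    gold_sents: List[List[Span]],
    pred_sents: List[List[Span]],
    min_iou: float = 1.0,
) -> float:
    """Micro-averaged F1: counts are pooled over all sentences first.

    min_iou=1.0 gives exact matching, min_iou=0.5 the relaxed variant.
    """
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        tp += count_matches(gold, pred, min_iou)
        n_gold += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One sentence: the PER span is predicted one token too short, LOC is exact.
gold = [[(0, 2, "PER"), (4, 5, "LOC")]]
pred = [[(0, 1, "PER"), (4, 5, "LOC")]]
print(micro_f1(gold, pred, min_iou=1.0))  # 0.5
print(micro_f1(gold, pred, min_iou=0.5))  # 1.0
```

In the toy example the truncated PER span counts as an error under exact matching but as a hit under the relaxed criterion, which is why the relaxed score is the higher of the two.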

Honest note. An Arabic-script Darija model is unusable on raw Arabizi: exact F1 ≈ 0.02, because of the script gap. A transliterate-then-NER baseline (transliterate the input to Arabic script, then run an Arabic-script model) reaches ≈ 0.50, so this dedicated Arabizi model (0.512 exact) matches transliteration but does not significantly beat it. For Arabizi Darija NER, transliteration is the key lever: both approaches are bounded by transliteration quality, since no large natural-Arabizi NER training set exists.
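The transliterate-then-NER baseline can be sketched as a three-step pipeline: map each Arabizi token to Arabic script, tag the Arabic tokens, then read the predicted spans back off the original Arabizi tokens by position. The transliterator and tagger below are toy stand-ins (assumptions) so that the sketch runs on its own; in practice they would be a real Arabizi-to-Arabic transliterator and an Arabic-script Darija NER model.

```python
from typing import Callable, List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Decode BIO labels into (start, end exclusive, type) spans."""
    spans = []
    start, kind = None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a final span
        # Close the open span unless the current label continues it.
        if start is not None and lab != f"I-{kind}":
            spans.append((start, i, kind))
            start, kind = None, None
        if lab.startswith("B-"):
            start, kind = i, lab[2:]
    return spans


def transliterate_then_tag(
    arabizi_tokens: List[str],
    to_arabic: Callable[[str], str],
    arabic_tagger: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str]]:
    """Run an Arabic-script tagger on Arabizi input via transliteration."""
    arabic_tokens = [to_arabic(tok) for tok in arabizi_tokens]
    labels = arabic_tagger(arabic_tokens)
    if len(labels) != len(arabizi_tokens):
        raise ValueError("tagger must return one label per token")
    # Positions are shared, so spans index the original Arabizi tokens.
    return [
        (" ".join(arabizi_tokens[s:e]), kind)
        for s, e, kind in bio_to_spans(labels)
    ]


# Toy stand-ins (assumptions), only to make the sketch executable.
TOY_TO_ARABIC = {"mohamed": "محمد", "mcha": "مشى", "l": "ل", "rbat": "الرباط"}
TOY_GAZETTEER = {"محمد": "B-PER", "الرباط": "B-LOC"}


def toy_to_arabic(token: str) -> str:
    return TOY_TO_ARABIC.get(token, token)


def toy_arabic_tagger(tokens: List[str]) -> List[str]:
    return [TOY_GAZETTEER.get(tok, "O") for tok in tokens]


entities = transliterate_then_tag(
    ["mohamed", "mcha", "l", "rbat"], toy_to_arabic, toy_arabic_tagger
)
print(entities)
```

Any token the transliterator gets wrong reaches the tagger as noise, which is one way to see why transliteration quality bounds this baseline.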

Training data

Transliterated DarNERcorp gold (CC BY-4.0) + LLM-teacher silver. Native Arabizi gold derived from DODa (MIT).

Licences / attribution

Base model: XLM-RoBERTa (MIT). Please cite DarNERcorp (CC BY-4.0), DODa (MIT), and our forthcoming paper.

Citation

Paper forthcoming (IANLP / ArabicNLP). Until then, cite this repository.

Model details

Safetensors, 0.3B params, tensor type F32. Base model: FacebookAI/xlm-roberta-base; this model (AchrafSoltani/MoRoBERTa) is a fine-tune of it.