XLM-R — Moroccan Darija NER (Arabizi / Latin script)
A named-entity recogniser for Moroccan Darija written in Arabizi (Latin script with digits such as 3, 7, 9 standing in for Arabic letters), covering four entity types: PER, LOC, ORG, MISC. The model is XLM-RoBERTa fine-tuned on transliterated Darija NER data: DarNERcorp gold plus LLM-teacher silver, transliterated from Arabic script to Arabizi token by token so that entity spans transfer unchanged.
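Token-by-token transliteration keeps the number of tokens fixed, which is why the BIO labels can be copied across without realignment. A minimal sketch of that idea follows. The character table `CHAR_MAP` and both function names are illustrative assumptions, not the mapping or code used to build the training data.

```python
from typing import Dict, List, Tuple

# Toy Arabic -> Arabizi character table (hypothetical, for illustration only;
# real Arabizi spelling is far less regular than a fixed table).
CHAR_MAP: Dict[str, str] = {
    "ع": "3", "ح": "7", "ق": "9",
    "م": "m", "ر": "r", "ا": "a", "ك": "k",
    "ش": "ch", "ل": "l",
}


def transliterate_token(token: str) -> str:
    """Map each character through CHAR_MAP, leaving unknown characters as-is."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in token)


def transliterate_sentence(
    tokens: List[str], labels: List[str]
) -> Tuple[List[str], List[str]]:
    """Transliterate one token at a time so labels stay aligned by position."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    # One output token per input token: the label sequence is reused unchanged.
    return [transliterate_token(t) for t in tokens], list(labels)


arabic_tokens = ["عمر", "مشا", "ل", "مراكش"]
bio_labels = ["B-PER", "O", "O", "B-LOC"]
arabizi_tokens, arabizi_labels = transliterate_sentence(arabic_tokens, bio_labels)
```

Because the token count never changes, any span defined by token indices in the Arabic-script data refers to the same positions in the Arabizi output.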
Results — native Arabizi gold (span-level micro-F1)
| matching | F1 | 95% CI |
|---|---|---|
| exact | 0.512 | [0.461, 0.564] |
| relaxed (IoU ≥ 0.5) | 0.570 | [0.520, 0.619] |
Evaluated on a native-validated Arabizi gold test set (400 sentences, 252 entities).
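The two matching criteria in the table can be illustrated with a short sketch. It assumes spans are `(start, end_exclusive, type)` token offsets, that the entity type must agree under both criteria, and that each gold span can be matched at most once (greedy matching). The actual evaluation script may differ on any of these points.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, entity type)


def iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two token ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def is_match(gold: Span, pred: Span, min_iou: Optional[float]) -> bool:
    """Exact match when min_iou is None, otherwise IoU >= min_iou."""
    if gold[2] != pred[2]:  # assumption: types must agree in both modes
        return False
    if min_iou is None:
        return gold[:2] == pred[:2]
    return iou(gold, pred) >= min_iou


def micro_f1(
    gold_sents: List[List[Span]],
    pred_sents: List[List[Span]],
    min_iou: Optional[float] = None,
) -> float:
    """Span-level micro-F1: counts are pooled over all sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        n_gold += len(gold)
        n_pred += len(pred)
        unmatched = list(gold)
        for p in pred:
            for g in unmatched:
                if is_match(g, p, min_iou):
                    unmatched.remove(g)  # each gold span matches at most once
                    tp += 1
                    break
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


gold = [[(0, 2, "PER"), (4, 5, "LOC")]]
pred = [[(0, 1, "PER"), (4, 5, "LOC")]]  # PER span is one token too short
exact_f1 = micro_f1(gold, pred)
relaxed_f1 = micro_f1(gold, pred, min_iou=0.5)
```

In this toy case the truncated PER span fails exact matching but passes the relaxed criterion (IoU = 0.5), so `exact_f1` is 0.5 and `relaxed_f1` is 1.0.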
Honest note. An Arabic-script Darija model is unusable on raw Arabizi (exact F1 ≈ 0.02) because of the script gap. A transliterate-then-NER baseline, which transliterates the input to Arabic script and then runs an Arabic-script model, reaches ≈ 0.50. This dedicated Arabizi model therefore matches transliteration but does not significantly beat it. For Arabizi Darija NER, transliteration is the key lever, and both approaches are bounded by transliteration quality, since no large natural-Arabizi NER training set exists.
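The transliterate-then-NER baseline described above can be sketched as a thin wrapper around two components. Everything here is hypothetical: `to_arabic` and `arabic_tagger` are placeholder callables, the toy lexicon and gazetteer stand in for a real transliterator and a real Arabic-script model, and the sketch assumes transliteration is one token in, one token out so labels can be mapped back by position.

```python
from typing import Callable, Dict, List, Tuple


def transliterate_then_ner(
    tokens: List[str],
    to_arabic: Callable[[str], str],
    arabic_tagger: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str]]:
    """Transliterate Arabizi tokens, tag in Arabic script, map labels back."""
    arabic_tokens = [to_arabic(t) for t in tokens]
    labels = arabic_tagger(arabic_tokens)
    if len(labels) != len(tokens):
        raise ValueError("tagger must return one label per input token")
    # Positions are preserved, so label i belongs to original token i.
    return list(zip(tokens, labels))


# Toy stand-ins (assumptions for illustration, not real components).
TOY_LEXICON: Dict[str, str] = {"mcha": "مشا", "l": "ل", "mrakch": "مراكش"}
TOY_LOCATIONS = {"مراكش"}


def toy_to_arabic(token: str) -> str:
    # Unknown tokens pass through untransliterated.
    return TOY_LEXICON.get(token, token)


def toy_arabic_tagger(arabic_tokens: List[str]) -> List[str]:
    # Gazetteer lookup in place of an Arabic-script NER model.
    return ["B-LOC" if t in TOY_LOCATIONS else "O" for t in arabic_tokens]


tagged = transliterate_then_ner(
    ["mcha", "l", "mrakch"], toy_to_arabic, toy_arabic_tagger
)
```

The wrapper makes the dependency visible: if `to_arabic` produces a wrong form, the tagger never sees the entity, which is one way transliteration quality bounds the final score.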
Training data
Transliterated DarNERcorp gold (CC BY 4.0) plus LLM-teacher silver. The native Arabizi gold set is derived from DODa (MIT).
Licences / attribution
The base model is XLM-RoBERTa (MIT). Please cite DarNERcorp (CC BY 4.0), DODa (MIT), and our paper once it is published.
Citation
Paper forthcoming (IANLP / ArabicNLP). Until then, cite this repository.
Model tree for AchrafSoltani/MoRoBERTa
Base model
FacebookAI/xlm-roberta-base