Arabic Flair + fastText Part-of-Speech Tagging Model (Egyptian and Levant)

A pretrained part-of-speech tagging model built on a joint corpus of Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, including code-switching between Egyptian Arabic and English. The model is trained using Flair (forward + backward) and fastText embeddings.

Pretraining Corpora:

This sequence labeling model was pretrained on three corpora jointly:

4 Dialects: a collection of dialectal Arabic datasets covering four dialects of Arabic: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). Each dataset consists of 350 manually segmented and PoS-tagged tweets.
UD South Levantine Arabic MADAR: a dataset of 100 manually annotated sentences taken from the MADAR (Multi-Arabic Dialect Applications and Resources) project, by Shorouq Zahra.
Parts of the Cairo Students Code-Switch (CSCS) corpus developed for "Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus" by Hamed et al.

Usage

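from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pretrained tagger by its model identifier
tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")

# Wrap the raw text in a Sentence and predict the tags in place
sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')
tagger.predict(sentence)

# Print each tagged span together with its predicted PoS label
for entity in sentence.get_spans('pos'):
    print(entity)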

Because the right-to-left Arabic text is embedded in a left-to-right context, some formatting errors might occur, and your code might appear like this (link accessed on 2020-10-27).
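One possible way to reduce such display glitches when printing a token next to its tag is to wrap the Arabic text in Unicode bidirectional isolate characters (U+2067 RIGHT-TO-LEFT ISOLATE and U+2069 POP DIRECTIONAL ISOLATE). The sketch below is only an illustration: the helper names `isolate_rtl`, `strip_isolates` and `format_row` are hypothetical, the result depends on how well the terminal or editor supports bidirectional text, and the wrapped string is meant for display only, not as input to the tagger.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_rtl(text: str) -> str:
    """Wrap text in isolate marks so a bidi-aware renderer treats it as one RTL run."""
    return f"{RLI}{text}{PDI}"


def strip_isolates(text: str) -> str:
    """Remove the isolate marks again, e.g. before reusing the text as model input."""
    return text.replace(RLI, "").replace(PDI, "")


def format_row(token: str, tag: str) -> str:
    """Build a 'token<TAB>tag' display line with the token isolated from the tag."""
    return f"{isolate_rtl(token)}\t{tag}"
```

For example, `print(format_row(token_text, tag))` could replace a plain `print` of the two values, where `token_text` and `tag` are whatever strings you extracted from the tagged sentence.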

Scores & Tagset

tag	precision	recall	f1-score	support
INTJ	0.8182	0.9000	0.8571	10
NOUN	0.9009	0.9402	0.9201	435
NUM	0.9524	0.8333	0.8889	24
ADJ	0.8762	0.7603	0.8142	121
ADP	0.9903	0.9623	0.9761	106
CCONJ	0.9600	0.9730	0.9664	74
PROPN	0.9333	0.9333	0.9333	15
ADV	0.9135	0.8051	0.8559	118
VERB	0.8852	0.9231	0.9038	117
PRON	0.9620	0.9465	0.9542	187
SCONJ	0.8571	0.9474	0.9000	19
PART	0.9350	0.9791	0.9565	191
DET	0.9348	0.9149	0.9247	47
PUNCT	1.0000	1.0000	1.0000	35
AUX	0.9286	0.9811	0.9541	53
MENTION	0.9231	1.0000	0.9600	12
V	0.8571	0.8780	0.8675	82
FUT-PART+V+PREP+PRON	1.0000	0.0000	0.0000	1
PROG-PART+V+PRON+PREP+PRON	0.0000	1.0000	0.0000	0
ADJ+NSUFF	0.6111	0.8462	0.7097	26
NOUN+NSUFF	0.8182	0.8438	0.8308	64
PREP+PRON	0.9565	0.9565	0.9565	23
PUNC	0.9941	1.0000	0.9971	169
EOS	1.0000	1.0000	1.0000	70
NOUN+PRON	0.6986	0.8500	0.7669	60
V+PRON	0.7258	0.8036	0.7627	56
PART+PRON	1.0000	0.9474	0.9730	19
PROG-PART+V	0.8333	0.9302	0.8791	43
DET+NOUN	0.9625	1.0000	0.9809	77
NOUN+NSUFF+PRON	0.9091	0.7143	0.8000	14
PROG-PART+V+PRON	0.7083	0.9444	0.8095	18
PREP+NOUN+NSUFF	0.6667	0.4000	0.5000	5
NOUN+NSUFF+NSUFF	1.0000	0.0000	0.0000	3
CONJ	0.9722	1.0000	0.9859	35
V+PRON+PRON	0.6364	0.5833	0.6087	12
FOREIGN	0.6667	0.6667	0.6667	3
PREP+NOUN	0.6316	0.7500	0.6857	16
DET+NOUN+NSUFF	0.9000	0.9310	0.9153	29
DET+ADJ+NSUFF	1.0000	0.5714	0.7273	7
CONJ+PRON	1.0000	0.8750	0.9333	8
NOUN+CASE	0.0000	0.0000	0.0000	2
DET+ADJ	1.0000	0.6667	0.8000	6
PREP	1.0000	0.9718	0.9857	71
CONJ+FUT-PART+V	0.0000	0.0000	0.0000	1
CONJ+V	0.6667	0.7500	0.7059	8
FUT-PART	1.0000	1.0000	1.0000	2
ADJ+PRON	1.0000	0.0000	0.0000	8
CONJ+PREP+NOUN+PRON	1.0000	0.0000	0.0000	1
CONJ+NOUN+PRON	0.3750	1.0000	0.5455	3
PART+ADJ	1.0000	0.0000	0.0000	1
PART+NOUN	0.5000	1.0000	0.6667	1
CONJ+PREP+NOUN	1.0000	0.0000	0.0000	1
CONJ+NOUN	0.7000	0.7778	0.7368	9
URL	1.0000	1.0000	1.0000	3
CONJ+FUT-PART	1.0000	0.0000	0.0000	1
FUT-PART+V	0.8571	0.6000	0.7059	10
PREP+NOUN+NSUFF+NSUFF	1.0000	0.0000	0.0000	1
HASH	1.0000	0.9412	0.9697	17
ADJ+PREP+PRON	1.0000	0.0000	0.0000	3
PREP+NOUN+PRON	0.0000	0.0000	0.0000	1
EMOT	1.0000	0.8889	0.9412	18
CONJ+PREP	1.0000	0.7500	0.8571	4
PREP+DET+NOUN+NSUFF	1.0000	0.7500	0.8571	4
PRON+DET+NOUN+NSUFF	0.0000	1.0000	0.0000	0
V+PREP+PRON	1.0000	0.0000	0.0000	5
V+PRON+PREP+PRON	0.0000	1.0000	0.0000	0
CONJ+NOUN+NSUFF	0.5000	0.5000	0.5000	2
V+NEG-PART	1.0000	0.0000	0.0000	2
PREP+DET+NOUN	0.9091	1.0000	0.9524	10
PREP+V	1.0000	0.0000	0.0000	2
CONJ+PART	1.0000	0.7778	0.8750	9
CONJ+V+PRON	1.0000	1.0000	1.0000	5
PROG-PART+V+PREP+PRON	1.0000	0.5000	0.6667	2
PREP+NOUN+NSUFF+PRON	1.0000	1.0000	1.0000	1
ADJ+CASE	1.0000	0.0000	0.0000	1
PART+NOUN+PRON	1.0000	1.0000	1.0000	1
PART+V	1.0000	0.0000	0.0000	3
PART+V+PRON	0.0000	1.0000	0.0000	0
FUT-PART+V+PRON	0.0000	1.0000	0.0000	0
FUT-PART+V+PRON+PRON	1.0000	0.0000	0.0000	1
CONJ+PREP+PRON	1.0000	0.0000	0.0000	1
CONJ+V+PRON+PREP+PRON	1.0000	0.0000	0.0000	1
CONJ+V+PREP+PRON	0.0000	1.0000	0.0000	0
CONJ+DET+NOUN+NSUFF	1.0000	0.0000	0.0000	1
CONJ+DET+NOUN	0.6667	1.0000	0.8000	2
CONJ+PREP+DET+NOUN	1.0000	1.0000	1.0000	1
PREP+PART	1.0000	0.0000	0.0000	2
PART+V+PRON+NEG-PART	0.3333	0.3333	0.3333	3
PART+V+NEG-PART	0.3333	0.5000	0.4000	2
PART+PREP+NEG-PART	1.0000	1.0000	1.0000	3
PART+PROG-PART+V+NEG-PART	1.0000	0.3333	0.5000	3
PREP+DET+NOUN+NSUFF+PREP+PRON	1.0000	0.0000	0.0000	1
PREP+PRON+DET+NOUN	0.0000	1.0000	0.0000	0
PART+NSUFF	1.0000	0.0000	0.0000	1
CONJ+PROG-PART+V+PRON	1.0000	1.0000	1.0000	1
PART+PREP+PRON	1.0000	0.0000	0.0000	1
CONJ+PART+PREP	1.0000	0.0000	0.0000	1
NUM+NSUFF	0.6667	0.6667	0.6667	3
CONJ+PART+V+PRON+NEG-PART	1.0000	1.0000	1.0000	1
PART+NOUN+NEG-PART	1.0000	1.0000	1.0000	1
CONJ+ADJ+NSUFF	1.0000	0.0000	0.0000	1
PREP+ADJ	1.0000	0.0000	0.0000	1
ADJ+NSUFF+PRON	1.0000	0.0000	0.0000	2
CONJ+PROG-PART+V	1.0000	0.0000	0.0000	1
CONJ+PART+PROG-PART+V+PREP+PRON+NEG-PART	1.0000	0.0000	0.0000	1
CONJ+PART+PREP+PRON+NEG-PART	0.0000	1.0000	0.0000	0
PREP+PART+PRON	1.0000	0.0000	0.0000	1
CONJ+ADV+NSUFF	1.0000	0.0000	0.0000	1
CONJ+ADV	0.0000	1.0000	0.0000	0
PART+NOUN+PRON+NEG-PART	0.0000	1.0000	0.0000	0
CONJ+ADJ	1.0000	1.0000	1.0000	1

F-score (micro): 0.8974
F-score (macro): 0.5188
Accuracy (incl. no class): 0.901
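
The gap between the micro and macro F-scores is consistent with the many rare compound tags in the table that score 0: a macro average gives every tag the same weight regardless of how often it occurs. The sketch below illustrates this on four rows copied from the table. It is not a reproduction of the reported figures, the function names are hypothetical, and the support-weighted mean is used only as a simple frequency-aware contrast, not as the exact micro F-score.

```python
# (tag, f1-score, support) rows copied from the score table
rows = [
    ("PUNCT", 1.0000, 35),
    ("ADP", 0.9761, 106),
    ("NOUN+CASE", 0.0000, 2),
    ("ADJ+PRON", 0.0000, 8),
]


def macro_f1(rows):
    """Unweighted mean of the per-tag F1 scores: every tag counts equally."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def support_weighted_f1(rows):
    """Mean of the per-tag F1 scores weighted by how often each tag occurs."""
    total_support = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total_support


print(f"macro F1 on the sample:            {macro_f1(rows):.4f}")
print(f"support-weighted F1 on the sample: {support_weighted_f1(rows):.4f}")
```

On this sample the two rare tags with an F1 of 0 halve the macro average, while they barely move the support-weighted value.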

Class scores for each tag are listed in the table above. Note that compound tags (a single tag covering several agglutinated parts of speech) are counted as separate classes.
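
If the individual parts of a compound tag are needed downstream, they can be recovered by splitting on the separator. This is a minimal sketch, assuming that `+` is always the separator between components and that hyphenated names such as `FUT-PART` are single components, as the table suggests; the helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def is_compound(tag: str) -> bool:
    """A tag is treated as compound if it joins several components with '+'."""
    return "+" in tag


def split_compound(tag: str) -> List[str]:
    """Split a compound tag into its components; simple tags yield a one-item list."""
    return tag.split("+")


def component_counts(tags: Iterable[str]) -> Counter:
    """Count how often each component occurs across a sequence of predicted tags."""
    counts = Counter()
    for tag in tags:
        counts.update(split_compound(tag))
    return counts


print(split_compound("CONJ+PREP+NOUN+PRON"))
print(component_counts(["NOUN", "DET+NOUN", "FUT-PART+V"]))
```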

Citation

If you use this model, please consider citing this work:

@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
year = {2021},
doi = "10.13140/RG.2.2.34961.10084",
url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}