---
language: 
- ar
- en
license: apache-2.0
datasets:
- 4Dialects
- MADAR
- CSCS
thumbnail: https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- token-classification
- sequence-tagger-model
- Dialectal Arabic
- Code-Switching
- Code-Mixing
metrics:
- f1
widget:
- text: "طلعوا جماعة الممانعة بالسياسة ما بيعرفوا ولا بالصحة بيعرفوا ولا حتى بالدين"
- text: "أعلم أن هذا يبدو غير عادل ، لكن لا يمكن أن يكون هناك ظلم"
- text: "أنا عارف أن الموضوع ده شكله مش عادل ، بس لا يمكن أن يكون فيه ظلم"
---


# Arabic Flair + fastText Part-of-Speech Tagging Model (Egyptian and Levantine)
Pretrained part-of-speech tagging model built on a joint corpus written in the Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, including code-switching between Egyptian Arabic and English. The model is trained using [Flair](https://aclanthology.org/C18-1139/) (forward + backward) and [fastText](https://fasttext.cc) embeddings.



# Pretraining Corpora
This sequence labeling model was pretrained on three corpora jointly:
1. [4 Dialects](https://huggingface.co/datasets/viewer/?dataset=arabic_pos_dialect)
A dialectal Arabic dataset covering four dialects of Arabic: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). The subset for each dialect consists of 350 manually segmented and PoS-tagged tweets.
2. [UD South Levantine Arabic MADAR](https://universaldependencies.org/treebanks/ajp_madar/index.html)
A dataset of 100 manually annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project by [Shorouq Zahra](mailto:shorouqjzahra@gmail.com).
3. Parts of the Cairo Students Code-Switch (CSCS) corpus, developed for ["Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus"](https://aclanthology.org/L18-1601.pdf) by Hamed et al.

# Usage
```python
from flair.data import Sentence
from flair.models import SequenceTagger
  
tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")
sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')
# Predict PoS tags in place, then print every tagged span
tagger.predict(sentence)
for span in sentence.get_spans('pos'):
    print(span)
```

Because right-to-left text is embedded in a left-to-right context, some formatting errors might occur, and your code might appear like [this](https://ibb.co/ky20Lnq) (link accessed on 2020-10-27).
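
One possible way to reduce such display problems when printing Arabic tokens next to Latin tag names is to wrap each part in Unicode directional isolates (U+2066, U+2067, U+2069). The sketch below is a suggestion and not part of this model or of Flair: `format_tagged_token` is a hypothetical helper, the token/tag pairs are written by hand and are not model output, and whether the isolates are honored depends on the terminal or editor.

```python
# Directional isolate characters from the Unicode Bidirectional Algorithm.
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def format_tagged_token(token: str, tag: str) -> str:
    """Return 'token<TAB>tag', each part wrapped in its own directional isolate.

    Hypothetical helper for display purposes only.
    """
    return f"{RLI}{token}{PDI}\t{LRI}{tag}{PDI}"


# Hand-written illustrative pairs (assumption: NOT produced by the tagger).
pairs = [("في", "ADP"), ("الجامعة", "NOUN")]

# One token per line keeps right-to-left and left-to-right runs apart.
lines = [format_tagged_token(token, tag) for token, tag in pairs]
print("\n".join(lines))
```

Printing one token per line, rather than the whole tagged sentence on a single line, also keeps the reordering local to each line.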

<!--# Example

# Tagset-->

# Scores & Tagset
<details> 

| tag | precision | recall | f1-score | support |
|--|-----------|------|-------------|--------------|
|INTJ |    0.8182   | 0.9000    |0.8571    |    10|
|NOUN   |  0.9009   | 0.9402    |0.9201      | 435|
|NUM    | 0.9524   | 0.8333   | 0.8889       | 24|
|ADJ     |0.8762   | 0.7603  |  0.8142      | 121|
|ADP     |0.9903    |0.9623 |   0.9761       |106|
| CCONJ |    0.9600   | 0.9730 |   0.9664 |       74|
|PROPN |    0.9333   | 0.9333  |  0.9333  |      15|
| ADV  |   0.9135   | 0.8051  |  0.8559   |    118|
|VERB   |  0.8852    | 0.9231 |   0.9038   |    117|
|PRON    | 0.9620    | 0.9465 |   0.9542    |   187|
|SCONJ |    0.8571   | 0.9474  |  0.9000      |  19|
|PART  |   0.9350   | 0.9791   | 0.9565       | 191|
| DET   |  0.9348    | 0.9149  |  0.9247 |       47|
|PUNCT    | 1.0000    | 1.0000  |  1.0000  |      35|
| AUX  |   0.9286    | 0.9811  |  0.9541   |     53|
|MENTION   |  0.9231   |  1.0000  |  0.9600    |    12|
|     V    | 0.8571   | 0.8780    | 0.8675     |   82|
| FUT-PART+V+PREP+PRON     |1.0000   | 0.0000   | 0.0000       |  1|
|  PROG-PART+V+PRON+PREP+PRON |     0.0000  |  1.0000  |  0.0000       |  0|
|ADJ+NSUFF |    0.6111   | 0.8462   | 0.7097 |       26|
|NOUN+NSUFF  |   0.8182   | 0.8438   | 0.8308  |      64|
|PREP+PRON   |  0.9565   | 0.9565   | 0.9565   |     23|
|                   PUNC    | 0.9941   | 1.0000   | 0.9971    |   169|
|                    EOS     |1.0000   | 1.0000   | 1.0000    |   70|
|             NOUN+PRON   |  0.6986   | 0.8500   | 0.7669      |  60|
|                V+PRON    | 0.7258   | 0.8036   | 0.7627       | 56|
|            PART+PRON    | 1.0000   | 0.9474   | 0.9730    |    19|
|          PROG-PART+V    | 0.8333   | 0.9302   | 0.8791 |       43|
|            DET+NOUN    | 0.9625   | 1.0000   | 0.9809  |      77|
|     NOUN+NSUFF+PRON    | 0.9091   | 0.7143   | 0.8000   |     14|
|     PROG-PART+V+PRON    | 0.7083   | 0.9444   | 0.8095    |    18|
|      PREP+NOUN+NSUFF    | 0.6667   | 0.4000   | 0.5000 |        5|
|     NOUN+NSUFF+NSUFF    | 1.0000   | 0.0000   | 0.0000 |        3|
|                CONJ    | 0.9722   | 1.0000   | 0.9859  |      35|
|        V+PRON+PRON    | 0.6364   | 0.5833   | 0.6087   |     12|
|           FOREIGN    | 0.6667   | 0.6667   | 0.6667    |     3|
|        PREP+NOUN    | 0.6316   | 0.7500  |  0.6857 |       16|
|  DET+NOUN+NSUFF    | 0.9000   | 0.9310  |  0.9153  |      29|
|  DET+ADJ+NSUFF    | 1.0000   | 0.5714  |  0.7273   |      7|
|     CONJ+PRON    | 1.0000   | 0.8750  |  0.9333     |    8|
|    NOUN+CASE    | 0.0000   | 0.0000  |  0.0000    |     2|
|     DET+ADJ    | 1.0000   | 0.6667  |  0.8000      |   6|
|       PREP    | 1.0000   | 0.9718  |  0.9857  |      71|
|  CONJ+FUT-PART+V    | 0.0000   | 0.0000  |  0.0000   |      1|
|            CONJ+V    | 0.6667   | 0.7500  |  0.7059    |     8|
|         FUT-PART    | 1.0000   | 1.0000  |  1.0000     |    2|
|             ADJ+PRON    | 1.0000   | 0.0000  |  0.0000      |   8|
|   CONJ+PREP+NOUN+PRON    | 1.0000   | 0.0000  |  0.0000       |  1|
|        CONJ+NOUN+PRON    | 0.3750   | 1.0000  |  0.5455      |   3|
|              PART+ADJ    | 1.0000   | 0.0000  |  0.0000       |  1|
|             PART+NOUN    | 0.5000   | 1.0000  |  0.6667        | 1|
|       CONJ+PREP+NOUN    | 1.0000   | 0.0000  |  0.0000       |  1|
|           CONJ+NOUN    | 0.7000   | 0.7778  |  0.7368  |       9|
|                URL    | 1.0000   | 1.0000   | 1.0000 |        3|
|     CONJ+FUT-PART    | 1.0000   | 0.0000   | 0.0000  |       1|
|       FUT-PART+V    | 0.8571   | 0.6000   | 0.7059   |     10|
|      PREP+NOUN+NSUFF+NSUFF    | 1.0000   | 0.0000    | 0.0000   |      1|
|                      HASH    | 1.0000   | 0.9412   | 0.9697     |   17|
|            ADJ+PREP+PRON    | 1.0000   | 0.0000   | 0.0000  |       3|
|          PREP+NOUN+PRON    | 0.0000   | 0.0000   | 0.0000   |      1|
|                   EMOT    | 1.0000   | 0.8889   | 0.9412    |    18|
|             CONJ+PREP    | 1.0000   | 0.7500   | 0.8571     |    4|
|  PREP+DET+NOUN+NSUFF    | 1.0000   | 0.7500   | 0.8571      |   4|
| PRON+DET+NOUN+NSUFF    | 0.0000   | 1.0000   | 0.0000       |  0|
|        V+PREP+PRON    | 1.0000   | 0.0000   | 0.0000        | 5|
|  V+PRON+PREP+PRON    | 0.0000   | 1.0000   | 0.0000         | 0|
|  CONJ+NOUN+NSUFF    | 0.5000   | 0.5000   | 0.5000 |        2|
|      V+NEG-PART    | 1.0000   | 0.0000   | 0.0000  |       2|
|  PREP+DET+NOUN    | 0.9091   | 1.0000   | 0.9524   |     10|
|        PREP+V    | 1.0000   | 0.0000   | 0.0000    |     2|
|    CONJ+PART    | 1.0000   | 0.7778   | 0.8750     |    9|
| CONJ+V+PRON    | 1.0000   | 1.0000   | 1.0000 |        5|
|    PROG-PART+V+PREP+PRON    | 1.0000   | 0.5000   | 0.6667  |       2|
|    PREP+NOUN+NSUFF+PRON    | 1.0000   | 1.0000   | 1.0000   |      1|
|               ADJ+CASE    | 1.0000   | 0.0000    | 0.0000   |      1|
|        PART+NOUN+PRON    | 1.0000   | 1.0000   | 1.0000     |    1|
|               PART+V    | 1.0000   | 0.0000  |  0.0000      |   3|
|         PART+V+PRON    | 0.0000   | 1.0000  |  0.0000       |  0|
|    FUT-PART+V+PRON    | 0.0000   | 1.0000  |  0.0000        | 0|
|FUT-PART+V+PRON+PRON    | 1.0000   | 0.0000  |  0.0000  |       1|
|     CONJ+PREP+PRON    | 1.0000   | 0.0000  |  0.0000   |      1|
|CONJ+V+PRON+PREP+PRON    | 1.0000   | 0.0000  |  0.0000    |     1|
|    CONJ+V+PREP+PRON    | 0.0000   | 1.0000  |  0.0000     |    0|
|CONJ+DET+NOUN+NSUFF    | 1.0000   | 0.0000  |  0.0000      |   1|
|     CONJ+DET+NOUN    | 0.6667   | 1.0000  |  0.8000    |     2|
| CONJ+PREP+DET+NOUN   |  1.0000  |  1.0000 |   1.0000  |       1|
|       PREP+PART    | 1.0000   | 0.0000  |  0.0000  |       2|
|      PART+V+PRON+NEG-PART    | 0.3333   | 0.3333  |  0.3333         | 3|
|          PART+V+NEG-PART    | 0.3333   | 0.5000  |  0.4000        | 2|
|      PART+PREP+NEG-PART    | 1.0000   | 1.0000  |  1.0000       |  3|
| PART+PROG-PART+V+NEG-PART    | 1.0000   | 0.3333   | 0.5000      |   3|
| PREP+DET+NOUN+NSUFF+PREP+PRON   |  1.0000  |  0.0000  |  0.0000    |     1|
|         PREP+PRON+DET+NOUN    | 0.0000   | 1.0000    | 0.0000   |      0|
|                PART+NSUFF    | 1.0000   | 0.0000    | 0.0000  |       1|
|    CONJ+PROG-PART+V+PRON    | 1.0000   | 1.0000   | 1.0000    |     1|
|          PART+PREP+PRON    | 1.0000   | 0.0000   | 0.0000   |      1|
|         CONJ+PART+PREP    | 1.0000   | 0.0000    | 0.0000        | 1|
|             NUM+NSUFF    | 0.6667   | 0.6667   | 0.6667        | 3|
| CONJ+PART+V+PRON+NEG-PART   |  1.0000  |  1.0000  |  1.0000      |   1|
|     PART+NOUN+NEG-PART    | 1.0000   | 1.0000   | 1.0000      |   1|
|        CONJ+ADJ+NSUFF     | 1.0000  |  0.0000  |  0.0000    |     1|
|             PREP+ADJ     | 1.0000  |  0.0000  |  0.0000   |      1|
|      ADJ+NSUFF+PRON     | 1.0000  |  0.0000  |  0.0000  |       2|
|   CONJ+PROG-PART+V    | 1.0000   | 0.0000   | 0.0000   |      1|
| CONJ+PART+PROG-PART+V+PREP+PRON+NEG-PART   |  1.0000  |  0.0000  |  0.0000 |        1|
|          CONJ+PART+PREP+PRON+NEG-PART    | 0.0000   | 1.0000   | 0.0000 |        0|
|                       PREP+PART+PRON    | 1.0000   | 0.0000   | 0.0000    |     1|
|                      CONJ+ADV+NSUFF    | 1.0000   | 0.0000    |0.0000   |      1|
|                           CONJ+ADV    | 0.0000   | 1.0000   | 0.0000  |       0|
|           PART+NOUN+PRON+NEG-PART    | 0.0000   | 1.0000  |  0.0000 |        0|
|                         CONJ+ADJ    | 1.0000   | 1.0000 |   1.0000 |         1|

</details>

- F-score (micro): 0.8974
- F-score (macro): 0.5188
- Accuracy (incl. no class): 0.901  
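
The gap between the micro and macro F-scores is consistent with the many rare classes that score 0 in the per-tag table: macro averaging gives every class the same weight regardless of how often it occurs. The sketch below illustrates the effect on a hand-picked subset of six rows from the table, so its results are not expected to reproduce the reported figures. It uses a support-weighted mean as a rough, frequency-sensitive stand-in, which is an assumption for illustration and not the micro F-score itself.

```python
# (tag, f1-score, support) copied from six rows of the per-tag table.
rows = [
    ("ADP", 0.9761, 106),
    ("PRON", 0.9542, 187),
    ("PART", 0.9565, 191),
    ("ADJ+PRON", 0.0000, 8),
    ("PREP+V", 0.0000, 2),
    ("PART+V", 0.0000, 3),
]

# Macro average: unweighted mean of per-class F1, so rare classes count fully.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Support-weighted average: each class contributes in proportion to its frequency.
total_support = sum(support for _, _, support in rows)
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(f"macro F1 (subset):    {macro_f1:.4f}")
print(f"weighted F1 (subset): {weighted_f1:.4f}")
```

On this subset, the three zero-scoring classes account for only 13 of 497 tokens, yet they make up half of the classes in the macro average.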

Expand the details section above to show the class scores for each tag. Note that each compound tag (a single tag covering several agglutinated parts of speech, such as `DET+NOUN`) is treated as a separate class.
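
Judging by the table, compound tags join their component tags with `+`. Under that assumption, a compound prediction can be broken into its parts as sketched below. `split_compound_tag` is a hypothetical helper and not part of Flair or of this model, and the tags listed are examples taken from the table.

```python
from typing import Dict, List


def split_compound_tag(tag: str) -> List[str]:
    """Split a compound tag such as 'DET+NOUN' into its component tags.

    Assumes '+' is used only as the separator between components.
    """
    return tag.split("+")


# Example tags taken from the score table.
tags = ["DET+NOUN", "PREP+DET+NOUN+NSUFF", "PROG-PART+V+PRON", "NOUN"]

components: Dict[str, List[str]] = {tag: split_compound_tag(tag) for tag in tags}
n_compound = sum(1 for parts in components.values() if len(parts) > 1)

for tag, parts in components.items():
    print(f"{tag} -> {parts}")
print(f"{n_compound} of {len(tags)} example tags are compounds")
```

For scoring, `DET+NOUN` and `NOUN` remain distinct classes even though they share a component, which is why a prediction that gets only some components right counts as a full error.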

# Citation
*If you use this model, please consider citing [this work](https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects):*
```latex
@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
year = {2021},
doi = "10.13140/RG.2.2.34961.10084",
url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}
```