---
language: 
- ar
- en
license: apache-2.0
datasets:
- 4Dialects
- MADAR
- CSCS
thumbnail: https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- token-classification
- sequence-tagger-model
- Dialectal Arabic
- Code-Switching
- Code-Mixing
metrics:
- f1
widget:
- text: "طلعوا جماعة الممانعة بالسياسة ما بيعرفوا ولا بالصحة بيعرفوا ولا حتى بالدين"
- text: "أعلم أن هذا يبدو غير عادل ، لكن لا يمكن أن يكون هناك ظلم"
- text: "أنا عارف أن الموضوع ده شكله مش عادل ، بس لا يمكن أن يكون فيه ظلم"
---


# Arabic Flair + fastText Part-of-Speech Tagging Model (Egyptian and Levantine)
Pretrained part-of-speech tagging model built on a joint corpus written in Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, including code-switching between Egyptian Arabic and English. The model is trained using [Flair](https://aclanthology.org/C18-1139/) (forward + backward) and [fastText](https://fasttext.cc) embeddings.



# Pretraining Corpora
This sequence labeling model was pretrained jointly on three corpora:
1. [4 Dialects](https://huggingface.co/datasets/viewer/?dataset=arabic_pos_dialect)
A dialectal Arabic dataset covering four dialects of Arabic: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). Each dialect's dataset consists of 350 manually segmented and PoS-tagged tweets.
2. [UD South Levantine Arabic MADAR](https://universaldependencies.org/treebanks/ajp_madar/index.html)
A dataset of 100 manually annotated sentences by [Shorouq Zahra](mailto:shorouqjzahra@gmail.com), taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project.
3. Parts of the Cairo Students Code-Switch (CSCS) corpus, developed for ["Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus"](https://aclanthology.org/L18-1601.pdf) by Hamed et al.

# Usage
```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pretrained tagger by its model identifier
tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")
sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')
# Predict part-of-speech tags in place on the sentence
tagger.predict(sentence)
# Print every span labeled with a 'pos' tag
for entity in sentence.get_spans('pos'):
    print(entity)
```

Because the example embeds right-to-left text in a left-to-right context, some formatting errors might occur, and your code might appear like [this](https://ibb.co/ky20Lnq) (link accessed on 2020-10-27).
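One possible way to reduce such rendering problems when printing Arabic tokens next to Latin-script tags is to wrap each token in Unicode bidirectional isolate characters. The following is only a sketch: the helper names `isolate_rtl` and `format_tagged` are hypothetical, the tag in the example is illustrative rather than actual model output, and whether the isolates are honored depends on the terminal or editor in use.

```python
# Unicode bidi control characters (defined by the Unicode Bidirectional Algorithm)
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_rtl(text: str) -> str:
    """Wrap text so it is laid out as one isolated right-to-left run."""
    return f"{RLI}{text}{PDI}"


def format_tagged(token: str, tag: str) -> str:
    """Hypothetical helper: one output line with the token and its tag."""
    return f"{isolate_rtl(token)}\t{tag}"


# Illustrative token/tag pair, not a prediction of the model
line = format_tagged("بالقاهرة", "NOUN")
print(line)
```

The isolates only affect display order; strip them again before passing the text to the tagger or writing it to a data file.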

<!--# Example

# Tagset-->

# Scores & Tagset
<details> 

| tag | precision | recall | f1-score | support |
|-----|-----------|--------|----------|---------|
|INTJ |    0.8182   | 0.9000    |0.8571    |    10|
|NOUN   |  0.9009   | 0.9402    |0.9201      | 435|
|NUM    | 0.9524   | 0.8333   | 0.8889       | 24|
|ADJ     |0.8762   | 0.7603  |  0.8142      | 121|
|ADP     |0.9903    |0.9623 |   0.9761       |106|
| CCONJ |    0.9600   | 0.9730 |   0.9664 |       74|
|PROPN |    0.9333   | 0.9333  |  0.9333  |      15|
| ADV  |   0.9135   | 0.8051  |  0.8559   |    118|
|VERB   |  0.8852    | 0.9231 |   0.9038   |    117|
|PRON    | 0.9620    | 0.9465 |   0.9542    |   187|
|SCONJ |    0.8571   | 0.9474  |  0.9000      |  19|
|PART  |   0.9350   | 0.9791   | 0.9565       | 191|
| DET   |  0.9348    | 0.9149  |  0.9247 |       47|
|PUNCT    | 1.0000    | 1.0000  |  1.0000  |      35|
| AUX  |   0.9286    | 0.9811  |  0.9541   |     53|
|MENTION   |  0.9231   |  1.0000  |  0.9600    |    12|
|     V    | 0.8571   | 0.8780    | 0.8675     |   82|
| FUT-PART+V+PREP+PRON     |1.0000   | 0.0000   | 0.0000       |  1|
|  PROG-PART+V+PRON+PREP+PRON |     0.0000  |  1.0000  |  0.0000       |  0|
|ADJ+NSUFF |    0.6111   | 0.8462   | 0.7097 |       26|
|NOUN+NSUFF  |   0.8182   | 0.8438   | 0.8308  |      64|
|PREP+PRON   |  0.9565   | 0.9565   | 0.9565   |     23|
|                   PUNC    | 0.9941   | 1.0000   | 0.9971    |   169|
|                    EOS     |1.0000   | 1.0000   | 1.0000    |   70|
|             NOUN+PRON   |  0.6986   | 0.8500   | 0.7669      |  60|
|                V+PRON    | 0.7258   | 0.8036   | 0.7627       | 56|
|            PART+PRON    | 1.0000   | 0.9474   | 0.9730    |    19|
|          PROG-PART+V    | 0.8333   | 0.9302   | 0.8791 |       43|
|            DET+NOUN    | 0.9625   | 1.0000   | 0.9809  |      77|
|     NOUN+NSUFF+PRON    | 0.9091   | 0.7143   | 0.8000   |     14|
|     PROG-PART+V+PRON    | 0.7083   | 0.9444   | 0.8095    |    18|
|      PREP+NOUN+NSUFF    | 0.6667   | 0.4000   | 0.5000    |     5|
|     NOUN+NSUFF+NSUFF    | 1.0000   | 0.0000   | 0.0000 |        3|
|                CONJ    | 0.9722   | 1.0000   | 0.9859  |      35|
|        V+PRON+PRON    | 0.6364   | 0.5833   | 0.6087   |     12|
|           FOREIGN    | 0.6667   | 0.6667   | 0.6667    |     3|
|        PREP+NOUN    | 0.6316   | 0.7500  |  0.6857 |       16|
|  DET+NOUN+NSUFF    | 0.9000   | 0.9310  |  0.9153  |      29|
|  DET+ADJ+NSUFF    | 1.0000   | 0.5714  |  0.7273   |      7|
|     CONJ+PRON    | 1.0000   | 0.8750  |  0.9333     |    8|
|    NOUN+CASE    | 0.0000   | 0.0000  |  0.0000    |     2|
|     DET+ADJ    | 1.0000   | 0.6667  |  0.8000      |   6|
|       PREP    | 1.0000   | 0.9718  |  0.9857  |      71|
|  CONJ+FUT-PART+V    | 0.0000   | 0.0000  |  0.0000   |      1|
|            CONJ+V    | 0.6667   | 0.7500  |  0.7059    |     8|
|         FUT-PART    | 1.0000   | 1.0000  |  1.0000     |    2|
|             ADJ+PRON    | 1.0000   | 0.0000  |  0.0000      |   8|
|   CONJ+PREP+NOUN+PRON    | 1.0000   | 0.0000  |  0.0000       |  1|
|        CONJ+NOUN+PRON    | 0.3750   | 1.0000  |  0.5455      |   3|
|              PART+ADJ    | 1.0000   | 0.0000  |  0.0000       |  1|
|             PART+NOUN    | 0.5000   | 1.0000  |  0.6667        | 1|
|       CONJ+PREP+NOUN    | 1.0000   | 0.0000  |  0.0000       |  1|
|           CONJ+NOUN    | 0.7000   | 0.7778  |  0.7368  |       9|
|                URL    | 1.0000   | 1.0000   | 1.0000 |        3|
|     CONJ+FUT-PART    | 1.0000   | 0.0000   | 0.0000  |       1|
|       FUT-PART+V    | 0.8571   | 0.6000   | 0.7059   |     10|
|      PREP+NOUN+NSUFF+NSUFF    | 1.0000   | 0.0000    | 0.0000   |      1|
|                      HASH    | 1.0000   | 0.9412   | 0.9697     |   17|
|            ADJ+PREP+PRON    | 1.0000   | 0.0000   | 0.0000  |       3|
|          PREP+NOUN+PRON    | 0.0000   | 0.0000   | 0.0000   |      1|
|                   EMOT    | 1.0000   | 0.8889   | 0.9412    |    18|
|             CONJ+PREP    | 1.0000   | 0.7500   | 0.8571     |    4|
|  PREP+DET+NOUN+NSUFF    | 1.0000   | 0.7500   | 0.8571      |   4|
| PRON+DET+NOUN+NSUFF    | 0.0000   | 1.0000   | 0.0000       |  0|
|        V+PREP+PRON    | 1.0000   | 0.0000   | 0.0000        | 5|
|  V+PRON+PREP+PRON    | 0.0000   | 1.0000   | 0.0000         | 0|
|  CONJ+NOUN+NSUFF    | 0.5000   | 0.5000   | 0.5000 |        2|
|      V+NEG-PART    | 1.0000   | 0.0000   | 0.0000  |       2|
|  PREP+DET+NOUN    | 0.9091   | 1.0000   | 0.9524   |     10|
|        PREP+V    | 1.0000   | 0.0000   | 0.0000    |     2|
|    CONJ+PART    | 1.0000   | 0.7778   | 0.8750     |    9|
| CONJ+V+PRON    | 1.0000   | 1.0000   | 1.0000 |        5|
|    PROG-PART+V+PREP+PRON    | 1.0000   | 0.5000   | 0.6667  |       2|
|    PREP+NOUN+NSUFF+PRON    | 1.0000   | 1.0000   | 1.0000   |      1|
|               ADJ+CASE    | 1.0000   | 0.0000    | 0.0000   |      1|
|        PART+NOUN+PRON    | 1.0000   | 1.0000   | 1.0000     |    1|
|               PART+V    | 1.0000   | 0.0000  |  0.0000      |   3|
|         PART+V+PRON    | 0.0000   | 1.0000  |  0.0000       |  0|
|    FUT-PART+V+PRON    | 0.0000   | 1.0000  |  0.0000        | 0|
|FUT-PART+V+PRON+PRON    | 1.0000   | 0.0000  |  0.0000  |       1|
|     CONJ+PREP+PRON    | 1.0000   | 0.0000  |  0.0000   |      1|
|CONJ+V+PRON+PREP+PRON    | 1.0000   | 0.0000  |  0.0000    |     1|
|    CONJ+V+PREP+PRON    | 0.0000   | 1.0000  |  0.0000     |    0|
|CONJ+DET+NOUN+NSUFF    | 1.0000   | 0.0000  |  0.0000      |   1|
|     CONJ+DET+NOUN    | 0.6667   | 1.0000  |  0.8000    |     2|
| CONJ+PREP+DET+NOUN   |  1.0000  |  1.0000 |   1.0000  |       1|
|       PREP+PART    | 1.0000   | 0.0000  |  0.0000  |       2|
|      PART+V+PRON+NEG-PART    | 0.3333   | 0.3333  |  0.3333         | 3|
|          PART+V+NEG-PART    | 0.3333   | 0.5000  |  0.4000        | 2|
|      PART+PREP+NEG-PART    | 1.0000   | 1.0000  |  1.0000       |  3|
| PART+PROG-PART+V+NEG-PART    | 1.0000   | 0.3333   | 0.5000      |   3|
| PREP+DET+NOUN+NSUFF+PREP+PRON   |  1.0000  |  0.0000  |  0.0000    |     1|
|         PREP+PRON+DET+NOUN    | 0.0000   | 1.0000    | 0.0000   |      0|
|                PART+NSUFF    | 1.0000   | 0.0000    | 0.0000  |       1|
|    CONJ+PROG-PART+V+PRON    | 1.0000   | 1.0000   | 1.0000    |     1|
|          PART+PREP+PRON    | 1.0000   | 0.0000   | 0.0000   |      1|
|         CONJ+PART+PREP    | 1.0000   | 0.0000    | 0.0000        | 1|
|             NUM+NSUFF    | 0.6667   | 0.6667   | 0.6667        | 3|
| CONJ+PART+V+PRON+NEG-PART   |  1.0000  |  1.0000  |  1.0000      |   1|
|     PART+NOUN+NEG-PART    | 1.0000   | 1.0000   | 1.0000      |   1|
|        CONJ+ADJ+NSUFF     | 1.0000  |  0.0000  |  0.0000    |     1|
|             PREP+ADJ     | 1.0000  |  0.0000  |  0.0000   |      1|
|      ADJ+NSUFF+PRON     | 1.0000  |  0.0000  |  0.0000  |       2|
|   CONJ+PROG-PART+V    | 1.0000   | 0.0000   | 0.0000   |      1|
| CONJ+PART+PROG-PART+V+PREP+PRON+NEG-PART   |  1.0000  |  0.0000  |  0.0000 |        1|
|          CONJ+PART+PREP+PRON+NEG-PART    | 0.0000   | 1.0000   | 0.0000 |        0|
|                       PREP+PART+PRON    | 1.0000   | 0.0000   | 0.0000    |     1|
|                      CONJ+ADV+NSUFF    | 1.0000   | 0.0000    |0.0000   |      1|
|                           CONJ+ADV    | 0.0000   | 1.0000   | 0.0000  |       0|
|           PART+NOUN+PRON+NEG-PART    | 0.0000   | 1.0000  |  0.0000 |        0|
|                         CONJ+ADJ    | 1.0000   | 1.0000 |   1.0000 |         1|

</details>

- F-score (micro): 0.8974
- F-score (macro): 0.5188
- Accuracy (incl. no class): 0.901  

Expand the details section above to show the per-class scores for each tag. Note that compound tags (a single tag covering multiple agglutinated parts of speech) are treated as separate classes.
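Because every compound tag counts as its own class, many classes have only a handful of test instances, and that is a plausible reason for the gap between the macro and micro F-scores. The sketch below illustrates the effect on four rows copied from the table (PRON, PUNC, ADJ+PRON, NOUN+CASE). It compares an unweighted (macro) mean with a support-weighted mean; the weighted mean is used here only as a stand-in to show the effect of class frequency, it is not the micro F-score computation, and the sketch does not reproduce the reported figures.

```python
# (tag, F1-score, support) copied from four rows of the score table
rows = [
    ("PRON", 0.9542, 187),
    ("PUNC", 0.9971, 169),
    ("ADJ+PRON", 0.0000, 8),
    ("NOUN+CASE", 0.0000, 2),
]

# Macro average: every class counts equally, however rare it is
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Support-weighted average: frequent classes dominate the result
total_support = sum(support for _, _, support in rows)
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(f"macro average:            {macro_f1:.4f}")
print(f"support-weighted average: {weighted_f1:.4f}")
```

On this subset the two rare compound tags with an F1 of zero pull the macro average down to roughly 0.49, while the support-weighted average stays near 0.95 because those two tags account for only 10 of 366 instances.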

# Citation
*If you use this model, please consider citing [this work](https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects):*
```latex
@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
year = {2021},
doi = "10.13140/RG.2.2.34961.10084",
url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}
```