---
language:
- ar
- en
license: apache-2.0
datasets:
- 4Dialects
- MADAR
- CSCS
thumbnail: https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- token-classification
- sequence-tagger-model
- Dialectal Arabic
- Code-Switching
- Code-Mixing
metrics:
- f1
widget:
- text: "طلعوا جماعة الممانعة بالسياسة ما بيعرفوا ولا بالصحة بيعرفوا ولا حتى بالدين"
- text: "أعلم أن هذا يبدو غير عادل ، لكن لا يمكن أن يكون هناك ظلم"
- text: "أنا عارف أن الموضوع ده شكله مش عادل ، بس لا يمكن أن يكون فيه ظلم"
---
# Arabic Flair + fastText Part-of-Speech Tagging Model (Egyptian and Levantine)
A pretrained part-of-speech tagging model built on a joint corpus written in Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, including code-switching between Egyptian Arabic and English. The model is trained using [Flair](https://aclanthology.org/C18-1139/) (forward + backward) and [fastText](https://fasttext.cc) embeddings.
# Pretraining Corpora:
This sequence labeling model was pretrained on three corpora jointly:
1. [4 Dialects](https://huggingface.co/datasets/viewer/?dataset=arabic_pos_dialect)
A dialectal Arabic dataset covering four dialects of Arabic: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). The data for each dialect consists of 350 manually segmented and PoS-tagged tweets.
2. [UD South Levantine Arabic MADAR](https://universaldependencies.org/treebanks/ajp_madar/index.html)
A dataset of 100 manually annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project by [Shorouq Zahra](mailto:shorouqjzahra@gmail.com).
3. Parts of the Cairo Students Code-Switch (CSCS) corpus developed for ["Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus"](https://aclanthology.org/L18-1601.pdf) by Hamed et al.
# Usage
```python
from flair.data import Sentence
from flair.models import SequenceTagger
# Load the pretrained tagger by its model identifier
tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")
sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')
# Predict the part-of-speech tags in place on the sentence
tagger.predict(sentence)
# Print each tagged span
for entity in sentence.get_spans('pos'):
    print(entity)
```
Because right-to-left text is embedded in a left-to-right context, some formatting errors might occur, and your code might appear like [this](https://ibb.co/ky20Lnq) (link accessed on 2020-10-27).
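
One way to reduce such display problems when printing Arabic tokens next to Latin-script tags is to wrap each token in Unicode bidirectional isolate characters. The following is a minimal sketch, not part of Flair: the helper names `isolate` and `format_tagged` are hypothetical, the example pairs are made up for illustration rather than taken from the model's output, and whether the isolates are honored depends on the terminal or editor.

```python
# Unicode bidi control characters (from the Unicode Bidirectional Algorithm).
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction is taken from the wrapped text
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes the isolate


def isolate(text):
    """Wrap text in a bidi isolate so it does not reorder its neighbours."""
    return f"{FSI}{text}{PDI}"


def format_tagged(pairs):
    """Render (token, tag) pairs one per line, with each token isolated."""
    return "\n".join(f"{isolate(token)}\t{tag}" for token, tag in pairs)


# Hypothetical (token, tag) pairs, for illustration only; not real predictions.
example = [("الجامعة", "NOUN"), ("في", "ADP"), (".", "PUNCT")]
print(format_tagged(example))
```

The isolates only affect how the text is displayed; strip them again before passing the strings to any further processing.
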
<!--# Example
# Tagset-->
# Scores & Tagset
<details>
| |precision | recall | f1-score | support|
|--|-----------|------|-------------|--------------|
|INTJ | 0.8182 | 0.9000 |0.8571 | 10|
|NOUN | 0.9009 | 0.9402 |0.9201 | 435|
|NUM | 0.9524 | 0.8333 | 0.8889 | 24|
|ADJ |0.8762 | 0.7603 | 0.8142 | 121|
|ADP |0.9903 |0.9623 | 0.9761 |106|
| CCONJ | 0.9600 | 0.9730 | 0.9664 | 74|
|PROPN | 0.9333 | 0.9333 | 0.9333 | 15|
| ADV | 0.9135 | 0.8051 | 0.8559 | 118|
|VERB | 0.8852 | 0.9231 | 0.9038 | 117|
|PRON | 0.9620 | 0.9465 | 0.9542 | 187|
|SCONJ | 0.8571 | 0.9474 | 0.9000 | 19|
|PART | 0.9350 | 0.9791 | 0.9565 | 191|
| DET | 0.9348 | 0.9149 | 0.9247 | 47|
|PUNCT | 1.0000 | 1.0000 | 1.0000 | 35|
| AUX | 0.9286 | 0.9811 | 0.9541 | 53|
|MENTION | 0.9231 | 1.0000 | 0.9600 | 12|
| V | 0.8571 | 0.8780 | 0.8675 | 82|
| FUT-PART+V+PREP+PRON |1.0000 | 0.0000 | 0.0000 | 1|
| PROG-PART+V+PRON+PREP+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
|ADJ+NSUFF | 0.6111 | 0.8462 | 0.7097 | 26|
|NOUN+NSUFF | 0.8182 | 0.8438 | 0.8308 | 64|
|PREP+PRON | 0.9565 | 0.9565 | 0.9565 | 23|
| PUNC | 0.9941 | 1.0000 | 0.9971 | 169|
| EOS |1.0000 | 1.0000 | 1.0000 | 70|
| NOUN+PRON | 0.6986 | 0.8500 | 0.7669 | 60|
| V+PRON | 0.7258 | 0.8036 | 0.7627 | 56|
| PART+PRON | 1.0000 | 0.9474 | 0.9730 | 19|
| PROG-PART+V | 0.8333 | 0.9302 | 0.8791 | 43|
| DET+NOUN | 0.9625 | 1.0000 | 0.9809 | 77|
| NOUN+NSUFF+PRON | 0.9091 | 0.7143 | 0.8000 | 14|
| PROG-PART+V+PRON | 0.7083 | 0.9444 | 0.8095 | 18|
| PREP+NOUN+NSUFF | 0.6667 | 0.4000 | 0.5000 | 5|
| NOUN+NSUFF+NSUFF | 1.0000 | 0.0000 | 0.0000 | 3|
| CONJ | 0.9722 | 1.0000 | 0.9859 | 35|
| V+PRON+PRON | 0.6364 | 0.5833 | 0.6087 | 12|
| FOREIGN | 0.6667 | 0.6667 | 0.6667 | 3|
| PREP+NOUN | 0.6316 | 0.7500 | 0.6857 | 16|
| DET+NOUN+NSUFF | 0.9000 | 0.9310 | 0.9153 | 29|
| DET+ADJ+NSUFF | 1.0000 | 0.5714 | 0.7273 | 7|
| CONJ+PRON | 1.0000 | 0.8750 | 0.9333 | 8|
| NOUN+CASE | 0.0000 | 0.0000 | 0.0000 | 2|
| DET+ADJ | 1.0000 | 0.6667 | 0.8000 | 6|
| PREP | 1.0000 | 0.9718 | 0.9857 | 71|
| CONJ+FUT-PART+V | 0.0000 | 0.0000 | 0.0000 | 1|
| CONJ+V | 0.6667 | 0.7500 | 0.7059 | 8|
| FUT-PART | 1.0000 | 1.0000 | 1.0000 | 2|
| ADJ+PRON | 1.0000 | 0.0000 | 0.0000 | 8|
| CONJ+PREP+NOUN+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+NOUN+PRON | 0.3750 | 1.0000 | 0.5455 | 3|
| PART+ADJ | 1.0000 | 0.0000 | 0.0000 | 1|
| PART+NOUN | 0.5000 | 1.0000 | 0.6667 | 1|
| CONJ+PREP+NOUN | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+NOUN | 0.7000 | 0.7778 | 0.7368 | 9|
| URL | 1.0000 | 1.0000 | 1.0000 | 3|
| CONJ+FUT-PART | 1.0000 | 0.0000 | 0.0000 | 1|
| FUT-PART+V | 0.8571 | 0.6000 | 0.7059 | 10|
| PREP+NOUN+NSUFF+NSUFF | 1.0000 | 0.0000 | 0.0000 | 1|
| HASH | 1.0000 | 0.9412 | 0.9697 | 17|
| ADJ+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 3|
| PREP+NOUN+PRON | 0.0000 | 0.0000 | 0.0000 | 1|
| EMOT | 1.0000 | 0.8889 | 0.9412 | 18|
| CONJ+PREP | 1.0000 | 0.7500 | 0.8571 | 4|
| PREP+DET+NOUN+NSUFF | 1.0000 | 0.7500 | 0.8571 | 4|
| PRON+DET+NOUN+NSUFF | 0.0000 | 1.0000 | 0.0000 | 0|
| V+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 5|
| V+PRON+PREP+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
| CONJ+NOUN+NSUFF | 0.5000 | 0.5000 | 0.5000 | 2|
| V+NEG-PART | 1.0000 | 0.0000 | 0.0000 | 2|
| PREP+DET+NOUN | 0.9091 | 1.0000 | 0.9524 | 10|
| PREP+V | 1.0000 | 0.0000 | 0.0000 | 2|
| CONJ+PART | 1.0000 | 0.7778 | 0.8750 | 9|
| CONJ+V+PRON | 1.0000 | 1.0000 | 1.0000 | 5|
| PROG-PART+V+PREP+PRON | 1.0000 | 0.5000 | 0.6667 | 2|
| PREP+NOUN+NSUFF+PRON | 1.0000 | 1.0000 | 1.0000 | 1|
| ADJ+CASE | 1.0000 | 0.0000 | 0.0000 | 1|
| PART+NOUN+PRON | 1.0000 | 1.0000 | 1.0000 | 1|
| PART+V | 1.0000 | 0.0000 | 0.0000 | 3|
| PART+V+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
| FUT-PART+V+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
|FUT-PART+V+PRON+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
|CONJ+V+PRON+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+V+PREP+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
|CONJ+DET+NOUN+NSUFF | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+DET+NOUN | 0.6667 | 1.0000 | 0.8000 | 2|
| CONJ+PREP+DET+NOUN | 1.0000 | 1.0000 | 1.0000 | 1|
| PREP+PART | 1.0000 | 0.0000 | 0.0000 | 2|
| PART+V+PRON+NEG-PART | 0.3333 | 0.3333 | 0.3333 | 3|
| PART+V+NEG-PART | 0.3333 | 0.5000 | 0.4000 | 2|
| PART+PREP+NEG-PART | 1.0000 | 1.0000 | 1.0000 | 3|
| PART+PROG-PART+V+NEG-PART | 1.0000 | 0.3333 | 0.5000 | 3|
| PREP+DET+NOUN+NSUFF+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| PREP+PRON+DET+NOUN | 0.0000 | 1.0000 | 0.0000 | 0|
| PART+NSUFF | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PROG-PART+V+PRON | 1.0000 | 1.0000 | 1.0000 | 1|
| PART+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PART+PREP | 1.0000 | 0.0000 | 0.0000 | 1|
| NUM+NSUFF | 0.6667 | 0.6667 | 0.6667 | 3|
| CONJ+PART+V+PRON+NEG-PART | 1.0000 | 1.0000 | 1.0000 | 1|
| PART+NOUN+NEG-PART | 1.0000 | 1.0000 | 1.0000 | 1|
| CONJ+ADJ+NSUFF | 1.0000 | 0.0000 | 0.0000 | 1|
| PREP+ADJ | 1.0000 | 0.0000 | 0.0000 | 1|
| ADJ+NSUFF+PRON | 1.0000 | 0.0000 | 0.0000 | 2|
| CONJ+PROG-PART+V | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PART+PROG-PART+V+PREP+PRON+NEG-PART | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PART+PREP+PRON+NEG-PART | 0.0000 | 1.0000 | 0.0000 | 0|
| PREP+PART+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+ADV+NSUFF | 1.0000 | 0.0000 |0.0000 | 1|
| CONJ+ADV | 0.0000 | 1.0000 | 0.0000 | 0|
| PART+NOUN+PRON+NEG-PART | 0.0000 | 1.0000 | 0.0000 | 0|
| CONJ+ADJ | 1.0000 | 1.0000 | 1.0000 | 1|
</details>
- F-score (micro): 0.8974
- F-score (macro): 0.5188
- Accuracy (incl. no class): 0.901
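
A macro F-score far below the micro F-score is what one would expect when many rare tags score poorly, since the macro average weights every tag equally. The card does not state how the two averages were computed, so the following is only a minimal sketch assuming the standard definitions. The function `f_scores` and the toy data are hypothetical and are not part of Flair or of this model's evaluation.

```python
from collections import Counter


def f_scores(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted tag was wrong
            fn[g] += 1  # the gold tag was missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    labels = set(gold) | set(pred)
    # Micro: pool the counts over all tags, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per tag, then take the unweighted mean.
    macro = sum(f1(tp[t], fp[t], fn[t]) for t in labels) / len(labels)
    return micro, macro


# Toy data (not from the evaluation set): one frequent tag, two rare compounds.
gold = ["NOUN"] * 8 + ["PREP+ADJ", "PART+V"]
pred = ["NOUN"] * 10
micro, macro = f_scores(gold, pred)
print(f"micro={micro:.4f} macro={macro:.4f}")  # micro=0.8000 macro=0.2963
```

In the toy data, the two rare compound tags are never predicted, which barely moves the micro score but pulls the macro score down sharply.
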
Expand the details section to show the per-tag scores. Note that compound tags (a single tag covering several agglutinated parts of speech, such as `CONJ+PREP+NOUN`) are each treated as a separate class.
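
If downstream code needs the individual parts of speech rather than the full tag, a compound can be split on the `+` separator used in the score table. This is a minimal post-processing sketch, assuming `+` only ever appears as the separator; the helpers `split_compound` and `component_counts` are hypothetical and not part of the model or of Flair.

```python
from collections import Counter


def split_compound(tag):
    """Split a compound tag such as 'CONJ+PREP+NOUN' into its component tags."""
    return tag.split("+")


def component_counts(tags):
    """Count how often each component tag occurs across full (compound) tags."""
    counts = Counter()
    for tag in tags:
        counts.update(split_compound(tag))
    return counts


# Tag names copied from the score table; hyphenated parts like PROG-PART stay whole.
tags = ["NOUN+NSUFF", "CONJ+PREP+NOUN", "PROG-PART+V+PRON", "PREP"]
for tag in tags:
    print(tag, "->", split_compound(tag))
print(component_counts(tags))
```

Splitting happens only after prediction; the model itself still predicts each compound as one class.
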
# Citation
*If you use this model, please consider citing [this work](https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects):*
```latex
@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
year = {2021},
doi = "10.13140/RG.2.2.34961.10084",
url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}
```