---
language:
- ar
- en
license: apache-2.0
datasets:
- 4Dialects
- MADAR
- CSCS
thumbnail: https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- token-classification
- sequence-tagger-model
- Dialectal Arabic
- Code-Switching
- Code-Mixing
metrics:
- f1
widget:
- text: "طلعوا جماعة الممانعة بالسياسة ما بيعرفوا ولا بالصحة بيعرفوا ولا حتى بالدين"
- text: "أعلم أن هذا يبدو غير عادل ، لكن لا يمكن أن يكون هناك ظلم"
- text: "أنا عارف أن الموضوع ده شكله مش عادل ، بس لا يمكن أن يكون فيه ظلم"
---
# Arabic Flair + fastText Part-of-Speech Tagging Model (Egyptian and Levantine)
Pretrained part-of-speech tagging model built on a joint corpus of Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) Arabic dialects, including code-switching between Egyptian Arabic and English. The model is trained with [Flair](https://aclanthology.org/C18-1139/) (forward + backward) and [fastText](https://fasttext.cc) embeddings.
# Pretraining Corpora:
This sequence labeling model was pretrained on three corpora jointly:
1. [4 Dialects](https://huggingface.co/datasets/viewer/?dataset=arabic_pos_dialect)
A dialectal Arabic dataset covering four dialects: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). The data for each dialect consists of 350 manually segmented and PoS-tagged tweets.
2. [UD South Levantine Arabic MADAR](https://universaldependencies.org/treebanks/ajp_madar/index.html)
A dataset of 100 manually annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project, annotated by [Shorouq Zahra](mailto:shorouqjzahra@gmail.com).
3. Parts of the Cairo Students Code-Switch (CSCS) corpus developed for ["Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus"](https://aclanthology.org/L18-1601.pdf) by Hamed et al.
# Usage
```python
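from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pretrained tagger from the Hugging Face Hub
tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")
# Wrap the raw text in a Sentence object (Flair tokenizes it)
sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')
# Predict the PoS tags in place
tagger.predict(sentence)
# Print every tagged span
for entity in sentence.get_spans('pos'):
    print(entity)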
```
Because right-to-left text is embedded in a left-to-right context, some formatting errors might occur, and your code might appear like [this](https://ibb.co/ky20Lnq) (link accessed on 2020-10-27).
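One way to reduce the right-to-left display problems mentioned above is to print one token per line and wrap each token in Unicode directional isolates (U+2068 and U+2069), so that the Arabic token and the Latin tag do not reorder each other. The following is a minimal sketch under that assumption. The helper `format_tagged` and the `(token, tag)` pairs are hypothetical and written by hand for illustration; they are not output of this model, and whether the isolates help depends on the terminal or editor.

```python
from typing import Iterable, List, Tuple

FSI = "\u2068"  # FIRST STRONG ISOLATE: direction is taken from the wrapped text
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes the isolate


def format_tagged(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return one 'token<TAB>tag' line per token, isolating each token's direction."""
    return [f"{FSI}{token}{PDI}\t{tag}" for token, tag in pairs]


# Hypothetical (token, tag) pairs for illustration only; not model output.
example_pairs = [("في", "PREP"), ("الجامعة", "DET+NOUN+NSUFF"), (".", "PUNC")]

for line in format_tagged(example_pairs):
    print(line)
```

With a real prediction, the pairs would come from the tagged `sentence` instead of a hand-written list.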
<!--# Example
# Tagset-->
# Scores & Tagset
<details>
| tag | precision | recall | f1-score | support |
|-----|-----------|--------|----------|---------|
|INTJ | 0.8182 | 0.9000 |0.8571 | 10|
|NOUN | 0.9009 | 0.9402 |0.9201 | 435|
|NUM | 0.9524 | 0.8333 | 0.8889 | 24|
|ADJ |0.8762 | 0.7603 | 0.8142 | 121|
|ADP |0.9903 |0.9623 | 0.9761 |106|
| CCONJ | 0.9600 | 0.9730 | 0.9664 | 74|
|PROPN | 0.9333 | 0.9333 | 0.9333 | 15|
| ADV | 0.9135 | 0.8051 | 0.8559 | 118|
|VERB | 0.8852 | 0.9231 | 0.9038 | 117|
|PRON | 0.9620 | 0.9465 | 0.9542 | 187|
|SCONJ | 0.8571 | 0.9474 | 0.9000 | 19|
|PART | 0.9350 | 0.9791 | 0.9565 | 191|
| DET | 0.9348 | 0.9149 | 0.9247 | 47|
|PUNCT | 1.0000 | 1.0000 | 1.0000 | 35|
| AUX | 0.9286 | 0.9811 | 0.9541 | 53|
|MENTION | 0.9231 | 1.0000 | 0.9600 | 12|
| V | 0.8571 | 0.8780 | 0.8675 | 82|
| FUT-PART+V+PREP+PRON |1.0000 | 0.0000 | 0.0000 | 1|
| PROG-PART+V+PRON+PREP+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
|ADJ+NSUFF | 0.6111 | 0.8462 | 0.7097 | 26|
|NOUN+NSUFF | 0.8182 | 0.8438 | 0.8308 | 64|
|PREP+PRON | 0.9565 | 0.9565 | 0.9565 | 23|
| PUNC | 0.9941 | 1.0000 | 0.9971 | 169|
| EOS |1.0000 | 1.0000 | 1.0000 | 70|
| NOUN+PRON | 0.6986 | 0.8500 | 0.7669 | 60|
| V+PRON | 0.7258 | 0.8036 | 0.7627 | 56|
| PART+PRON | 1.0000 | 0.9474 | 0.9730 | 19|
| PROG-PART+V | 0.8333 | 0.9302 | 0.8791 | 43|
| DET+NOUN | 0.9625 | 1.0000 | 0.9809 | 77|
| NOUN+NSUFF+PRON | 0.9091 | 0.7143 | 0.8000 | 14|
| PROG-PART+V+PRON | 0.7083 | 0.9444 | 0.8095 | 18|
| PREP+NOUN+NSUFF | 0.6667 | 0.4000 | 0.5000 | 5|
| NOUN+NSUFF+NSUFF | 1.0000 | 0.0000 | 0.0000 | 3|
| CONJ | 0.9722 | 1.0000 | 0.9859 | 35|
| V+PRON+PRON | 0.6364 | 0.5833 | 0.6087 | 12|
| FOREIGN | 0.6667 | 0.6667 | 0.6667 | 3|
| PREP+NOUN | 0.6316 | 0.7500 | 0.6857 | 16|
| DET+NOUN+NSUFF | 0.9000 | 0.9310 | 0.9153 | 29|
| DET+ADJ+NSUFF | 1.0000 | 0.5714 | 0.7273 | 7|
| CONJ+PRON | 1.0000 | 0.8750 | 0.9333 | 8|
| NOUN+CASE | 0.0000 | 0.0000 | 0.0000 | 2|
| DET+ADJ | 1.0000 | 0.6667 | 0.8000 | 6|
| PREP | 1.0000 | 0.9718 | 0.9857 | 71|
| CONJ+FUT-PART+V | 0.0000 | 0.0000 | 0.0000 | 1|
| CONJ+V | 0.6667 | 0.7500 | 0.7059 | 8|
| FUT-PART | 1.0000 | 1.0000 | 1.0000 | 2|
| ADJ+PRON | 1.0000 | 0.0000 | 0.0000 | 8|
| CONJ+PREP+NOUN+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+NOUN+PRON | 0.3750 | 1.0000 | 0.5455 | 3|
| PART+ADJ | 1.0000 | 0.0000 | 0.0000 | 1|
| PART+NOUN | 0.5000 | 1.0000 | 0.6667 | 1|
| CONJ+PREP+NOUN | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+NOUN | 0.7000 | 0.7778 | 0.7368 | 9|
| URL | 1.0000 | 1.0000 | 1.0000 | 3|
| CONJ+FUT-PART | 1.0000 | 0.0000 | 0.0000 | 1|
| FUT-PART+V | 0.8571 | 0.6000 | 0.7059 | 10|
| PREP+NOUN+NSUFF+NSUFF | 1.0000 | 0.0000 | 0.0000 | 1|
| HASH | 1.0000 | 0.9412 | 0.9697 | 17|
| ADJ+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 3|
| PREP+NOUN+PRON | 0.0000 | 0.0000 | 0.0000 | 1|
| EMOT | 1.0000 | 0.8889 | 0.9412 | 18|
| CONJ+PREP | 1.0000 | 0.7500 | 0.8571 | 4|
| PREP+DET+NOUN+NSUFF | 1.0000 | 0.7500 | 0.8571 | 4|
| PRON+DET+NOUN+NSUFF | 0.0000 | 1.0000 | 0.0000 | 0|
| V+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 5|
| V+PRON+PREP+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
| CONJ+NOUN+NSUFF | 0.5000 | 0.5000 | 0.5000 | 2|
| V+NEG-PART | 1.0000 | 0.0000 | 0.0000 | 2|
| PREP+DET+NOUN | 0.9091 | 1.0000 | 0.9524 | 10|
| PREP+V | 1.0000 | 0.0000 | 0.0000 | 2|
| CONJ+PART | 1.0000 | 0.7778 | 0.8750 | 9|
| CONJ+V+PRON | 1.0000 | 1.0000 | 1.0000 | 5|
| PROG-PART+V+PREP+PRON | 1.0000 | 0.5000 | 0.6667 | 2|
| PREP+NOUN+NSUFF+PRON | 1.0000 | 1.0000 | 1.0000 | 1|
| ADJ+CASE | 1.0000 | 0.0000 | 0.0000 | 1|
| PART+NOUN+PRON | 1.0000 | 1.0000 | 1.0000 | 1|
| PART+V | 1.0000 | 0.0000 | 0.0000 | 3|
| PART+V+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
| FUT-PART+V+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
|FUT-PART+V+PRON+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
|CONJ+V+PRON+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+V+PREP+PRON | 0.0000 | 1.0000 | 0.0000 | 0|
|CONJ+DET+NOUN+NSUFF | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+DET+NOUN | 0.6667 | 1.0000 | 0.8000 | 2|
| CONJ+PREP+DET+NOUN | 1.0000 | 1.0000 | 1.0000 | 1|
| PREP+PART | 1.0000 | 0.0000 | 0.0000 | 2|
| PART+V+PRON+NEG-PART | 0.3333 | 0.3333 | 0.3333 | 3|
| PART+V+NEG-PART | 0.3333 | 0.5000 | 0.4000 | 2|
| PART+PREP+NEG-PART | 1.0000 | 1.0000 | 1.0000 | 3|
| PART+PROG-PART+V+NEG-PART | 1.0000 | 0.3333 | 0.5000 | 3|
| PREP+DET+NOUN+NSUFF+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| PREP+PRON+DET+NOUN | 0.0000 | 1.0000 | 0.0000 | 0|
| PART+NSUFF | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PROG-PART+V+PRON | 1.0000 | 1.0000 | 1.0000 | 1|
| PART+PREP+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PART+PREP | 1.0000 | 0.0000 | 0.0000 | 1|
| NUM+NSUFF | 0.6667 | 0.6667 | 0.6667 | 3|
| CONJ+PART+V+PRON+NEG-PART | 1.0000 | 1.0000 | 1.0000 | 1|
| PART+NOUN+NEG-PART | 1.0000 | 1.0000 | 1.0000 | 1|
| CONJ+ADJ+NSUFF | 1.0000 | 0.0000 | 0.0000 | 1|
| PREP+ADJ | 1.0000 | 0.0000 | 0.0000 | 1|
| ADJ+NSUFF+PRON | 1.0000 | 0.0000 | 0.0000 | 2|
| CONJ+PROG-PART+V | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PART+PROG-PART+V+PREP+PRON+NEG-PART | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+PART+PREP+PRON+NEG-PART | 0.0000 | 1.0000 | 0.0000 | 0|
| PREP+PART+PRON | 1.0000 | 0.0000 | 0.0000 | 1|
| CONJ+ADV+NSUFF | 1.0000 | 0.0000 |0.0000 | 1|
| CONJ+ADV | 0.0000 | 1.0000 | 0.0000 | 0|
| PART+NOUN+PRON+NEG-PART | 0.0000 | 1.0000 | 0.0000 | 0|
| CONJ+ADJ | 1.0000 | 1.0000 | 1.0000 | 1|
</details>
- F-score (micro): 0.8974
- F-score (macro): 0.5188
- Accuracy (incl. no class): 0.901
Expand the details section above to show the per-class scores for each tag. Note that compound tags (a single tag covering several agglutinated parts of speech) are each scored as a separate class.
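The gap between the micro and macro F-scores follows from how the two averages are defined: micro F1 pools the counts of all classes, while macro F1 is the unweighted mean of the per-class F1 values, so many rare compound tags with an F1 of 0 pull it down. The sketch below illustrates this with toy numbers. The function names and the `(tp, fp, fn)` counts are hypothetical and are not taken from the evaluation of this model; Flair's own evaluation may also handle edge cases (such as classes with no predictions) differently.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for per-class (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)  # pool all counts, then compute one F1
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # mean of per-class F1
    return micro, macro


# Hypothetical (tp, fp, fn) counts: two frequent tags, two rare compound tags.
toy_counts = {
    "NOUN": (90, 5, 5),
    "VERB": (40, 5, 5),
    "PART+V": (0, 0, 3),
    "ADJ+CASE": (0, 0, 1),
}

micro, macro = micro_macro_f1(toy_counts)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

In this toy setting the two rare tags barely affect the micro score but halve the macro score, which mirrors the pattern in the reported numbers.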
# Citation
*If you use this model, please consider citing [this work](https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects):*
```latex
@unpublished{MMHU21,
  author = "M. Megahed",
  title = "Sequence Labeling Architectures in Diglossia",
  year = {2021},
  doi = "10.13140/RG.2.2.34961.10084",
  url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}
```