---
language:
- ar
- en
license: apache-2.0
datasets:
- 4Dialects
- MADAR
- CSCS
thumbnail: https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- PoS-Tagging
- sequence labeling
- Token Classification
- Dialectal Arabic
- Code-Switching
- Code-mixing
metrics:
- f1
widget:
- text: "لائحة «الوطنية للصحافة».. خطوة جديدة في طريق «الحصار»"
---

# Arabic Flair + fastText Part-of-Speech tagging Model (Egyptian and Levant)

Pretrained Part-of-Speech tagging model built on a joint corpus of Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, including code-switching between Egyptian Arabic and English. The model is trained using [Flair](https://aclanthology.org/C18-1139/) (forward + backward) and [fastText](https://fasttext.cc) embeddings.

# Pretraining Corpora:

This sequence labeling model was pretrained on three corpora jointly:

1. [4 Dialects](https://huggingface.co/datasets/viewer/?dataset=arabic_pos_dialect): a dialectal Arabic dataset covering four dialects of Arabic: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). Each dialect's set consists of 350 manually segmented and PoS-tagged tweets.
2. [UD South Levantine Arabic MADAR](https://universaldependencies.org/treebanks/ajp_madar/index.html): a dataset of 100 manually annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project by [Shorouq Zahra](mailto:shorouqjzahra@gmail.com).
3. Parts of the Cairo Students Code-Switch (CSCS) corpus developed for ["Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus"](https://aclanthology.org/L18-1601.pdf) by Hamed et al.

# Usage

# Example

# Citation

*If you use this model in your work, please consider citing this work:*

```latex
@unpublished{MMHU21,
  author = "M. Megahed and A. Akbik",
  title = "Sequence Labeling Architectures in Diglossia",
  note = "In preparation",
}
```
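
# Minimal loading sketch

Since the model is trained with Flair, it can presumably be loaded through Flair's `SequenceTagger`. The following is a minimal sketch under that assumption, not an official usage recipe. `MODEL_ID` is a hypothetical placeholder: this card does not state the model's Hub identifier, so it must be replaced with the real ID or a local checkpoint path. The helper name `tag_text` is likewise illustrative only.

```python
# Hypothetical placeholder: replace with this model's Hub ID or a local path
# to its checkpoint. The model card does not state the identifier.
MODEL_ID = "<hub-id-or-local-path>"


def tag_text(text, model_id=MODEL_ID):
    """Return `text` as a string with the tagger's predicted labels attached."""
    if model_id.startswith("<"):
        raise ValueError(
            "Set model_id to the model's Hub ID or a local checkpoint path."
        )

    # Imported lazily so this module can be loaded where flair is not installed.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(model_id)  # may download the model on first use
    sentence = Sentence(text)               # tokenizes the raw text
    tagger.predict(sentence)                # annotates the sentence in place
    return sentence.to_tagged_string()


# Example call, using the widget sentence from the metadata above
# (uncomment after setting a real model ID):
# print(tag_text("لائحة «الوطنية للصحافة».. خطوة جديدة في طريق «الحصار»", "your-model-id"))
```

The exact format of the tagged string depends on the installed Flair version, so no expected output is shown here.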