metadata

language:
  - ar
  - en
license: apache-2.0
datasets:
  - 4Dialects
  - MADAR
  - CSCS
thumbnail: >-
  https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
  - flair
  - sequence-tagger-model
  - token-classification
  - Dialectal Arabic
  - Code-Switching
  - Code-mixing
metrics:
  - f1
widget:
  - text: طلعوا جماعة الممانعة بالسياسة ما بيعرفوا ولا بالصحة بيعرفوا ولا حتى بالدين

Arabic Flair + fastText Part-of-Speech Tagging Model (Egyptian and Levantine)

Pretrained Part-of-Speech tagging model built on a joint corpus written in the Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, including code-switching between Egyptian Arabic and English. The model is trained using Flair (forward + backward) and fastText embeddings.

Pretraining Corpora:

This sequence labeling model was pretrained on three corpora jointly:

4 Dialects: a Dialectal Arabic dataset collection covering four dialects of Arabic: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). Each dataset consists of a set of 350 manually segmented and PoS-tagged tweets.
UD South Levantine Arabic MADAR: a dataset with 100 manually annotated sentences taken from the MADAR (Multi-Arabic Dialect Applications and Resources) project by Shorouq Zahra.
CSCS: parts of the Cairo Students Code-Switch (CSCS) corpus developed for "Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus" by Hamed et al.

Usage

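from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pretrained tagger from the Hugging Face Hub
tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")
# Wrap the raw text in a Sentence, which tokenizes it
sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')
# Predict the PoS tags in place on the sentence
tagger.predict(sentence)
# Print each tagged span
for entity in sentence.get_spans('pos'):
    print(entity)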
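
If you need the predictions as plain Python data rather than printed spans, the snippet above can be wrapped roughly as follows. This is a minimal sketch, not part of the original model card: the helper names tag_sentence and tag_distribution are hypothetical, and it assumes a recent Flair release in which each token exposes its tag through token.get_label('pos'). The 'pos' label type is taken from the usage snippet above.

```python
from collections import Counter


def tag_sentence(tagger, text):
    """Return a list of (token, tag) pairs for one sentence.

    Hypothetical helper; assumes a recent Flair release where
    token.get_label("pos") returns the predicted label of a token.
    """
    # Imported lazily so the rest of this block works without Flair installed.
    from flair.data import Sentence

    sentence = Sentence(text)
    tagger.predict(sentence)
    return [(token.text, token.get_label("pos").value) for token in sentence]


def tag_distribution(tagged_tokens):
    """Count how often each tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)
```

With the tagger loaded as shown above, tag_distribution(tag_sentence(tagger, text)) gives a quick per-sentence overview of the predicted tags, which can help when spot-checking output on code-switched input.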

Example

Citation

If you use this model in your work, please consider citing:

@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
note = "In review",
}