metadata
language:
- ar
- en
license: apache-2.0
datasets:
- 4Dialects
- MADAR
- CSCS
thumbnail: >-
https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- sequence-tagger-model
- token-classification
- Dialectal Arabic
- Code-Switching
- Code-mixing
metrics:
- f1
widget:
- text: طلعوا جماعة الممانعة بالسياسة ما بيعرفوا ولا بالصحة بيعرفوا ولا حتى بالدين
Arabic Flair + fastText Part-of-Speech tagging Model (Egyptian and Levant)
Pretrained Part-of-Speech tagging model built on a joint corpus written in Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects with code-switching of Egyptian Arabic and English. The model is trained using Flair (forward+backward)and fastText embeddings.
Pretraining Corpora:
This sequence labeling model was pretrained on three corpora jointly:
- 4 Dialects A Dialectal Arabic Datasets containing four dialects of Arabic, Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). Each dataset consists of a set of 350 manually segmented and PoS tagged tweets.
- UD South Levantine Arabic MADAR A Dataset with 100 manually-annotated sentences taken from the MADAR (Multi-Arabic Dialect Applications and Resources) project by Shorouq Zahra.
- Parts of the Cairo Students Code-Switch (CSCS) corpus developed for "Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus" by Hamed et al.
Usage
from flair.data import Sentence
from flair.models import SequenceTagger
tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")
sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')
tagger.predict(sentence)
for entity in sentence.get_spans('pos'):
print(entity)
Example
Citation
if you use this model in your work, please consider citing this work:
@unpublished{MMHU21
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
note = "In review",
}