metadata
language:
  - ar
  - en
license: apache-2.0
datasets:
  - 4Dialects
  - MADAR
  - CSCS
thumbnail: >-
  https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
  - flair
  - PoS-Tagging
  - sequence labeling
  - Token Classification
  - Dialectal Arabic
  - Code-Switching
  - Code-mixing
metrics:
  - f1
widget:
  - text: لائحة «الوطنية للصحافة».. خطوة جديدة في طريق «الحصار»

Arabic Flair + fastText Part-of-Speech Tagging Model (Egyptian and Levantine)

Pretrained part-of-speech tagging model built on a joint corpus written in Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, with code-switching between Egyptian Arabic and English. The model is trained using Flair (forward and backward) and fastText embeddings.

Pretraining Corpora:

This sequence labeling model was pretrained on three corpora jointly:

  1. 4 Dialects: Dialectal Arabic datasets covering four dialects, namely Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). Each dataset consists of 350 manually segmented and PoS-tagged tweets.
  2. UD South Levantine Arabic MADAR: a dataset of 100 manually annotated sentences taken from the MADAR (Multi-Arabic Dialect Applications and Resources) project, by Shorouq Zahra.
  3. Parts of the Cairo Students Code-Switch (CSCS) corpus developed for "Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus" by Hamed et al.

Usage

Example
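
The following is a minimal sketch of how a Flair sequence tagger such as this one is typically loaded and applied. It is not taken from the original model card. The model identifier is deliberately left as a command-line argument because it is not stated in this document, and the helper names `split_lines` and `tag_lines` and the script name `tag.py` are hypothetical. It assumes the `flair` package is installed.

```python
import sys


def split_lines(text):
    """Return the non-empty, stripped lines of `text`, one sentence per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def tag_lines(model_id, lines):
    """Load a Flair tagger and return one tagged string per input line."""
    # Imported lazily so that split_lines works without Flair installed.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    # model_id is an assumption: a Hub model ID or a local path to a model file.
    tagger = SequenceTagger.load(model_id)
    sentences = [Sentence(line) for line in lines]
    tagger.predict(sentences)  # annotates the Sentence objects in place
    return [sentence.to_tagged_string() for sentence in sentences]


if __name__ == "__main__":
    # Hypothetical invocation: python tag.py <model-id-or-path> "<text>"
    if len(sys.argv) == 3:
        for tagged in tag_lines(sys.argv[1], split_lines(sys.argv[2])):
            print(tagged)
    else:
        print("usage: python tag.py <model-id-or-path> <text>")
```

Passing the sentences to `predict` as a list lets Flair batch them. The exact layout of the tagged string depends on the installed Flair version.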

Citation

If you use this model in your work, please consider citing:

@unpublished{MMHU21,
author = "M. Megahed and A. Akbik",
title = "Sequence Labeling Architectures in Diglossia",
note = "In preparation",
}