metadata

language: ar
license: apache-2.0
datasets:
  - AQMAR
  - ANERcorp
thumbnail: >-
  https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
  - flair
  - Text Classification
  - token-classification
  - sequence-tagger-model
metrics:
  - f1
widget:
  - text: >-
      اختارها خيري بشارة كممثلة، دون سابقة معرفة أو تجربة تمثيلية، لتقف بجانب
      فاتن حمامة في فيلم «يوم مر ويوم حلو» (1988) وهي ما زالت شابة لم تتخطَ
      عامها الثاني

Arabic NER Model for AQMAR dataset

Training was conducted over 86 epochs, using a linear decaying learning rate of 2e-05, starting from 0.3 and a batch size of 48 with fastText and Flair forward and backward embeddings.

Original Dataset:

AQMAR

Results:

F1-score (micro) 0.9323
F1-score (macro) 0.9272

	True Posititves	False Positives	False Negatives	Precision	Recall	class-F1
LOC	164	7	13	0.9591	0.9266	0.9425
MISC	398	22	37	0.9476	0.9149	0.9310
ORG	65	6	9	0.9155	0.8784	0.8966
PER	199	13	13	0.9387	0.9387	0.9387

Usage

from flair.data import Sentence
from flair.models import SequenceTagger
import pyarabic.araby as araby
from icecream import ic

arTagger = SequenceTagger.load('megantosh/flair-arabic-MSA-aqmar')

sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')


# predict NER tags
tagger.predict(sentence)
arTagger.predict(arSentence)

# print sentence with predicted tags
ic(sentence.to_tagged_string)
ic(arSentence.to_tagged_string)

Example

see an example from a similar NER model in Flair

Model Configuration

  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('ar')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4396, out_features=4396, bias=True)
  (rnn): LSTM(4396, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=14, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2021-03-31 22:19:50,654 ----------------------------------------------------------------------------------------------------
2021-03-31 22:19:50,654 Corpus: "Corpus: 3025 train + 336 dev + 373 test sentences"
2021-03-31 22:19:50,654 ----------------------------------------------------------------------------------------------------
2021-03-31 22:19:50,654 Parameters:
2021-03-31 22:19:50,654  - learning_rate: "0.3"
2021-03-31 22:19:50,654  - mini_batch_size: "48"
2021-03-31 22:19:50,654  - patience: "3"
2021-03-31 22:19:50,654  - anneal_factor: "0.5"
2021-03-31 22:19:50,654  - max_epochs: "150"
2021-03-31 22:19:50,654  - shuffle: "True"
2021-03-31 22:19:50,654  - train_with_dev: "False"
2021-03-31 22:19:50,654  - batch_growth_annealing: "False"
2021-03-31 22:19:50,655 ------------------------------------

Due to some formatting errors, your code might appear like this.

Citation

if you use this model in your work, please consider citing this work:

@unpublished{MMHU21
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
note = "In Review",
}