metadata
language: ar
license: apache-2.0
datasets:
  - AQMAR
  - ANERcorp
thumbnail: >-
  https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
  - flair
  - Text Classification
  - token-classification
  - sequence-tagger-model
metrics:
  - f1
widget:
  - text: >-
      اختارها خيري بشارة كممثلة، دون سابقة معرفة أو تجربة تمثيلية، لتقف بجانب
      فاتن حمامة في فيلم «يوم مر ويوم حلو» (1988) وهي ما زالت شابة لم تتخطَ
      عامها الثاني

Arabic NER Model for AQMAR dataset

Training ran for 86 epochs with a batch size of 48. The learning rate started at 0.3 and decayed to 2e-05. The model stacks fastText word embeddings with Flair forward and backward embeddings.
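The training log in the Model Configuration section records learning_rate 0.3, anneal_factor 0.5 and patience 3, which suggests the learning rate was reduced whenever the dev score stopped improving. The following is a minimal sketch of that kind of plateau-based annealing, not the trainer's actual code. The function name `anneal_on_plateau` and the dev scores are hypothetical, and the exact bookkeeping inside Flair may differ.

```python
def anneal_on_plateau(dev_scores, initial_lr=0.3, anneal_factor=0.5, patience=3):
    """Return the learning rate used in each epoch.

    Simplified assumption: the rate is multiplied by `anneal_factor` once the
    dev score has failed to improve for more than `patience` epochs in a row.
    """
    lr = initial_lr
    best = float("-inf")
    bad_epochs = 0
    schedule = []
    for score in dev_scores:
        schedule.append(lr)  # rate in effect during this epoch
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= anneal_factor
            bad_epochs = 0
    return schedule


# Hypothetical dev scores: improvement stalls after the second epoch.
schedule = anneal_on_plateau([0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6])
print(schedule)
```

With these made-up scores the rate stays at 0.3 for six epochs and is halved for the seventh.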

Original Dataset: AQMAR

Results:

  • F1-score (micro) 0.9323
  • F1-score (macro) 0.9272
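| Class | True Positives | False Positives | False Negatives | Precision | Recall | class-F1 |
|-------|----------------|-----------------|-----------------|-----------|--------|----------|
| LOC   | 164            | 7               | 13              | 0.9591    | 0.9266 | 0.9425   |
| MISC  | 398            | 22              | 37              | 0.9476    | 0.9149 | 0.9310   |
| ORG   | 65             | 6               | 9               | 0.9155    | 0.8784 | 0.8966   |
| PER   | 199            | 13              | 13              | 0.9387    | 0.9387 | 0.9387   |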
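The micro and macro F1 scores can be recomputed from the per-class counts reported above. The sketch below uses the standard definitions: micro F1 pools the counts across classes, and macro F1 averages the per-class F1 values. The variable names are illustrative only.

```python
# (true positives, false positives, false negatives) per class, as reported above
counts = {
    "LOC": (164, 7, 13),
    "MISC": (398, 22, 37),
    "ORG": (65, 6, 9),
    "PER": (199, 13, 13),
}


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean of P and R."""
    return 2 * tp / (2 * tp + fp + fn)


# Per-class F1, then the unweighted mean over classes (macro)
class_f1 = {label: f1(*c) for label, c in counts.items()}
macro_f1 = sum(class_f1.values()) / len(class_f1)

# Pool the counts over all classes before computing F1 (micro)
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = f1(total_tp, total_fp, total_fn)

print(f"micro F1: {micro_f1:.4f}")  # 0.9323
print(f"macro F1: {macro_f1:.4f}")  # 0.9272
```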

Usage

from flair.data import Sentence
from flair.models import SequenceTagger

# load the Arabic NER tagger from the Hugging Face model hub
arTagger = SequenceTagger.load('megantosh/flair-arabic-MSA-aqmar')

arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')

# predict NER tags
arTagger.predict(arSentence)

# print sentence with predicted tags
print(arSentence.to_tagged_string())

Example

See an example from a similar NER model in Flair.

Model Configuration

Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('ar')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4396, out_features=4396, bias=True)
  (rnn): LSTM(4396, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=14, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2021-03-31 22:19:50,654 ----------------------------------------------------------------------------------------------------
2021-03-31 22:19:50,654 Corpus: "Corpus: 3025 train + 336 dev + 373 test sentences"
2021-03-31 22:19:50,654 ----------------------------------------------------------------------------------------------------
2021-03-31 22:19:50,654 Parameters:
2021-03-31 22:19:50,654  - learning_rate: "0.3"
2021-03-31 22:19:50,654  - mini_batch_size: "48"
2021-03-31 22:19:50,654  - patience: "3"
2021-03-31 22:19:50,654  - anneal_factor: "0.5"
2021-03-31 22:19:50,654  - max_epochs: "150"
2021-03-31 22:19:50,654  - shuffle: "True"
2021-03-31 22:19:50,654  - train_with_dev: "False"
2021-03-31 22:19:50,654  - batch_growth_annealing: "False"
2021-03-31 22:19:50,655 ------------------------------------
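The layer sizes in the configuration above follow from how the embeddings are stacked. As a quick check, the sketch below reproduces the arithmetic. The 300-dimensional size for the fastText `WordEmbeddings('ar')` is an assumption: it is inferred from 4396 minus the two 2048-dimensional Flair language models, and it matches the standard fastText vector size.

```python
# Assumption: fastText word vectors are 300-dimensional (not printed in the dump)
FASTTEXT_DIM = 300
# From the dump: each Flair language model has an LSTM with hidden size 2048
FLAIR_LM_HIDDEN = 2048
# From the dump: the tagger's BiLSTM has hidden size 256 per direction
TAGGER_RNN_HIDDEN = 256

# Stacked embeddings concatenate fastText + forward LM + backward LM
stacked_dim = FASTTEXT_DIM + 2 * FLAIR_LM_HIDDEN

# A bidirectional LSTM concatenates the forward and backward hidden states
bilstm_out_dim = 2 * TAGGER_RNN_HIDDEN

print(stacked_dim)     # 4396, the in_features of embedding2nn and the tagger LSTM
print(bilstm_out_dim)  # 512, the in_features of the final linear layer
```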

Due to some formatting errors, your code might appear like this.

Citation

If you use this model in your work, please consider citing:

@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
note = "In Review",
}